[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161992#comment-16161992 ]
Joseph K. Bradley commented on SPARK-21799:
-------------------------------------------

Now that I've caught up on these, this is just a special case of the bug in [SPARK-18608]. I'm going to close this issue and ask that a PR like [~podongfeng]'s original PR be sent for [SPARK-18608], fixing the use of {{dataset.rdd.getStorageLevel}}. I think we should fix it for all algorithms, not just K-Means.

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> ----------------------------------------------------------
>
>                 Key: SPARK-21799
>                 URL: https://issues.apache.org/jira/browse/SPARK-21799
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 1000000 examples, 10000 features
> After talking to [~josephkb], the issue seems related to the changes in
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so
> `handlePersistence` is true even when KMeans is run on a cached DataFrame.
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel`
> returns the correct result after calling `df.cache()`, so I'd suggest
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in
> MLlib algorithms (the same pattern shows up in LogisticRegression,
> LinearRegression, and others).
> I've verified this behavior in [this
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]
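
For anyone who wants to reproduce the storage-level mismatch described in the issue, here is a minimal sketch. It is not the MLlib code itself. The object name, the local master setting, and the tiny stand-in dataset are assumptions made for illustration; only `df.cache()`, `df.rdd.getStorageLevel`, and `df.storageLevel` come from the report. It assumes Spark 2.1 or later on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical driver object, for illustration only.
object StorageLevelCheck {
  def main(args: Array[String]): Unit = {
    // Local session; master and app name are arbitrary choices.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("storage-level-check")
      .getOrCreate()

    // Small stand-in dataset (the report used 1000000 examples, 10000 features).
    val df = spark.range(0L, 1000L).toDF("id")
    df.cache()
    df.count() // run an action so the cache is materialized

    // Pattern named in the issue: look at the RDD derived from the DataFrame.
    val rddLevel = df.rdd.getStorageLevel
    val handlePersistenceViaRdd = rddLevel == StorageLevel.NONE

    // Suggested replacement: ask the DataFrame itself (Dataset.storageLevel).
    val dfLevel = df.storageLevel
    val handlePersistenceViaDf = dfLevel == StorageLevel.NONE

    println(s"df.rdd.getStorageLevel = $rddLevel, handlePersistence = $handlePersistenceViaRdd")
    println(s"df.storageLevel        = $dfLevel, handlePersistence = $handlePersistenceViaDf")

    spark.stop()
  }
}
```

Going by the issue description, the first check should report `handlePersistence = true` on a cached DataFrame and the second should report `false`. That difference is what leads an algorithm to persist a second copy of its input.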