zakaria hili created SPARK-18356:
------------------------------------

             Summary: Kmeans Spark Performances (ML package)
                 Key: SPARK-18356
                 URL: https://issues.apache.org/jira/browse/SPARK-18356
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.0.1, 2.0.0
            Reporter: zakaria hili
            Priority: Minor

Hello,

I'm new to Spark, but I think I have found a small problem that can affect the performance of Spark KMeans.

Before explaining the problem, I want to describe the warning I ran into. I tried to use Spark KMeans with DataFrames to cluster my data:

    df_Part = assembler.transform(df_Part)
    df_Part.cache()
    while (k<=max_cluster) and (wssse > seuilStop):
        kmeans = KMeans().setK(k)
        model = kmeans.fit(df_Part)
        wssse = model.computeCost(df_Part)
        k=k+1

When I run this code, I receive the warning:

    WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.

I searched the Spark source code for the origin of this warning and found that two classes are involved:

    (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala)
    (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)

Even though my DataFrame is cached, the fit method converts it internally into an RDD, and that RDD is not cached:

    Dataframe -> rdd -> run Training Kmeans Algo(rdd)

-> The first class (ml package) is responsible for converting the DataFrame into an RDD and then calling the KMeans algorithm.
-> The second class (mllib package) implements the KMeans algorithm. This is where Spark checks whether the RDD is cached; if it is not, the warning is logged.

So the solution is to cache the RDD before running the KMeans algorithm:

https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

All we need is to add two lines: cache the RDD right after the DataFrame conversion, then unpersist it once training has finished.

I hope this was clear. If you think I am wrong, please let me know.

Sincerely,
Zakaria HILI
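
To illustrate the proposed change, here is a minimal pure-Python sketch of the "cache before training, release afterwards" pattern. It does not use Spark and is not the actual patch: `StubRDD`, `cached`, `train` and `fit` are hypothetical stand-ins, and the "model" is just a mean so that the control flow can be run anywhere. The one design choice it tries to show is that the intermediate RDD is persisted only when nobody has cached it already, and is released even if training fails.

```python
from contextlib import contextmanager


class StubRDD:
    """Hypothetical stand-in for a Spark RDD; it only tracks a cache flag."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


@contextmanager
def cached(rdd):
    """Persist rdd for the duration of the block, unless it is already cached.

    The data is released afterwards, even if the block raises, and a cache
    that the caller set up beforehand is left untouched.
    """
    we_cached = not rdd.is_cached
    if we_cached:
        rdd.persist()
    try:
        yield rdd
    finally:
        if we_cached:
            rdd.unpersist()


def train(rdd, log):
    """Stand-in for the mllib training step: it warns on uncached input."""
    if not rdd.is_cached:
        log.append("WARN KMeans: The input data is not directly cached")
    return sum(rdd.rows) / len(rdd.rows)  # placeholder "model"


def fit(rows, log):
    """Stand-in for the ml-package fit: convert, cache, train, release."""
    rdd = StubRDD(rows)        # DataFrame -> RDD conversion yields an uncached RDD
    with cached(rdd):          # proposed line 1: cache before training
        model = train(rdd, log)
    return model, rdd          # proposed line 2 (unpersist) has already run here
```

With this structure, `fit([1.0, 2.0, 3.0], log)` leaves `log` empty because `train` always sees cached input, and the stub RDD is no longer cached once `fit` returns.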