[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-18356:
--------------------------------------
    Summary: KMeans should cache RDD before training  (was: Issue + Resolution: Kmeans Spark Performances (ML package))

> KMeans should cache RDD before training
> ---------------------------------------
>
>                 Key: SPARK-18356
>                 URL: https://issues.apache.org/jira/browse/SPARK-18356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: zakaria hili
>            Assignee: zakaria hili
>            Priority: Minor
>              Labels: easyfix
>             Fix For: 2.2.0
>
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect
> Spark KMeans performance.
> Before explaining the problem, I want to describe the warning that I
> faced.
> I tried to use Spark KMeans with DataFrames to cluster my data:
>     from pyspark.ml.clustering import KMeans
>     df_Part = assembler.transform(df_Part)
>     df_Part.cache()
>     # seuilStop is the WSSSE threshold below which the search stops
>     while (k <= max_cluster) and (wssse > seuilStop):
>         kmeans = KMeans().setK(k)
>         model = kmeans.fit(df_Part)
>         wssse = model.computeCost(df_Part)
>         k = k + 1
> but when I run the code I receive the warning:
> WARN KMeans: The input data is not directly cached, which may hurt
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and
> realized there are two classes responsible for this warning:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>
> When my DataFrame is cached, the fit method transforms it into an
> internal RDD which is not cached.
> DataFrame -> rdd -> run Training Kmeans Algo(rdd)
> -> The first class (ml package) is responsible for converting the DataFrame
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here
> Spark verifies whether the RDD is cached; if not, a warning is generated.
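The flow described above (DataFrame -> rdd -> training) can be sketched without Spark. The following is a toy model only: `ToyDataFrame`, `ToyRDD`, `ml_style_fit` and `mllib_style_run` are hypothetical names invented for illustration, not Spark classes or methods. It assumes, as the report states, that the conversion produces a new object that does not inherit the DataFrame's cached state, and that the inner algorithm checks only the object it is handed.

```python
import warnings


class ToyRDD:
    """Hypothetical stand-in for an RDD: tracks only whether it is cached."""

    def __init__(self, rows, cached=False):
        self.rows = rows
        self.cached = cached


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame with its own cache flag."""

    def __init__(self, rows):
        self.rows = rows
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def to_rdd(self):
        # The conversion builds a new object; the DataFrame's cache flag
        # is not carried over to it (the assumption taken from the report).
        return ToyRDD([tuple(r) for r in self.rows])


def mllib_style_run(rdd):
    """Inner algorithm: warns when it is handed an uncached RDD."""
    if not rdd.cached:
        warnings.warn("The input data is not directly cached")
    return len(rdd.rows)  # placeholder for the real training result


def ml_style_fit(df):
    """Outer wrapper: converts the DataFrame, then delegates to the inner run."""
    return mllib_style_run(df.to_rdd())
```

In this model, calling `ml_style_fit(ToyDataFrame(rows).cache())` still triggers the warning, because the check sees the freshly derived `ToyRDD` rather than the cached `ToyDataFrame`.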
> So, the solution to this problem is to cache the RDD before running the
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the DataFrame transformation, then uncache it after
> the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
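The proposed two-line fix (cache before training, uncache afterwards) can be illustrated with a minimal sketch. This is not the patch attached to the issue and does not use Spark: `LazyDataset`, `iterative_passes`, `fit` and `handle_persistence` are hypothetical names, and the "training" is just a repeated sum standing in for any algorithm that makes one full pass over the data per iteration. The sketch assumes the wrapper should persist only when the caller has not already done so, and should release only what it persisted itself.

```python
class LazyDataset:
    """Hypothetical lazy dataset: recomputes its rows on every pass unless persisted."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument callable producing the rows
        self._stored = None          # materialized rows, when persisted
        self.compute_calls = 0       # how many times the source was recomputed

    def _rows_from_source(self):
        self.compute_calls += 1
        return list(self._compute())

    @property
    def is_persisted(self):
        return self._stored is not None

    def persist(self):
        if self._stored is None:
            self._stored = self._rows_from_source()
        return self

    def unpersist(self):
        self._stored = None
        return self

    def rows(self):
        if self._stored is not None:
            return self._stored
        return self._rows_from_source()


def iterative_passes(data, n_iter):
    """Stand-in for an iterative trainer: one full pass over the data per iteration."""
    total = 0
    for _ in range(n_iter):
        total = sum(data.rows())
    return total


def fit(data, n_iter=10):
    """Persist only if the caller has not, train, then release what we persisted."""
    handle_persistence = not data.is_persisted
    if handle_persistence:
        data.persist()
    try:
        return iterative_passes(data, n_iter)
    finally:
        if handle_persistence:
            data.unpersist()
```

With ten iterations, calling `iterative_passes` directly on an unpersisted `LazyDataset` recomputes the source ten times, while `fit` recomputes it once and leaves the dataset unpersisted afterwards. In Spark itself the analogous operations would presumably be `persist()` and `unpersist()` on the intermediate RDD, guarded by a check of its storage level.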