[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

zakaria hili (JIRA) Tue, 15 Nov 2016 04:41:44 -0800

    [ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667034#comment-15667034 ]


zakaria hili commented on SPARK-18356:
--------------------------------------

Sorry for the late reply.
Sean Owen, unfortunately I don't have a cluster on which to test the 
performance, but I think that Joseph K. Bradley was right: caching the 
DataFrame is the best solution because, as mentioned before, the caching 
operation is more expensive than generating the RDD from a cached DataFrame.
Imagine that we have a huge cached DataFrame: if we also tried to cache the 
RDD, it would take a lot of time and memory, and it could cause an 
OutOfMemory exception.

yuhao yang, I don't know about other algorithms, but for the KMeans 
algorithm, Spark doesn't cache the RDD.

> Issue + Resolution: Kmeans Spark Performances (ML package)
> ----------------------------------------------------------
>
>                 Key: SPARK-18356
>                 URL: https://issues.apache.org/jira/browse/SPARK-18356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: zakaria hili
>            Priority: Minor
>              Labels: easyfix
>
> Hello,
> I'm a newbie in Spark, but I think that I have found a small problem that can 
> affect Spark KMeans performance.
> Before explaining the problem, I want to describe the warning that I 
> encountered.
> I tried to use Spark KMeans with DataFrames to cluster my data:
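> from pyspark.ml.clustering import KMeans
>
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> k = 2                  # starting value (assumed; not shown in the original)
> wssse = float("inf")   # assumed initial value so that the loop is entered
> # Increase k until the cost falls below the threshold or k exceeds max_cluster
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)  # within-set sum of squared errors
>     k = k + 1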
> But when I run the code, I receive this warning:
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code for the origin of this warning and 
> realized that two classes are responsible for it: 
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
>  
> When my DataFrame is cached, the fit method internally converts it into an 
> RDD, which is not cached.
> DataFrame -> RDD -> run KMeans training algorithm (RDD)
> -> The first class (ml package) is responsible for converting the DataFrame 
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm. Here 
> Spark checks whether the RDD is cached and, if it is not, logs a warning.
> So the solution to this problem is to cache the RDD before running the 
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines: 
> cache the RDD just after the DataFrame conversion, then unpersist it after 
> training.
> I hope that this was clear.
> If you think that I am wrong, please let me know.
> Sincerely,
> Zakaria HILI
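
To make the two positions in this thread concrete, here is a minimal pure-Python sketch. It does not use Spark: ToyRDD, ToyDataFrame, train, fit and the only_if_input_uncached flag are all hypothetical stand-ins invented for illustration, not Spark APIs. With the default arguments it mirrors the two-line change proposed in the issue (cache the derived RDD before training, release it afterwards). With only_if_input_uncached=True it shows a variant, which is my assumption and is not stated in the thread, that skips the extra cache when the input DataFrame is already cached, so the data is not held in memory twice.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; it only tracks whether it is cached."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame of feature vectors."""

    def __init__(self, rows, cached=False):
        self.rows = list(rows)
        self.is_cached = cached

    def to_rdd(self):
        # Like the DataFrame -> RDD step in fit(): the result starts uncached,
        # even when the DataFrame itself is cached.
        return ToyRDD(self.rows)


def train(rdd, warnings):
    """Stand-in for the mllib-side training; records a warning on uncached input."""
    if not rdd.is_cached:
        warnings.append("The input data is not directly cached")
    return len(rdd.rows)  # placeholder "model": just the row count


def fit(df, warnings, only_if_input_uncached=False):
    """Cache the derived RDD around training, then always release it."""
    rdd = df.to_rdd()
    # Default: always cache the RDD (the proposal in the issue).
    # Variant: skip the cache when the parent DataFrame is already cached.
    handle_persistence = not (only_if_input_uncached and df.is_cached)
    if handle_persistence:
        rdd.persist()
    try:
        model = train(rdd, warnings)
    finally:
        # Release the cache even if training raises.
        if handle_persistence:
            rdd.unpersist()
    return model, rdd
```

In this toy model, the default call on a cached ToyDataFrame records no warning and leaves the RDD uncached once fit returns. The variant still records the warning for a cached input, because the RDD itself is never cached; that corresponds to the case the warning text describes as harmless, since the parent data is cached.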



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
