zakaria hili created SPARK-18356:
------------------------------------

             Summary: Kmeans Spark Performances (ML package)
                 Key: SPARK-18356
                 URL: https://issues.apache.org/jira/browse/SPARK-18356
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.0.1, 2.0.0
            Reporter: zakaria hili
            Priority: Minor


Hello,

I'm new to Spark, but I think I have found a small problem that can affect 
Spark KMeans performance.
Before explaining the problem, I want to describe the warning that I 
encountered.

I tried to use Spark KMeans with DataFrames to cluster my data:

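from pyspark.ml.clustering import KMeans

# k, wssse, max_cluster and seuilStop are initialized earlier (not shown).
df_Part = assembler.transform(df_Part)
df_Part.cache()
# Increase k until the cost falls to the threshold or k exceeds max_cluster.
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1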


But when I run the code, I receive this warning:
WARN KMeans: The input data is not directly cached, which may hurt performance 
if its parent RDDs are also uncached.

I searched the Spark source code to find the origin of this warning, and I 
realized that two classes are involved: 
(mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
(mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
 
Even when my DataFrame is cached, the fit method converts it into an 
internal RDD, which is not cached:
DataFrame -> RDD -> run the KMeans training algorithm (RDD)

-> The first class (ml package) converts the DataFrame into an RDD and then 
calls the KMeans algorithm.
-> The second class (mllib package) implements the KMeans algorithm; here Spark 
checks whether the RDD is cached and, if it is not, logs the warning.
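
The two-step flow described above can be modelled without Spark. The following is only a toy sketch: ToyDataset, core_kmeans_run and ml_fit are hypothetical names invented for illustration, not Spark APIs, and the cache flag stands in for Spark's storage level. It shows why caching the parent does not silence the warning.

```python
class ToyDataset:
    """Hypothetical stand-in for a DataFrame/RDD; tracks only a cache flag."""

    def __init__(self, rows, cached=False):
        self.rows = list(rows)
        self.cached = cached

    def cache(self):
        self.cached = True
        return self

    def map(self, fn):
        # A transformation yields a NEW dataset that is not cached,
        # even when its parent is.
        return ToyDataset((fn(r) for r in self.rows), cached=False)


def core_kmeans_run(data, log):
    """Plays the role of the mllib class: inspects the cache flag of its input."""
    if not data.cached:
        log.append("WARN: The input data is not directly cached")
    # (the actual training would happen here)


def ml_fit(df, log):
    """Plays the role of the ml class: converts the input, then delegates."""
    vectors = df.map(lambda row: tuple(row))  # stands for DataFrame -> RDD
    core_kmeans_run(vectors, log)


log = []
df = ToyDataset([[0.0, 1.0], [2.0, 3.0]]).cache()
ml_fit(df, log)
print(df.cached, log)
```

In this model the parent stays cached, yet the derived dataset handed to the core algorithm is not, so the check still fires.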

So the solution to this problem is to cache the RDD before running the KMeans 
algorithm.
https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
All we need is to add two lines:
cache the RDD just after the DataFrame conversion, then unpersist it after 
training.
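
As a rough illustration of the persist / train / unpersist pattern proposed here, the sketch below uses a toy class instead of Spark. ToyRDD, train and fit_with_caching are hypothetical names made up for this example; the real change would live in the Scala fit method, which this code does not reproduce.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; tracks only whether it is persisted."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


def train(rdd, events):
    """Placeholder for the training step; records whether its input was persisted."""
    events.append(("train", rdd.persisted))
    return "model"  # placeholder result, not a real model


def fit_with_caching(rows, events):
    rdd = ToyRDD(rows)   # stands for the DataFrame -> RDD conversion
    rdd.persist()        # added line 1: cache right after the conversion
    try:
        return train(rdd, events)
    finally:
        rdd.unpersist()  # added line 2: release the cache after training
        events.append(("after", rdd.persisted))


events = []
model = fit_with_caching([(0.0, 1.0), (2.0, 3.0)], events)
print(model, events)
```

The try/finally is a design choice of this sketch, so the cache is released even if training raises; the proposal itself only mentions the two added calls.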


I hope this was clear.
If you think I am wrong, please let me know.

Sincerely,
Zakaria HILI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
