[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15683700#comment-15683700
 ] 

Apache Spark commented on SPARK-18356:
--

User 'ZakariaHili' has created a pull request for this issue:
https://github.com/apache/spark/pull/15965

> Issue + Resolution: Kmeans Spark Performances (ML package)
> --
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.0, 2.0.1
> Reporter: zakaria hili
> Priority: Minor
> Labels: easyfix
>
> Hello,
> I'm new to Spark, but I think I have found a small problem that can affect
> the performance of Spark KMeans.
> Before explaining the problem, I want to describe the warning that I
> ran into.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k<=max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k=k+1
> But when I run the code I receive this warning:
> WARN KMeans: The input data is not directly cached, which may hurt
> performance if its parent RDDs are also uncached.
> I searched the Spark source code for the origin of this warning and
> found that two classes are involved:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>
> Even when my DataFrame is cached, the fit method transforms it into an
> internal RDD which is not cached.
> DataFrame -> RDD -> run KMeans training algorithm (RDD)
> -> The first class (ml package) is responsible for converting the DataFrame
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here
> Spark checks whether the RDD is cached; if not, a warning is generated.
> So the solution to this problem is to cache the RDD before running the KMeans
> algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the DataFrame transformation, then unpersist it after
> training.
> I hope that I was clear.
> If you think that I am wrong, please let me know.
> Sincerely,
> Zakaria HILI
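
The loop in the quoted description never initialises k or wssse. A minimal sketch of the same search with explicit starting values follows. It is an illustration only: `choose_k`, `fit_cost` and `k_start` are hypothetical names, and `fit_cost(k)` stands for whatever trains a model with k clusters and returns its WSSSE (in the description, `KMeans().setK(k).fit(df_Part)` followed by `computeCost(df_Part)`).

```python
import math


def choose_k(fit_cost, max_cluster, stop_threshold, k_start=2):
    """Increase k until the cost reaches stop_threshold or k exceeds max_cluster.

    fit_cost(k) is a hypothetical callable that trains a model with k clusters
    and returns its WSSSE. Returns the list of (k, wssse) pairs that were tried.
    """
    k = k_start
    wssse = math.inf  # guarantees that at least one model is fitted
    history = []
    while k <= max_cluster and wssse > stop_threshold:
        wssse = fit_cost(k)
        history.append((k, wssse))
        k += 1
    return history


# Toy cost that shrinks as k grows; a real run would fit KMeans instead.
print(choose_k(lambda k: 100.0 / k, max_cluster=10, stop_threshold=30.0))
```

Returning the whole history rather than only the last value keeps every fitted (k, wssse) pair available for inspection, which is a design choice of this sketch and not something the description prescribes.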



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15683669#comment-15683669
 ] 

Apache Spark commented on SPARK-18356:
--

User 'ZakariaHili' has created a pull request for this issue:
https://github.com/apache/spark/pull/15964




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678196#comment-15678196
 ] 

yuhao yang commented on SPARK-18356:


Of course I would not mind. You're more than welcome to send a PR for this. If 
you haven't already, please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for 
some tips.




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-18 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15676337#comment-15676337
 ] 

zakaria hili commented on SPARK-18356:
--

If you don't mind, yes.




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-17 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675460#comment-15675460
 ] 

yuhao yang commented on SPARK-18356:


I assume the performance improvement depends on the computation costs of 
uncached RDD lineage. Do you plan to send a PR for the improvement?




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-17 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674161#comment-15674161
 ] 

zakaria hili commented on SPARK-18356:
--

Hi [~yuhaoyan],
I tried to improve KMeans using the same caching approach as in Logistic 
Regression:
https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L310
Here are my performance results.
I used only one VM (local mode) with Python:
-> Spark without the improvement: training takes ~0.605 s (mean value)
-> Spark with the improved KMeans: ~0.518 s (the warning disappeared)
So the gain is small, but we may see a bigger difference 
if we run the train method many times.

What do you think?
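
To put the two mean values in relative terms, and to show one way such a mean could be measured, here is a small sketch. It is not the script used for the numbers above: `mean_runtime` and `relative_gain` are hypothetical helper names, and `train` stands for any zero-argument callable, for example `lambda: kmeans.fit(df_Part)`.

```python
import time
from statistics import mean


def mean_runtime(train, repeats=10):
    """Call train() `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        train()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def relative_gain(baseline, improved):
    """Fraction of the baseline runtime saved by the improved version."""
    return (baseline - improved) / baseline


# Mean values reported in this comment, in seconds.
gain = relative_gain(0.605, 0.518)
print(f"{gain:.1%}")  # about 14.4% less time per fit
```

On a single local VM a difference of this size may be within run-to-run noise, so repeating the measurement with a larger `repeats` value would make the comparison more convincing.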






[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-16 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670894#comment-15670894
 ] 

zakaria hili commented on SPARK-18356:
--

[~yuhaoyan] Thank you.
I will try to improve KMeans in my free time, then we can discuss the 
results.




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-15 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667692#comment-15667692
 ] 

yuhao yang commented on SPARK-18356:


Hi [~zahili], if you're interested in an improvement for this, maybe you can 
refer to 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L317
 for an example, which can probably help here.




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-15 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667034#comment-15667034
 ] 

zakaria hili commented on SPARK-18356:
--

Sorry for the late reply.
Sean Owen, unfortunately I don't have a cluster to test the performance on, but 
I think that Joseph K. Bradley was right: caching the DataFrame is the best 
solution because, as mentioned before, the caching operation is more expensive 
than generating the RDD from a cached DataFrame.
If we imagine that we have a huge cached DataFrame and we try to cache the 
RDD as well, it will take a lot of time and memory, and it can cause an 
OutOfMemory exception.

yuhao yang, I don't know about other algorithms, but for the KMeans algorithm, 
Spark doesn't cache the RDD.




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662092#comment-15662092
 ] 

yuhao yang commented on SPARK-18356:


Checking and caching the training data is quite common in MLlib algorithms. 
Some algorithms (LR, ANN) persist the RDD data if the parent DataFrame is 
not cached (using a variable handlePersistence). We can refer to that for the 
implementation.
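
The handlePersistence idea can be sketched without Spark as follows. This is only an illustration of the pattern, not the MLlib code: `train_with_persistence` and `StubRDD` are hypothetical names, and the stub merely mimics the `persist()` and `unpersist()` methods that a real RDD provides.

```python
class StubRDD:
    """Hypothetical stand-in exposing the persist()/unpersist() methods of an RDD."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = []

    def persist(self):
        self.calls.append("persist")
        return self

    def unpersist(self):
        self.calls.append("unpersist")
        return self


def train_with_persistence(rdd, parent_is_cached, train):
    """Persist rdd around train(rdd) only when the parent data is not cached."""
    handle_persistence = not parent_is_cached
    if handle_persistence:
        rdd.persist()
    try:
        return train(rdd)
    finally:
        # Release the cache even if training raises.
        if handle_persistence:
            rdd.unpersist()


rdd = StubRDD([1.0, 2.0, 3.0])
result = train_with_persistence(rdd, parent_is_cached=False, train=lambda r: sum(r.rows))
print(result, rdd.calls)  # 6.0 ['persist', 'unpersist']
```

When the parent DataFrame is already cached, the helper leaves the RDD alone, so the data is not stored twice. In that case a cache check made further down on the RDD itself would still see an uncached RDD.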




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15659416#comment-15659416
 ] 

Sean Owen commented on SPARK-18356:
---

CC [~josephkb] as this was a follow up to your comment at 
http://apache-spark-developers-list.1001551.n3.nabble.com/Issue-Resolution-Kmeans-Spark-Performances-ML-package-td19775.html

[~zahili] are you interested in investigating quieting the warning in the case 
you describe?




[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-08 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648043#comment-15648043
 ] 

zakaria hili commented on SPARK-18356:
--

Thank you for your response.
I hate that warning, so I had to track down its source.
Maybe you are right: calling cache() may be more expensive.
I think the best solution is to convert the data to an RDD myself, then call 
 instead of 







[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647970#comment-15647970
 ] 

Sean Owen commented on SPARK-18356:
---

Yes, the warning is a little ominous, but I think it is safe to ignore. The 
immediate parent is in fact cached, which means that only the brief 
transformation at the outset needs to be recomputed regularly, and that's not 
that expensive.

The problem with just calling cache() is that it forces another whole copy of 
the data set to be persisted, and always in memory. We don't necessarily want 
to force persistence of the input, even though persisting at least the parent 
RDD is pretty important.

It would be nice to avoid the warning if the parent RDD is cached, though that 
might be a little tricky. Otherwise I think this can be left as is.
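
The trade-off described in this thread can be illustrated with a toy model in 
plain Python. This is only a sketch and does not use Spark: LazyDataset, run, 
and the counter names are hypothetical, invented for illustration. It assumes 
a lazily evaluated dataset that is recomputed on every pass unless it is 
marked as cached, and it counts how often the source is loaded, how often the 
per-row conversion runs, and how many copies of the data are held.

```python
from collections import Counter


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable returning a list of rows
        self._want_cache = False
        self._rows = None          # materialized copy, if any

    def cache(self):
        self._want_cache = True
        return self

    def unpersist(self):
        self._want_cache = False
        self._rows = None
        return self

    @property
    def holds_copy(self):
        return self._rows is not None

    def map(self, fn):
        # The derived dataset recomputes from its parent every time it is read.
        return LazyDataset(lambda: [fn(row) for row in self.collect()])

    def collect(self):
        if self._rows is not None:
            return self._rows
        rows = self._compute()
        if self._want_cache:
            self._rows = rows
        return rows


def run(cache_parent, cache_derived, iterations=5, n_rows=4):
    """Read the derived dataset `iterations` times and report the work done."""
    stats = Counter()

    def load():
        stats["source_loads"] += 1
        return list(range(n_rows))

    def to_vector(x):
        stats["conversions"] += 1
        return (float(x),)

    parent = LazyDataset(load)
    if cache_parent:
        parent.cache()
    derived = parent.map(to_vector)   # stands in for the DataFrame -> RDD step
    if cache_derived:
        derived.cache()               # the first of the "two lines" in the proposal
    try:
        for _ in range(iterations):   # each pass stands in for one training iteration
            derived.collect()
        stats["copies_held"] = int(parent.holds_copy) + int(derived.holds_copy)
    finally:
        derived.unpersist()           # the second line: release the extra copy
    return dict(stats)


if __name__ == "__main__":
    print("nothing cached:     ", run(False, False))
    print("parent cached only: ", run(True, False))
    print("parent and derived: ", run(True, True))
```

In this model, caching only the parent loads the source once but repeats the 
cheap conversion on every pass, while also caching the derived dataset avoids 
the repeated conversion at the price of holding a second copy of the data. 
Which choice is better in a real job depends on the cost of the conversion 
and on the memory available.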



