[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15769718#comment-15769718 ]

zakaria hili edited comment on SPARK-18608 at 12/22/16 10:42 AM:
-----------------------------------------------------------------

[~srowen], now I understand what you mean: your goal is to optimize memory use. However, 
I think all we need is an extra check:
if the DataFrame is cached, we call the mllib algorithm directly;
if not, we cache the internal RDD.
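A rough sketch of what that check could look like on the ML side (names such as {{dataset}}, the "features" column, and the k value are illustrative assumptions, not the actual Spark code):
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// `dataset` stands for the input DataFrame of an ML fit() call (assumption).
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
val instances = dataset.select("features").rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
if (handlePersistence) {
  // the DataFrame is not cached, so cache the converted RDD before training
  instances.persist(StorageLevel.MEMORY_AND_DISK)
}
val model = new MLlibKMeans().setK(2).run(instances)  // call the mllib algo directly
if (handlePersistence) {
  instances.unpersist()
}
{code}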

But the problem with this solution is that if we do not cache the internal RDD, we 
will get a lot of warnings from the mllib package (that the input RDD is not cached), e.g.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is still true even when the DataFrame itself is cached.
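For reference, the check behind that warning is essentially the following (paraphrased from the linked line, not a verbatim copy; `data` is the RDD handed to run() and logWarning comes from Spark's internal Logging trait):
{code}
// Paraphrase of mllib KMeans.run's storage-level check
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt performance if its"
    + " parent RDDs are also uncached.")
}
{code}
Since df.rdd always returns an uncached RDD, this fires even when df itself is persisted.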

So maybe we will need to add a new public entry point to the mllib classes that takes 
handlePersistence as a parameter: if it is true, there is no need to recheck the 
input RDD's storage level.
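A purely hypothetical sketch of what such an entry point could look like (this overload does not exist in mllib today):
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical run() overload: when handlePersistence is true, the caller guarantees
// caching has already been dealt with, so we skip the storage-level recheck and warning.
def run(data: RDD[Vector], handlePersistence: Boolean): KMeansModel = {
  if (!handlePersistence && data.getStorageLevel == StorageLevel.NONE) {
    println("WARN: the input data is not directly cached")  // real code would use logWarning
  }
  ???  // the existing k-means training logic would go here, without re-checking the RDD
}
{code}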


> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-18608
>                 URL: https://issues.apache.org/jira/browse/SPARK-18608
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{Dataset}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
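> A minimal sketch of the suggested migration (illustrative only; {{dataset}} is the input {{Dataset}}):
> {code}
> // before: effectively always true, because dataset.rdd is never cached itself
> // val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
> // after (possible since SPARK-16063): reflects whether the Dataset is actually persisted
> val handlePersistence = dataset.storageLevel == StorageLevel.NONE
> {code}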


