[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761233#comment-15761233 ]
zakaria hili edited comment on SPARK-18608 at 12/22/16 8:45 AM:
----------------------------------------------------------------

Hi [~mlnick],

Can I join this discussion? Could you please explain why you believe the generated RDD is cached? As you can see in https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py, the {{rdd}} property builds a new RDD on first access, so it is expected that this RDD is not cached:

{code}
    def rdd(self):
        """Returns the content as an :class:`pyspark.RDD` of :class:`Row`.
        """
        if self._lazy_rdd is None:
            jrdd = self._jdf.javaToPython()
            self._lazy_rdd = RDD(jrdd, self.sql_ctx._sc,
                                 BatchedSerializer(PickleSerializer()))
        return self._lazy_rdd
{code}

> Spark ML algorithms that check RDD cache level for internal caching double-cache data
> -------------------------------------------------------------------------------------
>
>          Key: SPARK-18608
>          URL: https://issues.apache.org/jira/browse/SPARK-18608
>      Project: Spark
>   Issue Type: Bug
>   Components: ML
>     Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence internally. They check whether the input dataset is cached, and if not they cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. This will actually always be true, since even if the dataset itself is cached, the RDD returned by {{dataset.rdd}} will not be cached. Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
>
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
>
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
>
> scala> df.persist
> res1: df.type = [num: bigint]
>
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
>
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
>
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input {{Dataset}}, but now we can, so the checks should be migrated to use {{dataset.storageLevel}}.
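For illustration, a minimal sketch (not the actual Spark ML code) of what the migrated check could look like once it consults {{dataset.storageLevel}}; the {{fitWithPersistence}} helper and the {{train}} callback are hypothetical names:

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: `train` stands in for an algorithm's iterative body.
def fitWithPersistence[T](dataset: Dataset[T])(train: Dataset[T] => Unit): Unit = {
  // Check the Dataset's own storage level (available since SPARK-16063)
  // rather than dataset.rdd.getStorageLevel, which is NONE for the freshly
  // derived RDD even when the Dataset itself is cached.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    train(dataset)
  } finally {
    // Only unpersist data this method cached itself.
    if (handlePersistence) dataset.unpersist()
  }
}
{code}

With this check, a caller that has already persisted the input keeps control of its own cache, and the algorithm caches (and later unpersists) only data it persisted itself, avoiding the double-caching described above.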