[ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186848#comment-15186848 ]
Sean Owen commented on SPARK-13744:
-----------------------------------

Caching happens asynchronously. It is not necessarily completed after cache() or even after the next count(). Each time, I'd expect you to see it reading larger amounts of data as more comes from cached blocks in memory. However, it still reads partitions from disk if they are not cached yet.

> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>
>                 Key: SPARK-13744
>                 URL: https://issues.apache.org/jira/browse/SPARK-13744
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 1.6.0
>         Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>         Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
> Given the below code, you will see that the first run of count shows up as
> ~90KB, and even the next run with cache being set will result in the same
> input size. However, every subsequent run thereafter will result in an input
> size that is MUCH larger (500MB is listed as 38% for a default run). This
> size discrepancy seems to be a bug in the caching of a dataframe's RDD as far
> as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 10000000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
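
To see the lazy-caching behaviour described in the comment above, a minimal sketch (assuming a Spark 1.6 spark-shell, so {{sc}} and {{sqlContext}} are predefined, and the people.parquet file written by the repro code; the getRDDStorageInfo check is illustrative and not part of the original report) could look like this:

{code}
import org.apache.spark.storage.StorageLevel

val parquetFile = sqlContext.read.parquet("people.parquet")

// Hold on to a single RDD reference and mark it for caching.
// persist() is lazy: nothing is materialized until an action runs.
val rdd = parquetFile.rdd
rdd.persist(StorageLevel.MEMORY_ONLY)

rdd.count()   // partitions are cached as this action computes them

// Check how much of the RDD actually made it into the block store so far.
sc.getRDDStorageInfo.find(_.id == rdd.id).foreach { info =>
  println(s"cached ${info.numCachedPartitions} of ${info.numPartitions} partitions, " +
    s"${info.memSize} bytes in memory")
}

rdd.count()   // reads cached partitions from memory, the rest from disk
{code}

With partitions served from memory, the stage's reported input size reflects the deserialized in-memory block sizes rather than the compressed Parquet bytes, which is consistent with the larger numbers shown in the attached screenshots.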