[ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185069#comment-15185069 ]
Sean Owen commented on SPARK-13744:
-----------------------------------

90KB can't be right. You have a DF of 10m objects. What are you referring to that says 90KB? I also see it take hundreds of MB in memory, as expected.

> Dataframe RDD caching increases the input size for subsequent stages
> ---------------------------------------------------------------------
>
>                 Key: SPARK-13744
>                 URL: https://issues.apache.org/jira/browse/SPARK-13744
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>        Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>
> Given the code below, the first run of count reports an input size of ~90KB, and even the next run, with cache set, reports the same input size. Every subsequent run thereafter, however, reports an input size that is MUCH larger (500MB, listed as 38%, for a default run). This size discrepancy seems to be a bug in the caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
>
> case class Person(name: String = "Test", number: Double = 1000.2)
>
> val people = sc.parallelize(1 to 10000000, 50).map { p => Person() }.toDF
> people.write.parquet("people.parquet")
>
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
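As a rough sanity check of the "hundreds of MB in memory" figure above (not part of the original report), Spark's SizeEstimator utility can approximate the deserialized heap footprint of one object and extrapolate to the 10 million rows in the repro. Note the cached RDD actually holds Row objects rather than Person instances, so treat this as an order-of-magnitude sketch only:

{code}
import org.apache.spark.util.SizeEstimator

case class Person(name: String = "Test", number: Double = 1000.2)

// Approximate deserialized size of one Person object on the JVM heap.
val bytesPerRow = SizeEstimator.estimate(Person())

// Extrapolate across the 10 million rows in the repro. Expect hundreds of MB,
// far more than the ~90KB reported for the compressed parquet scan.
val totalMB = bytesPerRow * 10000000L / (1024.0 * 1024.0)
println(s"~$bytesPerRow bytes/row, ~$totalMB MB for 10M rows")
{code}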