Justin Pihony created SPARK-13744:
-------------------------------------

             Summary: Dataframe RDD caching increases the input size for subsequent stages
                 Key: SPARK-13744
                 URL: https://issues.apache.org/jira/browse/SPARK-13744
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
         Environment: OSX
            Reporter: Justin Pihony
            Priority: Minor
Given the code below, the first run of count reports an input size of ~90KB, and the next run (immediately after cache is set) still reports the same input size. However, every subsequent run reports a MUCH larger input size (500MB is listed, at 38%, for a default run). As far as I can see, this size discrepancy is a bug in the caching of a DataFrame's RDD.

{code}
import sqlContext.implicits._

case class Person(name: String = "Test", number: Double = 1000.2)

val people = sc.parallelize(1 to 10000000, 50).map { p => Person() }.toDF
people.write.parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")
parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
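For comparison, a minimal variant of the repro that caches the DataFrame itself rather than its .rdd view may help isolate the issue: DataFrame.cache() goes through the SQL in-memory columnar cache, which is a separate code path from plain RDD caching. This is only a sketch (assuming the same shell session and the "people.parquet" file written above); the variable name parquetFile2 is mine:

{code}
// Sketch (not part of the original report): cache the DataFrame directly
// instead of parquetFile.rdd, then re-run count and compare the input size
// reported in the UI across the three runs. If the size jump does not occur
// here, the discrepancy is specific to caching the DataFrame's RDD view.
val parquetFile2 = sqlContext.read.parquet("people.parquet")
parquetFile2.count()
parquetFile2.cache()
parquetFile2.count()
parquetFile2.count()
{code}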