Justin Pihony created SPARK-13744:
-------------------------------------

             Summary: DataFrame RDD caching increases the input size for subsequent stages
                 Key: SPARK-13744
                 URL: https://issues.apache.org/jira/browse/SPARK-13744
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
         Environment: OSX
            Reporter: Justin Pihony
            Priority: Minor


Given the code below, the first run of count() reports an input size of roughly 90 KB, and even the next run, after cache() has been called, reports the same input size. However, every subsequent run reports a much larger input size (roughly 500 MB, listed as 38% cached, on a default run). This discrepancy appears to be a bug in the caching of a DataFrame's RDD as far as I can tell.

{code}
import sqlContext.implicits._

case class Person(name: String = "Test", number: Double = 1000.2)

// Write 10 million rows of identical data out as Parquet, across 50 partitions.
val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF()

people.write.parquet("people.parquet")

val parquetFile = sqlContext.read.parquet("people.parquet")

parquetFile.rdd.count() // first run: input size reported as ~90 KB
parquetFile.rdd.cache()
parquetFile.rdd.count() // same input size as the uncached run
parquetFile.rdd.count() // input size jumps to ~500 MB
{code}
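
For reference, a minimal diagnostic sketch (an addition, not part of the original repro) that can show what is actually persisted between the runs above, using only standard SparkContext/RDD calls:

{code}
// Print the lineage of the RDD together with its current storage level.
println(parquetFile.rdd.toDebugString)

// List every RDD currently marked as persistent in this SparkContext.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id persisted at level ${rdd.getStorageLevel}")
}
{code}

Comparing the storage level across the second and third count() runs may help distinguish a metrics-reporting issue from an actual re-caching issue.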


