[ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186058#comment-15186058 ]

Stavros Kontopoulos edited comment on SPARK-13744 at 3/8/16 11:27 PM:
----------------------------------------------------------------------

I understand that; my question is the following. I ran the same example on 
master (not 1.6), along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show 216 MB as input? Shouldn't it already be 359 MB? The 
second count should have cached everything, right?
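
For what it's worth, here is a minimal sketch (assuming a Spark shell where {{sc}} and {{parquetFile}} are defined as in the description; {{rows}} is just an illustrative name) of how the storage level and the sizes actually held by the block manager could be checked between the counts:

{code}
// Sketch only: inspect what is actually cached after the second count.
val rows = parquetFile.rdd
rows.cache()
rows.count()

// Storage level requested for this RDD (MEMORY_ONLY for a plain cache()).
println(rows.getStorageLevel)

// Partitions and bytes actually held by the block manager.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
{code}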


was (Author: skonto):
I understand that; my question is the following. I ran the same example on 
master, along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show 216 MB as input? Shouldn't it already be 359 MB? The 
second count should have cached everything, right?

> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>
>                 Key: SPARK-13744
>                 URL: https://issues.apache.org/jira/browse/SPARK-13744
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 1.6.0
>         Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>         Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
>
> Given the code below, you will see that the first run of count shows up as 
> ~90 KB of input, and even the next run, after cache has been set, results in the 
> same input size. However, every subsequent run thereafter reports an input size 
> that is MUCH larger (500 MB is listed as 38% for a default run). This size 
> discrepancy seems to be a bug in the caching of a DataFrame's RDD as far 
> as I can see. 
> {code}
> import sqlContext.implicits._
> case class Person(name: String = "Test", number: Double = 1000.2)
> val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
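> For comparison, a hedged sketch (not part of the reproduction above) of caching the 
> DataFrame itself, which goes through Spark SQL's in-memory columnar cache, versus 
> caching the row RDD obtained from it:
> {code}
> // Sketch only: two different caching paths for the same data.
> val parquetFile = sqlContext.read.parquet("people.parquet")
>
> // DataFrame-level cache (in-memory columnar cache):
> parquetFile.cache()
> parquetFile.count()
>
> // RDD-level cache of the rows produced from the DataFrame:
> val rows = parquetFile.rdd
> rows.cache()
> rows.count()
> {code}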


