[jira] [Commented] (SPARK-32274) Add in the ability for a user to replace the serialization format of the cache

Robert Joseph Evans (Jira) Fri, 10 Jul 2020 11:21:54 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155660#comment-17155660
 ]


Robert Joseph Evans commented on SPARK-32274:
---------------------------------------------

I filed [https://github.com/apache/spark/pull/29067] for this.

> Add in the ability for a user to replace the serialization format of the cache
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-32274
>                 URL: https://issues.apache.org/jira/browse/SPARK-32274
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> Caching a dataset or dataframe can be a very expensive operation, but has a 
> huge benefit for later queries that use it.  There are many use cases that 
> could benefit from caching the data but not enough to justify the current 
> scheme.  I would like to propose that we make the serialization of the 
> cache pluggable.  That way users can explore other formats and compression 
> codecs.
>  
> As an example I took the lineitem table from TPC-H at a scale factor of 10 
> and converted it to parquet.  This resulted in 2.1 GB of data on disk. With 
> the current caching it can take nearly 8 GB to store that same data in 
> memory, and about 5 GB to store it on disk.
>  
> Suppose I want to read all of that data and write it out again:
> {code:java}
> scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
> a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint 
> ... 14 more fields]
> scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
> Time taken: 25832 ms {code}
> But a query that reads that data directly from the cache after it is built 
> only takes 21531 ms. For some queries, being able to store much more data in 
> the cache might be worth the extra query time.
>  
> It also takes a lot less time to do the parquet compression than it 
> does to do the cache compression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

