[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611633#comment-16611633 ]
Kazuaki Ishizaki commented on SPARK-16196:
------------------------------------------

[~cloud_fan] The PR in this Jira entry proposes two fixes:
# Read data in a table cache directly from a columnar storage
# Generate code to build a table cache

We have already implemented 1, but we have not implemented 2 yet. Let us address 2 in the next release.

> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
>                 Key: SPARK-16196
>                 URL: https://issues.apache.org/jira/browse/SPARK-16196
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 10000) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression
> takes a long time. The second is that there are many virtual function
> calls in this hot code path, since the rows are processed using iterators.
> Further, the rows are converted to and from ByteBuffers, which are slow to
> read in general.
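To give a feel for the iterator-overhead point in the quoted description (this is a toy illustration in plain Python, not Spark code; the column names `id` and `k` mirror the benchmark above, and Spark's actual fix is the JVM-side ColumnarBatch path):

```python
import array

N = 100_000

# Row-wise layout: a list of (id, k) tuples, consumed one row at a time.
# Each row costs an iterator step plus per-row field extraction -- the
# kind of per-row call overhead the issue describes.
rows = [(i, i % 10_000) for i in range(N)]

def count_row_wise():
    count_k = 0
    count_id = 0
    for row in rows:          # one iterator step per row
        if row[1] is not None:  # per-row field access for k
            count_k += 1
        if row[0] is not None:  # per-row field access for id
            count_id += 1
    return count_k, count_id

# Columnar layout: one contiguous array per column, so an aggregate like
# count() can be answered from column metadata without touching rows.
col_id = array.array("q", range(N))
col_k = array.array("q", (i % 10_000 for i in range(N)))

def count_columnar():
    # no nulls in this toy data, so count == column length
    return len(col_k), len(col_id)

assert count_row_wise() == count_columnar()
```

Timing the two functions (e.g. with `timeit`) shows the row-wise version paying a per-row cost that the columnar version avoids entirely, which is the same shape of win fix 1 targets for in-memory table scans.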