[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

MaxGekk Sat, 28 Jul 2018 09:28:07 -0700

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21909


    [SPARK-24959][SQL] Speed up count() for JSON and CSV

    ## What changes were proposed in this pull request?
    
    In this PR, I propose to skip invoking the CSV/JSON parser for each line when the required schema is empty. The added benchmarks for `count()` show a performance improvement of up to **3.5 times**.
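
To illustrate the idea, here is a minimal standalone sketch in plain Python, not the Spark code from this patch. The names `scan`, `count`, and `required_columns` are hypothetical and chosen only for this example. It assumes one JSON record per line and shows that when no columns are required, rows can be emitted without calling the parser at all.

```python
import json
from typing import Callable, Iterable, Iterator, Sequence, Tuple


def scan(
    lines: Iterable[str],
    required_columns: Sequence[str],
    parse: Callable[[str], dict] = json.loads,
) -> Iterator[Tuple]:
    """Yield one tuple per input line, holding only the required columns."""
    if not required_columns:
        # Empty required schema: nothing will be read from the record,
        # so emit an empty row per line and never call the parser.
        for _ in lines:
            yield ()
        return
    for line in lines:
        record = parse(line)
        yield tuple(record.get(name) for name in required_columns)


def count(lines: Iterable[str], parse: Callable[[str], dict] = json.loads) -> int:
    """Count rows; a count needs no columns, so the required schema is empty."""
    return sum(1 for _ in scan(lines, (), parse))
```

Note that in this sketch a malformed line is still counted, because nothing is parsed on the empty-schema path.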
    
    Before:
    
    ```
    Count a dataset with 10 columns:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------------
    JSON count()                               7676 / 7715          1.3         767.6
    CSV count()                                3309 / 3363          3.0         330.9
    ``` 
    
    After:
    
    ```
    Count a dataset with 10 columns:      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
    --------------------------------------------------------------------------------------
    JSON count()                               2104 / 2156          4.8         210.4
    CSV count()                                2332 / 2386          4.3         233.2
    ```
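
For reference, the per-format speedups can be derived from the Best times in the two tables above. The short calculation below is only a worked example over those published numbers (the helper name `speedup` is made up for illustration). It gives roughly 3.6x for JSON and 1.4x for CSV.

```python
# Best times in milliseconds, copied from the Before/After tables: (before, after).
best_ms = {
    "JSON count()": (7676, 2104),
    "CSV count()": (3309, 2332),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old running time to the new one."""
    return before_ms / after_ms


for name, (before, after) in best_ms.items():
    print(f"{name}: {speedup(before, after):.2f}x")
# JSON count(): 3.65x
# CSV count(): 1.42x
```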
    
    ## How was this patch tested?
    
    The patch was tested with `CSVSuite` and `JSONSuite`, as well as with the added benchmarks.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 empty-schema-optimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21909.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21909
    
----
commit bc4ce261a2d13be0a31b18f006da79b55880d409
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T15:31:20Z

    Added a benchmark for count()

commit 91250d21d4bb451062873c59df6fe3b4669bc5ff
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T15:50:15Z

    Added a CSV benchmark for count()

commit bdc5ea540b9eb62bb28606bdeb311ce5662e4bf7
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T15:59:44Z

    Speed up count()

commit d40f9bb229ab8ea9e2d95499ae203f7c41098bcd
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T16:00:17Z

    Updating CSV and JSON benchmarks for count()

commit abd8572497ff742ef6ea942864195be75a40ca71
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T16:23:03Z

    Fix benchmark's output

commit 359c4fcbfdb4f4e77faa3977f381dc8e819e46fa
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-28T16:23:44Z

    Uncomment other benchmarks

----


