GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/21969

    [SPARK-24945][SQL] Switching to uniVocity 2.7.3

    ## What changes were proposed in this pull request?
    
    In this PR, I propose upgrading the uniVocity parser from **2.6.3** to
    **2.7.3**. The newer version includes a fix for the SPARK-24645 issue and
    performs better on the wide-row benchmarks below.
    
    Before changes:
    ```
    Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    One quoted string                           33336 / 34122          0.0      666727.0       1.0X

    Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
    Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
    Select one column                           25738 / 25872          0.0       25737.9       3.5X
    count()                                       6931 / 7269          0.1        6931.5      13.0X
    ```
    After changes:
    ```
    Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    One quoted string                           33411 / 33510          0.0      668211.4       1.0X

    Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
    Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
    Select one column                           22936 / 22953          0.0       22936.5       3.8X
    count()                                       6657 / 6740          0.2        6656.6      13.5X
    ```
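
For a rough sense of the size of the change, the best-case times from the two runs above can be compared directly. The sketch below is only illustrative arithmetic over the figures quoted in this description; the names `BEST_MS`, `percent_change` and `summarize` are made up for the example and are not part of Spark or `CSVBenchmarks`.

```python
# Best-case times (ms) copied from the "before" (2.6.3) and "after" (2.7.3)
# benchmark tables in this description.
BEST_MS = {
    "One quoted string": (33336, 33411),
    "Select 1000 columns": (90287, 88028),
    "Select 100 columns": (31826, 29010),
    "Select one column": (25738, 22936),
    "count()": (6931, 6657),
}


def percent_change(before_ms, after_ms):
    """Relative change in run time; negative means the newer version is faster."""
    return (after_ms - before_ms) / before_ms * 100.0


def summarize(times=BEST_MS):
    """Map each benchmark case to the percent change in its best time."""
    return {case: percent_change(b, a) for case, (b, a) in times.items()}


if __name__ == "__main__":
    for case, change in summarize().items():
        print(f"{case:<22}{change:+6.1f}%")
```

On these numbers the wide-row cases get faster by roughly 2.5% to 11%, while the quoted-string case is essentially unchanged. The figures come from single benchmark runs, so small differences may be within noise.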
    Closes #21892 
    
    ## How was this patch tested?
    
    It was tested with `CSVSuite` and `CSVBenchmarks`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 univocity-2_7_3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21969
    
----
commit 7b569ae1318316129d4b0d46969b02324b18b0aa
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T11:59:39Z

    Bumping version of uniVocity parser up to 2.7.2

commit b116987d9a0adb887201177d41c1b94e6f5aeb63
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T13:25:11Z

    Call uniVocity even the set of selected columns is empty

commit 3fb9cf76df65abe14dd39d233d18242e72e0a729
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-08-02T09:14:27Z

    Bumping version to 2.7.3

commit a053994bcc6027668f64c9e55d09dfaa45cb97cf
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-08-02T09:14:48Z

    Revert "Call uniVocity even the set of selected columns is empty"
    
    This reverts commit b116987d9a0adb887201177d41c1b94e6f5aeb63.

----

