GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/21892
[SPARK-24945][SQL] Switching to uniVocity 2.7.2

## What changes were proposed in this pull request?

In the PR, I propose to upgrade the uniVocity parser from **2.6.3** to **2.7.2**. The newer version includes a fix for the SPARK-24645 issue. The corresponding uniVocity bug report is https://github.com/uniVocity/univocity-parsers/issues/250.

I removed the changes in `UnivocityParser` introduced by the commit https://github.com/apache/spark/commit/bd32b509a1728366494cba13f8f6612b7bd46ec0 but left the test from that commit in place.

## How was this patch tested?

Tested with `CSVSuite` and by running `CSVBenchmarks`. With 2.7.2 the benchmarks are roughly 0.2% - 8% slower than with 2.6.3, except for the `count()` benchmark, where the performance degradation is **x3.8**.

Before changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                            33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                          90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                           31826 / 36589          0.0       31826.4       2.8X
Select one column                            25738 / 25872          0.0       25737.9       3.5X
count()                                        6931 / 7269          0.1        6931.5      13.0X
```
After changes:
```
Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                            34191 / 34332          0.0      683826.7       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                          90446 / 91900          0.0       90446.1       1.0X
Select 100 columns                           34315 / 39895          0.0       34314.9       2.6X
Select one column                            27955 / 28125          0.0       27954.8       3.2X
count()                                      27713 / 27803          0.0       27712.8       3.3X
```

You can merge this pull request into a Git
repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 univocity-2_7_2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21892.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21892

----
commit 7b569ae1318316129d4b0d46969b02324b18b0aa
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T11:59:39Z

    Bumping version of uniVocity parser up to 2.7.2

commit b116987d9a0adb887201177d41c1b94e6f5aeb63
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-07-27T13:25:11Z

    Call uniVocity even the set of selected columns is empty

----
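For reviewers who want to re-derive the slowdown figures quoted in the testing section, here is a minimal Python sketch that compares the Best/Avg times from the two benchmark runs reported in this PR. The helper names `slowdown` and `compare` are hypothetical and exist only for this illustration; the numbers are copied from the benchmark output in the description, and nothing here is part of Spark or `CSVBenchmarks`.

```python
# Best/Avg times in ms, copied from the benchmark output in this PR description.
before = {
    "One quoted string": (33336, 34122),
    "Select 1000 columns": (90287, 91713),
    "Select 100 columns": (31826, 36589),
    "Select one column": (25738, 25872),
    "count()": (6931, 7269),
}
after = {
    "One quoted string": (34191, 34332),
    "Select 1000 columns": (90446, 91900),
    "Select 100 columns": (34315, 39895),
    "Select one column": (27955, 28125),
    "count()": (27713, 27803),
}


def slowdown(old_ms, new_ms):
    """Return new/old as a ratio: 1.0 means no change, 2.0 means twice as slow."""
    return new_ms / old_ms


def compare(before, after):
    """Map each benchmark name to (best-time ratio, average-time ratio)."""
    return {
        name: (
            slowdown(before[name][0], after[name][0]),
            slowdown(before[name][1], after[name][1]),
        )
        for name in before
    }


ratios = compare(before, after)
for name, (best, avg) in ratios.items():
    print(f"{name:<22} best x{best:.3f}  avg x{avg:.3f}")
```

On these numbers the `count()` ratio based on average times comes out at about 3.8, which appears to be the basis of the x3.8 figure in the description; the ratio based on best times is closer to 4.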