Repository: spark
Updated Branches:
  refs/heads/master effca9868 -> 9948b860a
[SPARK-22516][SQL] Bump up Univocity version to 2.5.9

## What changes were proposed in this pull request?

There was a bug in the Univocity parser that caused the issue reported in SPARK-22516. It was fixed by upgrading the library from version 2.5.4 to 2.5.9:

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```

**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```

**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```

## How was this patch tested?

The existing `CSVSuite` test "commented lines in CSV data" was extended to also parse the file in multiLine mode. The test input file was modified to include a comment in the last line as well.

Author: smurakozi <smurak...@gmail.com>

Closes #19906 from smurakozi/SPARK-22516.
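The behavior the fix restores can be sketched without Spark. The following is a minimal, hypothetical plain-Scala illustration (not Spark's actual `UnivocityParser` logic): lines starting with the configured comment character are dropped, including a final comment line that has no trailing newline, which is the case that crashed with 2.5.4.

```scala
// Hypothetical sketch of comment-line skipping, for illustration only.
object CommentSkipSketch {
  // Drop lines that begin with the comment character; empty lines are
  // dropped too, mirroring how comment-only input yields no records.
  def skipComments(input: String, comment: Char): Seq[String] =
    input.split("\n").toSeq
      .filter(line => line.nonEmpty && line.head != comment)

  def main(args: Array[String]): Unit = {
    // Note: the input ends with a comment line and no end-of-file newline.
    val csv = "1,2,3\n~ a comment\n4,5,6\n~ comment in last line"
    skipComments(csv, '~').foreach(println)
  }
}
```

The point of the second test line in the suite change below is exactly this edge: a comment as the very last line of a file without a terminating newline must be skipped cleanly rather than raise "End of input reached".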
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9948b860
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9948b860
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9948b860

Branch: refs/heads/master
Commit: 9948b860aca42bbc6478ddfbc0ff590adb00c2f3
Parents: effca98
Author: smurakozi <smurak...@gmail.com>
Authored: Wed Dec 6 13:22:08 2017 -0800
Committer: Marcelo Vanzin <van...@cloudera.com>
Committed: Wed Dec 6 13:22:08 2017 -0800

----------------------------------------------------------------------
 dev/deps/spark-deps-hadoop-2.6                |  2 +-
 dev/deps/spark-deps-hadoop-2.7                |  2 +-
 sql/core/pom.xml                              |  2 +-
 .../src/test/resources/test-data/comments.csv |  1 +
 .../execution/datasources/csv/CSVSuite.scala  | 23 +++++++++++---------
 5 files changed, 17 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/dev/deps/spark-deps-hadoop-2.6
----------------------------------------------------------------------
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 2c68b73..3b5a694 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -180,7 +180,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/dev/deps/spark-deps-hadoop-2.7
----------------------------------------------------------------------
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 2aaac60..64136ba 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -181,7 +181,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/pom.xml
----------------------------------------------------------------------
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index 4db3fea..93010c6 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -38,7 +38,7 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.5.4</version>
+      <version>2.5.9</version>
       <type>jar</type>
     </dependency>
     <dependency>

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/src/test/resources/test-data/comments.csv
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/test-data/comments.csv b/sql/core/src/test/resources/test-data/comments.csv
index 6275be7..c0ace46 100644
--- a/sql/core/src/test/resources/test-data/comments.csv
+++ b/sql/core/src/test/resources/test-data/comments.csv
@@ -4,3 +4,4 @@
 6,7,8,9,0,2015-08-21 16:58:01
 ~0,9,8,7,6,2015-08-22 17:59:02
 1,2,3,4,5,2015-08-23 18:00:42
+~ comment in last line to test SPARK-22516 - do not add empty line at the end of this file!
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index e439699..4fe4542 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -483,18 +483,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 
   test("commented lines in CSV data") {
-    val results = spark.read
-      .format("csv")
-      .options(Map("comment" -> "~", "header" -> "false"))
-      .load(testFile(commentsFile))
-      .collect()
+    Seq("false", "true").foreach { multiLine =>
 
-    val expected =
-      Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
-        Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
-        Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+      val results = spark.read
+        .format("csv")
+        .options(Map("comment" -> "~", "header" -> "false", "multiLine" -> multiLine))
+        .load(testFile(commentsFile))
+        .collect()
 
-    assert(results.toSeq.map(_.toSeq) === expected)
+      val expected =
+        Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
+          Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
+          Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+
+      assert(results.toSeq.map(_.toSeq) === expected)
+    }
   }
 
   test("inferring schema with commented lines in CSV data") {

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org