Repository: spark
Updated Branches:
  refs/heads/master effca9868 -> 9948b860a
[SPARK-22516][SQL] Bump up Univocity version to 2.5.9

## What changes were proposed in this pull request?

There was a bug in the Univocity parser that caused the issue reported in SPARK-22516. It was fixed by upgrading the library from version 2.5.4 to 2.5.9:

**Executing**
```
spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show()
```

**Before**
```
ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
...
Internal state when error was thrown: line=3, column=0, record=2, charIndex=31
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```

**After**
```
+-------+-------+
|column1|column2|
+-------+-------+
|    abc|    def|
+-------+-------+
```

## How was this patch tested?

The existing `CSVSuite` test "commented lines in CSV data" was extended to also parse the file in multiLine mode. The test input file was modified to include a comment in the last line as well.

Author: smurakozi <smurak...@gmail.com>

Closes #19906 from smurakozi/SPARK-22516.
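The behavior the fix restores can be sketched without Spark. The following is a minimal, hypothetical plain-Scala illustration (not Spark's actual `UnivocityParser` logic): lines starting with the configured comment character are dropped, including a final comment line that has no trailing newline, which is the case that crashed with 2.5.4.

```scala
// Hypothetical sketch of comment-line skipping, for illustration only.
object CommentSkipSketch {
  // Drop lines that begin with the comment character; empty lines are
  // dropped too, mirroring how comment-only input yields no records.
  def skipComments(input: String, comment: Char): Seq[String] =
    input.split("\n").toSeq
      .filter(line => line.nonEmpty && line.head != comment)

  def main(args: Array[String]): Unit = {
    // Note: the input ends with a comment line and no end-of-file newline.
    val csv = "1,2,3\n~ a comment\n4,5,6\n~ comment in last line"
    skipComments(csv, '~').foreach(println)
  }
}
```

The point of the second test line in the suite change below is exactly this edge: a comment as the very last line of a file without a terminating newline must be skipped cleanly rather than raise "End of input reached".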
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9948b860
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9948b860
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9948b860

Branch: refs/heads/master
Commit: 9948b860aca42bbc6478ddfbc0ff590adb00c2f3
Parents: effca98
Author: smurakozi <smurak...@gmail.com>
Authored: Wed Dec 6 13:22:08 2017 -0800
Committer: Marcelo Vanzin <van...@cloudera.com>
Committed: Wed Dec 6 13:22:08 2017 -0800

----------------------------------------------------------------------
 dev/deps/spark-deps-hadoop-2.6                |  2 +-
 dev/deps/spark-deps-hadoop-2.7                |  2 +-
 sql/core/pom.xml                              |  2 +-
 .../src/test/resources/test-data/comments.csv |  1 +
 .../execution/datasources/csv/CSVSuite.scala  | 23 +++++++++++---------
 5 files changed, 17 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/dev/deps/spark-deps-hadoop-2.6
----------------------------------------------------------------------
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6
index 2c68b73..3b5a694 100644
--- a/dev/deps/spark-deps-hadoop-2.6
+++ b/dev/deps/spark-deps-hadoop-2.6
@@ -180,7 +180,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/dev/deps/spark-deps-hadoop-2.7
----------------------------------------------------------------------
diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7
index 2aaac60..64136ba 100644
--- a/dev/deps/spark-deps-hadoop-2.7
+++ b/dev/deps/spark-deps-hadoop-2.7
@@ -181,7 +181,7 @@ stax-api-1.0.1.jar
 stream-2.7.0.jar
 stringtemplate-3.2.1.jar
 super-csv-2.2.0.jar
-univocity-parsers-2.5.4.jar
+univocity-parsers-2.5.9.jar
 validation-api-1.1.0.Final.jar
 xbean-asm5-shaded-4.4.jar
 xercesImpl-2.9.1.jar

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/pom.xml
----------------------------------------------------------------------
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index 4db3fea..93010c6 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -38,7 +38,7 @@
     <dependency>
       <groupId>com.univocity</groupId>
       <artifactId>univocity-parsers</artifactId>
-      <version>2.5.4</version>
+      <version>2.5.9</version>
       <type>jar</type>
     </dependency>
     <dependency>

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/src/test/resources/test-data/comments.csv
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/test-data/comments.csv b/sql/core/src/test/resources/test-data/comments.csv
index 6275be7..c0ace46 100644
--- a/sql/core/src/test/resources/test-data/comments.csv
+++ b/sql/core/src/test/resources/test-data/comments.csv
@@ -4,3 +4,4 @@
 6,7,8,9,0,2015-08-21 16:58:01
 ~0,9,8,7,6,2015-08-22 17:59:02
 1,2,3,4,5,2015-08-23 18:00:42
+~ comment in last line to test SPARK-22516 - do not add empty line at the end of this file!
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/9948b860/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index e439699..4fe4542 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -483,18 +483,21 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
   }
 
   test("commented lines in CSV data") {
-    val results = spark.read
-      .format("csv")
-      .options(Map("comment" -> "~", "header" -> "false"))
-      .load(testFile(commentsFile))
-      .collect()
+    Seq("false", "true").foreach { multiLine =>
 
-    val expected =
-      Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
-        Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
-        Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+      val results = spark.read
+        .format("csv")
+        .options(Map("comment" -> "~", "header" -> "false", "multiLine" -> multiLine))
+        .load(testFile(commentsFile))
+        .collect()
 
-    assert(results.toSeq.map(_.toSeq) === expected)
+      val expected =
+        Seq(Seq("1", "2", "3", "4", "5.01", "2015-08-20 15:57:00"),
+          Seq("6", "7", "8", "9", "0", "2015-08-21 16:58:01"),
+          Seq("1", "2", "3", "4", "5", "2015-08-23 18:00:42"))
+
+      assert(results.toSeq.map(_.toSeq) === expected)
+    }
   }
 
   test("inferring schema with commented lines in CSV data") {

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org