Takeshi Yamamuro created SPARK-21024:
----------------------------------------

             Summary: CSV parse mode handles Univocity parser exceptions
                 Key: SPARK-21024
                 URL: https://issues.apache.org/jira/browse/SPARK-21024
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Takeshi Yamamuro
            Priority: Minor


The current master cannot skip illegal records when the Univocity parser itself throws an exception (e.g., TextParsingException), even in DROPMALFORMED mode.
This comes from the spark-user mailing list:
https://www.mail-archive.com/user@spark.apache.org/msg63985.html

{code}
scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
scala> val df = spark.read.format("csv").schema("a int, b int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
scala> df.show

com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 3
Hint: Number of columns processed may have exceeded limit of 3 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
        Auto configuration enabled=true
        Autodetect column delimiter=false
        Autodetect quotes=false
        Column reordering enabled=true
        Empty value=null
        Escape unquoted values=false
        ...

at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
...
{code}

We could fix this by catching the parser's runtime exceptions in UnivocityParser and handling them according to the parse mode, e.g.:
https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser
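The idea is to catch the exception per record and let the parse mode decide whether to null out, drop, or fail, as FailureSafeParser already does for malformed rows. A minimal, Spark-free sketch of that pattern (all names here are illustrative, not Spark's or Univocity's actual API):

{code}
// Self-contained sketch of the failure-safe pattern the fix would apply.
object ParseModeSketch {
  sealed trait ParseMode
  case object Permissive    extends ParseMode // keep the record as a null row
  case object DropMalformed extends ParseMode // silently skip the record
  case object FailFast      extends ParseMode // rethrow immediately

  // Toy stand-in for the Univocity call that throws when a line has more
  // fields than maxColumns (the TextParsingException in the report above).
  def parseLine(line: String, maxColumns: Int): Array[String] = {
    val fields = line.split(",")
    if (fields.length > maxColumns) {
      throw new RuntimeException(s"Number of columns exceeded limit of $maxColumns")
    }
    fields
  }

  // None models a dropped record; Some(None) models a null row (PERMISSIVE).
  def parseAll(lines: Seq[String], maxColumns: Int,
               mode: ParseMode): Seq[Option[Array[String]]] =
    lines.flatMap { line =>
      try Some(Some(parseLine(line, maxColumns)))
      catch {
        case e: RuntimeException => mode match {
          case Permissive    => Some(None)
          case DropMalformed => None
          case FailFast      => throw e
        }
      }
    }
}
{code}

With the input from the repro above ("0,1" and "0,1,2,3", maxColumns = 3), DropMalformed keeps only the first record instead of surfacing the ArrayIndexOutOfBoundsException to the user.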



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
