Kumaresh C R created SPARK-22516:
------------------------------------

             Summary: CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
                 Key: SPARK-22516
                 URL: https://issues.apache.org/jira/browse/SPARK-22516
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Kumaresh C R
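The CSV attachment itself is not inlined in this message. A minimal file of the shape the summary describes — a header row followed by lines whose first character is the configured comment character "c", so that the *last* line of the file is a comment — would look something like this (a hypothetical reconstruction for illustration, not the actual attachment):

{code}
a,b
c this row is a comment
c this last line is also a comment
{code}

With such a file, every non-header line is skipped as a comment, which is consistent with the empty {{csvFile.show}} output below; the failure in multiLine mode occurs when the parser tries to skip past the comment on the final line and runs out of input.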
Try to read the attached CSV file with the following parse options:

{code}
scala> val csvFile = spark.read.option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/textCommentChar.csv")
csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

scala> csvFile.show
+---+---+
|  a|  b|
+---+---+
+---+---+
{code}

Notice that it works fine. If we add the option "multiLine" = "true", it fails with the exception below:

{code}
scala> val csvFile = spark.read.option("header", "true").option("multiLine", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/textCommentChar.csv")
{code}

{noformat}
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Empty value=null
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore trailing whitespaces=false
	Input buffer size=128
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=-1
	Line separator detection enabled=false
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=none
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITER
Format configuration: CsvFormat:
	Comment character=c
	Field delimiter=,
	Line separator (normalized)=\n
	Line separator sequence=\r\n
	Quote character="
	Quote escape character=\
	Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
	at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Unable to skip 1 lines from line 2. End of input reached
	at com.univocity.parsers.common.input.AbstractCharInputReader.skipLines(AbstractCharInputReader.java:262)
	at com.univocity.parsers.common.AbstractParser.processComment(AbstractParser.java:96)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:440)
	... 24 more

17/11/14 14:26:17 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 8, localhost, executor driver): com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
	(parser configuration and stack trace identical to the executor exception above)

17/11/14 14:26:17 ERROR TaskSetManager: Task 0 in stage 8.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost, executor driver): com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
	(parser configuration and stack trace identical to the executor exception above)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
	at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
	at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
	... 48 elided
Caused by: com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
	(parser configuration and stack trace identical to the executor exception above)
Caused by: java.lang.IllegalArgumentException: Unable to skip 1 lines from line 2. End of input reached
	at com.univocity.parsers.common.input.AbstractCharInputReader.skipLines(AbstractCharInputReader.java:262)
	at com.univocity.parsers.common.AbstractParser.processComment(AbstractParser.java:96)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:440)
	... 24 more
{noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org