[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Attachment: test_file_without_eof_char.csv

> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Kumaresh C R
> Priority: Minor
> Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
> Try to read the attached CSV file with the following parse options:
>
> scala> val csvFile = spark.read.option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv")
>
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
>
> Notice that this works fine.
>
> If we add the option "multiLine" = "true", it fails with the exception below. This happens only when the "comment" character is also the first character of the input dataset's last line:
>
> scala> val csvFile = spark.read.option("header", "true").option("multiLine", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv")
>
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached
> Parser Configuration: CsvParserSettings:
>     Auto configuration enabled=true
>     Autodetect column delimiter=false
>     Autodetect quotes=false
>     Column reordering enabled=true
>     Empty value=null
>     Escape unquoted values=false
>     Header extraction enabled=null
>     Headers=null
>     Ignore leading whitespaces=false
>     Ignore trailing whitespaces=false
>     Input buffer size=128
>     Input reading on separate thread=false
>     Keep escape sequences=false
>     Keep quotes=false
>     Length of content displayed on error=-1
>     Line separator detection enabled=false
>     Maximum number of characters per column=-1
>     Maximum number of columns=20480
>     Normalize escaped line separators=true
>     Null value=
>     Number of records to read=all
>     Processor=none
>     Restricting data in exceptions=false
>     RowProcessor error handler=null
>     Selected fields=none
>     Skip empty lines=true
>     Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration: CsvFormat:
>     Comment character=c
>     Field delimiter=,
>     Line separator (normalized)=\n
>     Line separator sequence=\r\n
>     Quote character="
>     Quote escape character=\
>     Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
>     at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>     at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
>     at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
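The failing shape can be imitated outside Spark with a small sketch (plain Python rather than the Spark shell, and a hypothetical `strip_comment_lines` helper): dropping comment-prefixed lines before any parser sees the input means the last line of the file never begins with the comment character. This is only a preprocessing workaround under those assumptions, not a description of Spark's or univocity's internals.

```python
import csv
import io

def strip_comment_lines(text: str, comment_char: str = "c") -> str:
    """Drop lines starting with the comment character before parsing,
    mirroring what the 'comment' option is meant to do."""
    kept = [ln for ln in text.splitlines() if not ln.startswith(comment_char)]
    # Always terminate with a newline so the parser never sees a
    # comment-prefixed final line with no line ending.
    return "\n".join(kept) + "\n"

# Sample content shaped like testCommentChar.csv: a header, one data
# row, and a final line starting with the comment character 'c'.
raw = "a,b\n1,2\ncomment at end"
cleaned = strip_comment_lines(raw)
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows)  # [['a', 'b'], ['1', '2']]
```

With the comment line removed up front, the multi-line code path never has to skip a trailing comment line at end of input.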
[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259157#comment-16259157 ]

Kumaresh C R commented on SPARK-22516:

[~mgaido]: Even after I replaced every 'CR LF' with 'LF', the error is still thrown in the following case: when the file does not end with an 'LF', i.e. there is no line ending before EOF (every other line in the file ends with 'LF'). I have attached the failing file 'test_file_without_eof_char.csv' for reference. Is this a problem with the parser, or with the input data (which has no line ending as its last character)?
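The missing-final-newline case described in the comment above can be illustrated with a hedged sketch (plain Python, hypothetical `ensure_trailing_newline` helper; this is a caller-side workaround under assumptions, not a fix inside Spark or univocity): appending a line terminator when the file lacks one means the parser never hits end-of-input in the middle of the last line.

```python
def ensure_trailing_newline(data: bytes) -> bytes:
    """Append a final line ending if the file lacks one, so a last line
    beginning with the comment character is still properly terminated."""
    return data if data.endswith(b"\n") else data + b"\n"

# Content shaped like test_file_without_eof_char.csv: every line ends
# with LF except the last, which stops at EOF with no line ending.
raw = b"a,b\n1,2\ncomment line"
fixed = ensure_trailing_newline(raw)
print(fixed)  # b'a,b\n1,2\ncomment line\n'
```

A file that already ends with a newline passes through unchanged, so the step is safe to apply unconditionally before upload.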
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Labels: csvparser (was: )
[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251110#comment-16251110 ]

Kumaresh C R commented on SPARK-22516:

[~hyukjin.kwon]: Need your help here :)
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Attachment: testCommentChar.csv
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Description: (updated)
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Description: (updated)
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Description: (updated)
[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
[ https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kumaresh C R updated SPARK-22516:
    Description: (updated)
[jira] [Created] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
Kumaresh C R created SPARK-22516:

    Summary: CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character
    Key: SPARK-22516
    URL: https://issues.apache.org/jira/browse/SPARK-22516
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 2.2.0
    Reporter: Kumaresh C R
End of input reached
[Parser configuration and stack trace identical to the exception quoted above.]
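The "Unable to skip 1 lines from line 2. End of input reached" message suggests the multiLine code path skips a comment line by searching for the next line separator, and errors when the comment line is the file's last line with no trailing newline. The following toy Python sketch illustrates that failure mode; it is an illustration of the suspected mechanism, not Spark's or univocity's actual code:

```python
def skip_comment_line(buf: str, pos: int) -> int:
    """Return the index just past the comment line that starts at pos.

    A naive skipper insists on finding a line separator and fails at end
    of input (mirroring "Unable to skip 1 lines ... End of input reached");
    a robust one treats EOF as the end of the comment line.
    """
    nl = buf.find("\n", pos)
    if nl == -1:
        # naive version would do:
        #   raise ValueError("Unable to skip 1 lines: end of input reached")
        return len(buf)  # robust: the comment line ends at EOF
    return nl + 1


# "c ..." is the last line and has no trailing newline, as in the
# attached test_file_without_eof_char.csv scenario.
data = "a,b\n1,2\nc this comment is the last line"
assert skip_comment_line(data, 8) == len(data)
```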
[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405 ] Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:24 PM:
---
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal https://github.com/apache/spark/pull/18581 to be merged. Thanks a lot :)

was (Author: crkumaresh24):
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal https://github.com/apache/spark/pull/18581to be merged. Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Kumaresh C R
> Labels: features
> Attachments: windows_CRLF.csv
>
> With multiLine=true, windows CR LF is not getting parsed properly. If I make multiLine=false, it parses properly. Could you please help here?
> Attached the CSV used in the below commands for your reference.
>
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
>
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
>
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
>
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405 ] Kumaresh C R commented on SPARK-21820:
--
[~hyukjin.kwon]: Sound great.. We will wait for your proposal https://github.com/apache/spark/pull/18581to be merged. Thanks a lot :)
[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405 ] Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:13 PM:
---
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal https://github.com/apache/spark/pull/18581to be merged. Thanks a lot :)

was (Author: crkumaresh24):
[~hyukjin.kwon]: Sound great.. We will wait for your proposal https://github.com/apache/spark/pull/18581to be merged. Thanks a lot :)
[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324 ] Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:21 PM:
---
[~hyukjin.kwon]: Could you please help us here? This issue occurs after we moved to "multiLine" as "true"

was (Author: crkumaresh24):
[~hyukjin.kwon]: Could you please help us here ?This issue after we moved to "multiLine" as "true"
[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327 ] Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:20 PM:
---
[~sowen] This is an issue with spark databricks-CSV reading. I could not find any such option in the filter. Could you please help me what could be the proper component for this bug ?

was (Author: crkumaresh24):
@Sean Owen: This is an issue with spark databricks-CSV reading. I could not find any such option in the filter. Could you please help me what could be the proper component for this bug ?
[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327 ] Kumaresh C R commented on SPARK-21820:
--
@Sean Owen: This is an issue with spark databricks-CSV reading. I could not find any such option in the filter. Could you please help me what could be the proper component for this bug ?
[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324 ] Kumaresh C R commented on SPARK-21820:
--
[~hyukjin.kwon]: Could you please help us here ?This issue after we moved to "multiLine" as "true"
[jira] [Updated] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kumaresh C R updated SPARK-21820:
-
Attachment: windows_CRLF.csv
[jira] [Created] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
Kumaresh C R created SPARK-21820:

Summary: csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly
Key: SPARK-21820
URL: https://issues.apache.org/jira/browse/SPARK-21820
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R

With multiLine=true, windows CR LF is not getting parsed properly. If I make multiLine=false, it parses properly. Could you please help here?
Attached the CSV used in the below commands for your reference.

scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]

scala> csvFile.schema.fieldNames
res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)

scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]

scala> csvFile.schema.fieldNames
")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered
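A plausible explanation for the garbled `")s1: Array[String] = ...` output above (an assumption, not confirmed anywhere in this thread): in multiLine mode the input appears to be split on LF alone, so the carriage return of each CRLF stays attached to the last field, and printing a string ending in `\r` makes the terminal jump back to column 0, visually mangling the line. A minimal Python illustration of that mechanism:

```python
# A CRLF-terminated header and data row, as in windows_CRLF.csv.
crlf_text = "Sales_Dollars,Created_Date,Order_Delivered\r\n100,2017-01-01,true\r\n"

# Splitting on "\n" alone leaves the carriage return attached to the
# last field of every line -- the field name is really "Order_Delivered\r".
naive_fields = crlf_text.split("\n")[0].split(",")
assert naive_fields[-1] == "Order_Delivered\r"

# Treating "\r\n" as the line separator yields the clean field name.
proper_fields = crlf_text.split("\r\n")[0].split(",")
assert proper_fields[-1] == "Order_Delivered"
```

This is consistent with multiLine=false working: that code path evidently recognizes `\r\n` as a line separator, while the multiLine path seemingly does not.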
[jira] [Updated] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
[ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kumaresh C R updated SPARK-14194:
-
Description:

We have CSV content like below:

Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
"1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,","USA", "1234567"

Since there is a '\n\r' character in the middle of the row (to be exact, in the Address column), when we execute the Spark code below, it creates the dataframe with two data rows (excluding the header row), which is wrong. Since we have specified the quote (") character as the field enclosure, why does it treat the character in the middle of the quoted field as a newline? This creates an issue while processing the created dataframe.

DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", delim)
    .option("quote", quote)
    .option("escape", escape)
    .load(sourceFile);

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
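For comparison, a newline-aware CSV parser keeps a line break inside a quoted field within a single record. Python's stdlib csv module (used here purely as a reference implementation, not what Spark runs) shows the expected behavior for data shaped like the report above; the column names and values are simplified stand-ins, not the reporter's exact file:

```python
import csv
import io

# A quoted Address field containing an embedded CRLF, one header row
# and one data row.
data = 'Sl_No,Name,Address\r\n"1","ABCD","XZ Street\r\nMunicipality"\r\n'

# newline='' disables newline translation so the reader sees raw CRLFs.
rows = list(csv.reader(io.StringIO(data, newline="")))

# Two rows total: the CRLF inside the quoted field does NOT start a
# new record -- which is what the reporter expected from Spark.
assert len(rows) == 2
assert rows[1][2] == "XZ Street\r\nMunicipality"
```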
[jira] [Created] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
Kumaresh C R created SPARK-14194:

Summary: spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
Key: SPARK-14194
URL: https://issues.apache.org/jira/browse/SPARK-14194
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Reporter: Kumaresh C R