[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-20 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Attachment: test_file_without_eof_char.csv

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
>
> scala> *val csvFile = spark.read.option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
>
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
>
> {color:#8eb021}*Notice that it works fine.*{color}
>
> If we add the option "multiLine" = "true", it fails with the exception below. This happens only when the "comment" character is the same as the first character of the input dataset's last line:
>
> scala> *val csvFile = spark.read.option("header", "true").{color:red}option("multiLine", "true"){color}.option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITER
> Format configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
> at 
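The "Unable to skip 1 lines from line 2. End of input reached" failure above can be illustrated with a minimal sketch (pure Python, purely hypothetical — not the actual univocity implementation): a streaming parser that drops comment lines by scanning forward to the next line separator has nothing to scan to when the comment character opens the final line and the file has no trailing newline.

```python
def parse(text, comment="c"):
    """Toy CSV reader that skips comment lines the way a streaming parser
    might: by scanning forward to the next line separator."""
    rows, pos = [], 0
    while pos < len(text):
        if text[pos] == comment and (pos == 0 or text[pos - 1] == "\n"):
            nl = text.find("\n", pos)
            if nl == -1:
                # The failure mode: the comment is on the last line and
                # there is no line terminator to skip to.
                raise ValueError("Unable to skip 1 lines: end of input reached")
            pos = nl + 1
            continue
        nl = text.find("\n", pos)
        line = text[pos:] if nl == -1 else text[pos:nl]
        pos = len(text) if nl == -1 else nl + 1
        if line:
            rows.append(line.split(","))
    return rows

print(parse("a,b\n1,2\ncomment line\n"))  # -> [['a', 'b'], ['1', '2']]
try:
    parse("a,b\n1,2\ncomment line")       # same data, no trailing newline
except ValueError as e:
    print(e)                              # -> Unable to skip 1 lines: end of input reached
```

With a terminated last line the comment is skipped cleanly; without the terminator the skip runs off the end of the input, which matches the symptom reported here.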

[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-20 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259157#comment-16259157
 ] 

Kumaresh C R commented on SPARK-22516:
--

[~mgaido]: Even after I replaced all 'CR LF' line endings with 'LF', the error is still thrown in the following case:

 -> When the file does not have 'LF' as the last character of its last line, i.e. there is no line terminator at EOF (all other lines in the file end with LF).

Attached the failing file 'test_file_without_eof_char.csv' for your reference.

Is this a problem with the parser, or with the input data (which has no line ending as its last character)?
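Until this is resolved, one pragmatic workaround (a hypothetical helper sketch, not part of Spark or univocity) is to make sure the file ends with a line terminator before handing it to the reader:

```python
import os
import tempfile

def ensure_trailing_newline(path):
    """Append a final '\n' if the file does not already end with one, so the
    last line (even one starting with the comment character) is terminated.
    Returns True if the file was modified."""
    with open(path, "rb") as f:
        data = f.read()
    if data and not data.endswith(b"\n"):
        with open(path, "ab") as f:
            f.write(b"\n")
        return True
    return False

# Example: a CSV whose last line starts with the comment char and has no EOL.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\ncomment line")
    path = tmp.name
print(ensure_trailing_newline(path))  # True: newline appended
print(ensure_trailing_newline(path))  # False: already terminated
os.unlink(path)
```

For files on HDFS the same check would have to be done with the Hadoop FileSystem API instead of local I/O; the principle is identical.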


[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Labels: csvparser  (was: )


[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251110#comment-16251110
 ] 

Kumaresh C R commented on SPARK-22516:
--

[~hyukjin.kwon]: Need your help here :)


[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Attachment: testCommentChar.csv


[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: (the full issue description; identical to the version quoted at the top of this thread)

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: (an earlier revision of the issue description; superseded by the version quoted at the top of this thread)

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: (an earlier revision of the issue description; superseded by the version quoted at the top of this thread)

[jira] [Updated] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-22516:
-
Description: (an earlier revision of the issue description; superseded by the version quoted at the top of this thread)

[jira] [Created] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-14 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-22516:


 Summary: CSV Read breaks: When "multiLine" = "true", if "comment" 
option is set as last line's first character
 Key: SPARK-22516
 URL: https://issues.apache.org/jira/browse/SPARK-22516
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R


Try to read the attached CSV file with the following parse properties:

scala> *val csvFile = spark.read.option("header","true").option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv");*

csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]

scala> csvFile.show
+---+---+
|  a|  b|
+---+---+
+---+---+
Notice that it works fine.
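Without multiLine, the "comment" option simply drops any input line whose first character matches the comment character. A minimal stand-in sketch of that behavior (illustrative only, not Spark's univocity code path; the sample data is hypothetical):

```python
# Sketch: a CSV reader with a comment character skips lines that begin with it,
# then parses the remaining lines one at a time (mirrors non-multiLine mode).
import csv
import io

data = "a,b\n1,2\ncomment line\n"  # hypothetical contents resembling testCommentChar.csv
comment = "c"

rows = [
    row
    for line in io.StringIO(data)
    if not line.startswith(comment)   # drop comment lines before parsing
    for row in csv.reader([line])
]
print(rows)  # [['a', 'b'], ['1', '2']] -- the line starting with 'c' is gone
```

Parsing line by line like this never has to "skip ahead" past a comment, which is why the non-multiLine path is unaffected by where the comment line sits in the file.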

If we add the option "multiLine" = "true", it fails with the exception below. This happens only when the "comment" character equals the first character of the input's last line:

scala> val csvFile = *spark.read.option("header","true").{color:#d04437}option("multiLine","true"){color}.option("inferSchema", "true").option("parserLib", "univocity").option("comment", "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of 
input reached
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=-1
Line separator detection enabled=false
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=c
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at 
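The message "Unable to skip 1 lines from line 2. End of input reached" is consistent with a line-skipping routine that advances to the next line terminator: when the commented last line is not followed by a newline, there is no terminator left to find. A hedged sketch of that failure mode (an assumption about the mechanism, not univocity's actual code):

```python
# Hypothetical "skip one line" helper: advance past the next '\n'.
# If the final (commented) line has no trailing newline, the search fails,
# mirroring "Unable to skip 1 lines ... End of input reached".
def skip_line(buf: str, pos: int) -> int:
    nl = buf.find("\n", pos)
    if nl == -1:
        raise ValueError("Unable to skip 1 lines: end of input reached")
    return nl + 1

data_ok  = "a,b\n1,2\ncomment\n"   # comment line terminated by a newline
data_bad = "a,b\n1,2\ncomment"     # last line starts with 'c', no newline at EOF

print(skip_line(data_ok, 8))        # 16 -> position just past the comment line
try:
    skip_line(data_bad, 8)
except ValueError as e:
    print(e)                        # the EOF case raises instead of skipping
```

This also explains why the attached test_file_without_eof_char.csv (no end-of-file newline) is relevant to reproducing the bug.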

[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:24 PM:
---

[~hyukjin.kwon]: Sounds great. We will wait for your proposal 
https://github.com/apache/spark/pull/18581 to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sounds great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered
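The mangled fieldNames output above (the closing quote and `s1` wrapped to the front of the line) is what a carriage return surviving inside the last column name looks like when echoed to a terminal. A small illustration, assuming the reader split the header on '\n' alone instead of on the full CR LF sequence (an assumption about the symptom, not Spark's implementation):

```python
# A CRLF-terminated header split on '\n' only leaves a trailing '\r'
# on the last column name; splitlines() handles '\r\n' correctly.
header = "Sales_Dollars,Created_Date,Order_Delivered\r\n"

naive = header.split("\n")[0].split(",")   # '\r' survives on the last field
fixed = header.splitlines()[0].split(",")  # splitlines() strips the full '\r\n'

print(naive)  # ['Sales_Dollars', 'Created_Date', 'Order_Delivered\r']
print(fixed)  # ['Sales_Dollars', 'Created_Date', 'Order_Delivered']
```

When the `'Order_Delivered\r'` string is printed, the carriage return moves the cursor back to the start of the line, so the characters after it overwrite the beginning of the output, producing exactly the scrambled `")s1: Array[String] = ...` display seen above.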



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Sounds great. We will wait for your proposal 
https://github.com/apache/spark/pull/18581 to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138405#comment-16138405
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 2:13 PM:
---

[~hyukjin.kwon]: Sounds great. We will wait for your proposal 
https://github.com/apache/spark/pull/18581 to be merged. 
Thanks a lot :)


was (Author: crkumaresh24):
[~hyukjin.kwon]: Sound great.. We will wait for your proposal 
https://github.com/apache/spark/pull/18581to be merged. 
Thanks a lot :)

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:21 PM:
---

[~hyukjin.kwon]: Could you please help us here? This issue occurs after we 
moved to "multiLine" as "true".


was (Author: crkumaresh24):
[~hyukjin.kwon]: Could you please help us here ?This issue after we moved to 
"multiLine" as "true"

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Comment Edited] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R edited comment on SPARK-21820 at 8/23/17 1:20 PM:
---

[~sowen] This is an issue with Spark's databricks-csv reading. I could not find 
any such option in the component filter. Could you please tell me what the 
proper component for this bug would be?


was (Author: crkumaresh24):
@Sean Owen: This is an issue with spark databricks-CSV reading. I could not 
find any such option in the filter. Could you please help me what could be the 
proper component for this bug ?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138327#comment-16138327
 ] 

Kumaresh C R commented on SPARK-21820:
--

@Sean Owen: This is an issue with Spark's databricks-csv reading. I could not 
find any such option in the component filter. Could you please tell me what the 
proper component for this bug would be?

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Commented] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138324#comment-16138324
 ] 

Kumaresh C R commented on SPARK-21820:
--

[~hyukjin.kwon]: Could you please help us here? This issue appeared after we 
moved to "multiLine" as "true".

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Updated] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-21820:
-
Attachment: windows_CRLF.csv

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> ---
>
> Key: SPARK-21820
> URL: https://issues.apache.org/jira/browse/SPARK-21820
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>  Labels: features
> Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If i make 
> multiLine=false, it parses properly. Could you please help here ?
> Attached the CSV used in the below commands for your reference.
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = 
> spark.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").option("parserLib", 
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: 
> string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered






[jira] [Created] (SPARK-21820) csv option "multiLine" as "true" not parsing windows line feed (CR LF) properly

2017-08-23 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-21820:


 Summary: csv option "multiLine" as "true" not parsing windows line 
feed (CR LF) properly
 Key: SPARK-21820
 URL: https://issues.apache.org/jira/browse/SPARK-21820
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Kumaresh C R


With multiLine=true, Windows CR LF line endings are not parsed properly. If I 
make multiLine=false, it parses properly. Could you please help here?

Attached the CSV used in the below commands for your reference.

scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]

scala> csvFile.schema.fieldNames
res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)

scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]

scala> csvFile.schema.fieldNames
")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered








[jira] [Updated] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2016-03-28 Thread Kumaresh C R (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumaresh C R updated SPARK-14194:
-
Description: 
We have CSV content like below,

Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
"1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municapality,","USA", "1234567"

Since there is a '\n\r' character in the middle of the row (to be exact, in the 
Address column), when we execute the Spark code below, it creates the dataframe 
with two rows (excluding the header row), which is wrong. Since we have 
specified the quote (") character, why does it treat the newline inside the 
quoted cell as a record separator? This creates an issue while processing the 
created dataframe.

 DataFrame df = sqlContextManager.getSqlContext().read()
     .format("com.databricks.spark.csv")
     .option("header", "true")
     .option("inferSchema", "true")
     .option("delimiter", delim)
     .option("quote", quote)
     .option("escape", escape)
     .load(sourceFile);
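With correct quote handling, a CSV parser treats a line break inside a quoted field as part of the field, not as a record separator. A minimal sketch of the expected behavior, using Python's csv module as a stand-in for the Spark reader (the sample data is a simplified version of the description's example):

```python
# A newline embedded in a quoted field should stay inside ONE record.
import csv
import io

data = 'Sl.NO,Address\r\n"1","XZ Street\r\nMunicipality"\r\n'
rows = list(csv.reader(io.StringIO(data)))

print(len(rows))  # 2 -> header plus one data record, despite the embedded CR LF
print(rows[1])    # the Address field keeps the line break inside it
```

This is the behavior the reporter expects from the databricks-csv reader: two rows total (header plus one record), rather than the quoted cell being split into a second record at the embedded newline.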

   


> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Kumaresh C R
>
> We have CSV content like below,
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
> Municapality,","USA", "1234567"
> Since there is a '\n\r' character in the row middle (to be exact in the 
> Address Column), when we execute the below spark code, it tries to create the 
> dataframe with two rows (excluding header row), which is wrong. Since we have 
> specified delimiter as quote (") character , why it takes the middle 
> character as newline character ? This creates an issue while processing the 
> created dataframe.
>  DataFrame df = 
> sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", delim)
> .option("quote", quote)
> .option("escape", escape)
> .load(sourceFile);
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2016-03-28 Thread Kumaresh C R (JIRA)
Kumaresh C R created SPARK-14194:


 Summary: spark csv reader not working properly if CSV content 
contains CRLF character (newline) in the intermediate cell
 Key: SPARK-14194
 URL: https://issues.apache.org/jira/browse/SPARK-14194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Kumaresh C R





