[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384633#comment-16384633 ]

Hyukjin Kwon commented on SPARK-21289:
--------------------------------------

I am going to make this an umbrella to split the PR up.

> Text and CSV formats do not support custom end-of-line delimiters
> -----------------------------------------------------------------
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Yevgen Galchenko
> Priority: Minor
>
> Spark csv and text readers always use default CR, LF or CRLF line terminators
> without an option to configure a custom delimiter.
> Option "textinputformat.record.delimiter" is not being used to set delimiter
> in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile()
> is used to read file.
> Possible solution would be to change HadoopFileLinesReader and create
> LineRecordReader with delimiters specified in configuration. LineRecordReader
> already supports passing recordDelimiter in its constructor.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
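To illustrate the behaviour the issue description asks for, here is a minimal sketch in plain Scala with no Spark or Hadoop dependency. The object {{RecordDelimiterSketch}} and its {{splitRecords}} helper are hypothetical names used only for illustration; this is not how {{HadoopFileLinesReader}} or {{LineRecordReader}} is implemented. It assumes the whole input fits in a string: with no delimiter set it falls back to CR, LF or CRLF, and with one set it splits only on that delimiter.

```scala
import java.util.regex.Pattern

// Hypothetical illustration only; not Spark or Hadoop code.
object RecordDelimiterSketch {
  // Split `text` into records.
  // lineSep = None      -> default terminators: CRLF, LF or CR.
  // lineSep = Some(sep) -> split on exactly `sep`, treating it literally.
  def splitRecords(text: String, lineSep: Option[String]): List[String] = {
    val parts = lineSep match {
      case Some(sep) =>
        require(sep.nonEmpty, "the record delimiter must not be empty")
        // Pattern.quote keeps characters such as '|' from acting as regex syntax.
        text.split(Pattern.quote(sep), -1).toList
      case None =>
        // CRLF is listed first so that it is consumed as a single terminator.
        text.split("\r\n|\n|\r", -1).toList
    }
    // A terminating delimiter yields one empty trailing record; drop it.
    if (parts.nonEmpty && parts.last.isEmpty) parts.init else parts
  }
}
```

With a custom delimiter such as {{|}}, an embedded {{\n}} stays inside the record, which is the case the default terminators cannot express.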
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081136#comment-16081136 ]

Andrew Ash commented on SPARK-21289:
------------------------------------

Looks like this will fix SPARK-17227 also
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079905#comment-16079905 ]

Apache Spark commented on SPARK-21289:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18581
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074294#comment-16074294 ]

Wenchen Fan commented on SPARK-21289:
-------------------------------------

SGTM
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074186#comment-16074186 ]

Hyukjin Kwon commented on SPARK-21289:
--------------------------------------

Just to be sure,

{code}
curl -O https://raw.githubusercontent.com/HyukjinKwon/spark/264a1dc603164bd264e0c084608f31ffb8ad5f69/sql/core/src/test/resources/cars_utf-16.csv
file -I cars_utf-16.csv
{code}

{code}
cars_utf-16.csv: text/plain; charset=utf-16be
{code}
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074185#comment-16074185 ]

Hyukjin Kwon commented on SPARK-21289:
--------------------------------------

I guess it is not entirely impossible given my past investigation (IIRC), but it needs a careful look. We could use the default (e.g., {{LineRecordReader}} -> {{\n}} and {{\r\n}}) if the option is not set, and use the given value if it is set. Probably, {{CommonSettings.setLineSeparator(...)}} will also be required for the {{multiLine}} option. I tried my best to describe the current behaviour [here|https://github.com/apache/spark/pull/18304#discussion_r122142421] (it should be double-checked).

One of the annoying parts of supporting this is {{encoding}} in CSV, which requires a [weird code path|https://github.com/apache/spark/blob/9f6b3e65ccfa0daec31b58c5a6386b3a890c2149/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L182-L196], and I believe we should encode the newline accordingly to support this case. So, could we support this except for the case above? I guess it would be less tricky if we get rid of this case.

I personally think we should rather deprecate the {{encoding}} option in CSV, which does not work correctly with non-ASCII-compatible encodings \[1\] or with the {{multiLine}} option \[2\]. Also, I guess it does not work with the {{ignoreCorruptFiles}} SQL option in schema inference, as it does not use {{FileScanRDD}} (see SPARK-19885). Otherwise, we could just describe this behaviour correctly as a safe choice.

{quote}
could we support this case except the case above? I guess it would be less tricky and feasible if we get rid of this case.
{quote}

{quote}
deprecate {{encoding}} option in CSV ... Otherwise, we could just describe this behaviour correctly as a safe choice.
{quote}

cc [~cloud_fan], [~maropu] What do you think about ^?
\[1\]

{code}
curl -O https://raw.githubusercontent.com/HyukjinKwon/spark/264a1dc603164bd264e0c084608f31ffb8ad5f69/sql/core/src/test/resources/cars_utf-16.csv
{code}

{code}
scala> spark.read.option("encoding", "utf-16").option("header", true).csv("cars_utf-16.csv").show()
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
+----+-----+-----+--------------------+------+
{code}

\[2\]

{code}
scala> spark.read.option("multiLine", true).option("encoding", "utf-16").option("header", true).csv("cars_utf-16.csv").show()
+------+----+-----+-------+-----+
|��year|make|model|comment|blank|
+------+----+-----+-------+-----+
+------+----+-----+-------+-----+
{code}
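One possible explanation for the stray replacement characters in \[1\], sketched in plain Scala: in UTF-16BE a line feed is the two bytes 0x00 0x0A, so a reader that splits on the single byte 0x0A leaves a dangling 0x00 at the end of each line. The object {{NewlineEncodingSketch}} and its helpers are hypothetical names, and the assumption that the line reader splits on the raw LF byte is mine; this is not Spark's actual code path.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical illustration only; not Spark's implementation.
object NewlineEncodingSketch {
  // Bytes that a separator occupies in a given charset.
  def separatorBytes(sep: String, charset: Charset): Array[Byte] =
    sep.getBytes(charset)

  // Naive byte-level split on a single LF byte (0x0A), ignoring the encoding.
  def splitOnLfByte(data: Array[Byte]): List[Array[Byte]] = {
    val idx = data.indexOf(0x0A.toByte)
    if (idx < 0) List(data)
    else data.take(idx) :: splitOnLfByte(data.drop(idx + 1))
  }

  // Decode a chunk as UTF-16BE; malformed input becomes U+FFFD.
  def decodeUtf16Be(chunk: Array[Byte]): String =
    new String(chunk, StandardCharsets.UTF_16BE)
}

object NewlineEncodingDemo extends App {
  import NewlineEncodingSketch._
  // "a\nb" in UTF-16BE is 00 61 00 0A 00 62.
  val data = "a\nb".getBytes(StandardCharsets.UTF_16BE)
  val chunks = splitOnLfByte(data)
  // The first chunk keeps the high byte of the newline (00 61 00), an odd length.
  chunks.foreach(c => println(s"${c.length} bytes -> ${decodeUtf16Be(c)}"))
}
```

Splitting on the encoded separator returned by {{separatorBytes}} instead of the raw LF byte would keep each chunk aligned to whole code units, which is one reading of "encode the newline accordingly" above.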
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073112#comment-16073112 ]

Hyukjin Kwon commented on SPARK-21289:
--------------------------------------

All the related information is in the JIRA. Initially, SPARK-21098 was a duplicate of this, but I suggested turning it into the one that fixes the line delimiter.