[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074185#comment-16074185 ]
Hyukjin Kwon edited comment on SPARK-21289 at 7/5/17 3:00 AM:
--------------------------------------------------------------

I guess it is not entirely impossible given my past investigation (IIRC), but it needs a careful look. We could use the default (e.g., {{LineRecordReader}} -> {{\n}} and {{\r\n}}) if the separator is not set, and use the given one if it is set. {{CommonSettings.setLineSeparator(...)}} will probably also be required for the {{multiLine}} option. I tried my best to describe the current behaviour [here|https://github.com/apache/spark/pull/18304#discussion_r122142421] (this should be double-checked).

One of the annoying parts of supporting this is {{encoding}} in CSV, which requires a [weird code path|https://github.com/apache/spark/blob/9f6b3e65ccfa0daec31b58c5a6386b3a890c2149/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L182-L196]; I believe we would have to encode the newline accordingly to support this case.

So, could we support the configurable newline except for the case above? I guess it would be less tricky if we got rid of this case.

I personally think we should rather deprecate the {{encoding}} option in CSV, which does not work correctly with non-ASCII-compatible encodings \[1\] or with the {{multiLine}} option \[2\]. Also, I guess it does not work with the {{ignoreCorruptFiles}} SQL option in schema inference, as that path does not use {{FileScanRDD}} (see SPARK-19885). Otherwise, we could just describe this behaviour correctly as a safe choice.

{quote}
could we support the configurable newline except the case above? I guess it would be less tricky and feasible if we get rid of this case.
{quote}
{quote}
deprecate {{encoding}} option in CSV ... Otherwise, we could just describe this behaviour correctly as a safe choice.
{quote}

cc [~cloud_fan], [~maropu] What do you think about ^?
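To make the "encode the newline accordingly" point concrete, here is a minimal plain-Scala sketch (no Spark involved). It only illustrates one plausible way a byte-oriented line reader can misalign UTF-16 data; it is not Spark's or Hadoop's actual code, and {{NewlineEncodingSketch}}, {{newlineBytes}}, {{hex}} and {{afterFirst}} are hypothetical names made up for this illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object NewlineEncodingSketch {
  /** The bytes that encode "\n" in the given charset. */
  def newlineBytes(cs: Charset): Array[Byte] = "\n".getBytes(cs)

  /** Space-separated lowercase hex, for inspection only. */
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  /** Everything after the first occurrence of `sep` in `data` (empty if absent). */
  def afterFirst(data: Array[Byte], sep: Array[Byte]): Array[Byte] = {
    val at = data.toSeq.indexOfSlice(sep.toSeq)
    if (at < 0) Array.emptyByteArray else data.drop(at + sep.length)
  }

  def main(args: Array[String]): Unit = {
    val cs = StandardCharsets.UTF_16LE
    println(hex(newlineBytes(StandardCharsets.UTF_8)))    // 0a
    println(hex(newlineBytes(cs)))                        // 0a 00
    println(hex(newlineBytes(StandardCharsets.UTF_16BE))) // 00 0a

    val data = "year,make\n2012,Tesla".getBytes(cs)

    // Splitting on the single byte 0x0a leaves the 0x00 half of the newline
    // at the front of the next record, so that record decodes misaligned.
    val naive = afterFirst(data, Array(0x0a.toByte))
    // Splitting on the newline as encoded in the file's charset keeps it aligned.
    val aligned = afterFirst(data, newlineBytes(cs))

    println(new String(naive, cs) == "2012,Tesla")   // false
    println(new String(aligned, cs) == "2012,Tesla") // true
  }
}
```

The sketch uses {{UTF_16LE}}/{{UTF_16BE}} rather than {{UTF_16}} because encoding with the latter also emits a byte order mark, which would hide the point being shown.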
\[1\]
{code}
curl -O https://raw.githubusercontent.com/HyukjinKwon/spark/264a1dc603164bd264e0c084608f31ffb8ad5f69/sql/core/src/test/resources/cars_utf-16.csv
{code}
{code}
scala> spark.read.option("encoding", "utf-16").option("header", true).csv("cars_utf-16.csv").show()
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
+----+-----+-----+--------------------+------+
{code}

\[2\]
{code}
scala> spark.read.option("multiLine", true).option("encoding", "utf-16").option("header", true).csv("cars_utf-16.csv").show()
+-----------+---------+-----------+---------------+-----------+
|��year|make|model|comment|blank|
+-----------+---------+-----------+---------------+-----------+
+-----------+---------+-----------+---------------+-----------+
{code}

> Text and CSV formats do not support custom end-of-line delimiters
> -----------------------------------------------------------------
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Yevgen Galchenko
> Priority: Minor
>
> Spark csv and text readers always use default CR, LF or CRLF line terminators
> without an option to configure a custom delimiter.
> Option "textinputformat.record.delimiter" is not being used to set delimiter
> in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile()
> is used to read file.
> Possible solution would be to change HadoopFileLinesReader and create
> LineRecordReader with delimiters specified in configuration. LineRecordReader
> already supports passing recordDelimiter in its constructor.
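For readers who want to see what "a reader with a configurable record delimiter" amounts to, below is a rough plain-Scala sketch of the idea. It is not {{HadoopFileLinesReader}} or {{LineRecordReader}}, and {{DelimitedRecordsSketch}}, {{splitRecords}} and {{readRecords}} are hypothetical names used only here. It assumes the delimiter is encoded in the same charset as the data before the byte-level split.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.util.Arrays
import scala.collection.mutable.ArrayBuffer

object DelimitedRecordsSketch {
  /** Splits `data` on every occurrence of `delimiter`; a trailing empty record is dropped. */
  def splitRecords(data: Array[Byte], delimiter: Array[Byte]): Seq[Array[Byte]] = {
    require(delimiter.nonEmpty, "delimiter must not be empty")
    val records = ArrayBuffer.empty[Array[Byte]]
    var start = 0
    var i = 0
    while (i <= data.length - delimiter.length) {
      if (Arrays.equals(data.slice(i, i + delimiter.length), delimiter)) {
        records += data.slice(start, i)
        i += delimiter.length
        start = i
      } else {
        i += 1
      }
    }
    if (start < data.length) records += data.slice(start, data.length)
    records.toSeq
  }

  /** Encodes the separator in the file's charset, splits on it, then decodes each record. */
  def readRecords(data: Array[Byte], charset: Charset, lineSep: String): Seq[String] =
    splitRecords(data, lineSep.getBytes(charset)).map(bytes => new String(bytes, charset))

  def main(args: Array[String]): Unit = {
    val cs = StandardCharsets.UTF_16LE
    val data = "2012,Tesla|1997,Ford|2015,Chevy".getBytes(cs)
    // Uses "|" as a custom record separator instead of a newline.
    readRecords(data, cs, "|").foreach(println)
  }
}
```

One caveat of this sketch: it matches the delimiter at any byte offset, so for fixed-width encodings such as UTF-16 a real reader would also have to respect code-unit alignment, and it reads everything into memory rather than streaming over file splits.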