[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481655#comment-16481655 ]
Ondrej Kokes commented on SPARK-22236:
--------------------------------------

There is one more setting that puts Spark's default CSV parser in violation of RFC 4180: multiLine. It defaults to false, so newlines are treated as row separators, even though RFC 4180 allows newlines inside fields as long as the field is enclosed in double quotes (a short sketch illustrating this is appended at the end of this message). Sadly, I think setting multiLine to true by default is less feasible than changing the escape setting, because multiLine=false keeps the parser easily parallelisable while still parsing the majority of CSV data correctly. But combined with mode=PERMISSIVE, this setting makes the default parser a landmine.

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with
> a backslash by default. However, the appropriate behaviour as set out by RFC
> 4180 (and adhered to by many software packages) is to escape using a second
> double quote.
>
> This piece of Python code demonstrates the issue:
> {code}
> import csv
>
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1, 2])
>
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
>
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
>
> # explicitly stating the escape character fixes the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading back a file written by Spark may
> result in garbage:
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv')  # reading the file correctly
> df.write.format("csv").save('testout.csv')
>
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
> where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would
> be useful if Spark had sensible defaults that conform to the above-mentioned
> RFC (as well as W3C recommendations). I realise this would be a breaking
> change, and thus, if accepted, it would probably need to result in a warning
> first before moving to a new default.
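A minimal sketch of the multiLine behaviour mentioned in the comment above. The file name and the SparkSession variable `spark` are illustrative assumptions; the option names (header, multiLine, escape) are standard CSV reader options:

{code}
import csv

# Write a small RFC 4180-compliant file with a newline embedded in a quoted
# field (the csv module quotes the field because it contains a newline).
with open('multiline.csv', 'w', newline='') as f:  # illustrative file name
    w = csv.writer(f)
    w.writerow(['id', 'note'])
    w.writerow([1, 'first line\nsecond line'])

# Default settings (multiLine=false): the newline inside the quoted field is
# treated as a record separator, so the single logical row is split in two.
spark.read.option('header', 'true').csv('multiline.csv').show()

# multiLine=true (here together with escape='"') parses the file per
# RFC 4180, keeping the newline inside the single field.
(spark.read
    .option('header', 'true')
    .option('multiLine', 'true')
    .option('escape', '"')
    .csv('multiline.csv')
    .show())
{code}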