[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199245#comment-16199245 ]
Sean Owen commented on SPARK-22236:
-----------------------------------

Interesting, because the Univocity parser internally seems to default to RFC4180 settings. But the Spark implementation default overrides this with a default of {{\}}. [~hyukjin.kwon] was that for backwards compatibility with previous implementations?

In any event I'm not sure we'd change the default behavior on this side of Spark 3.x, but, you can easily configure the writer to use double-quote for escape.

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with
> a backslash by default. However, the appropriate behaviour as set out by RFC
> 4180 (and adhered to by many software packages) is to escape using a second
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> While it's possible to work with CSV files in a "compatible" manner, it would
> be useful if Spark had sensible defaults that conform to the above-mentioned
> RFC (as well as W3C recommendations). I realise this would be a breaking
> change and thus if accepted, it would probably need to result in a warning
> first, before moving to a new default.
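
The escape-character mismatch discussed in this thread can be reproduced without a Spark session. The following is a minimal sketch using only Python's standard csv module. The names RFC4180, BACKSLASH, write_csv and read_csv are hypothetical, introduced here for illustration; BACKSLASH only approximates a backslash-escaping writer and is not Spark's actual implementation. The final comment shows the Spark-side setting the comment refers to, on the assumption that the 'escape' option is honoured on write the same way the report shows it working on read.

```python
import csv
import io

# Sample rows taken from the report (the numeric row is omitted for brevity).
ROWS = [
    ['a 2.5" drive', 'another column'],
    ['a "quoted" string', '"quoted"'],
]

# Hypothetical dialect settings for illustration:
# RFC4180 escapes a quote by doubling it; BACKSLASH escapes it with a backslash.
RFC4180 = {"doublequote": True}
BACKSLASH = {"doublequote": False, "escapechar": "\\", "quoting": csv.QUOTE_ALL}


def write_csv(rows, **fmt):
    """Serialise rows to a CSV string using the given dialect settings."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerows(rows)
    return buf.getvalue()


def read_csv(text, **fmt):
    """Parse a CSV string back into a list of rows using the given settings."""
    return list(csv.reader(io.StringIO(text, newline=""), **fmt))


# Writer and reader agree on the escape convention: the data survives.
same_rfc = read_csv(write_csv(ROWS, **RFC4180), **RFC4180)
same_backslash = read_csv(write_csv(ROWS, **BACKSLASH), **BACKSLASH)

# Writer escapes with a backslash, reader expects doubled quotes: garbage.
garbled = read_csv(write_csv(ROWS, **BACKSLASH), **RFC4180)

print(same_rfc == ROWS, same_backslash == ROWS, garbled == ROWS)
print(garbled[0])

# Spark-side equivalent suggested in the comment (assumes a DataFrame `df`;
# not executed here):
#   df.write.option('escape', '"').csv('testout.csv')
```

The point of the sketch is that neither convention is broken in isolation; the data is only corrupted when the writer and the reader disagree, which is why setting the escape character explicitly on both the read and the write side works around the default.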