[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199245#comment-16199245 ]

Sean Owen commented on SPARK-22236:
-----------------------------------

Interesting, because the Univocity parser internally seems to default to 
RFC 4180 settings, but the Spark implementation overrides this with a default 
escape character of {{\}}. [~hyukjin.kwon] was that for backwards compatibility 
with previous implementations? In any event, I'm not sure we'd change the 
default behavior on this side of Spark 3.x, but you can easily configure the 
writer to use the double quote as the escape character.
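
A minimal sketch, assuming an active {{spark}} session and a {{df}} to write 
(the output path is a placeholder):

{code}
# escape embedded double quotes by doubling them, per RFC 4180
df.write.option('escape', '"').csv('rfc-output.csv')
{code}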

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape them with a 
> second double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
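> # Spark's default backslash escaping leaves the doubled quotes unresolved: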
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixes the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading back a file written by Spark may 
> result in garbage:
> {code}
> # read the file correctly, as above
> df = spark.read.option('escape', '"').csv('testfile.csv')
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
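> For completeness, writing with the escape character set explicitly produces 
> output that Python's csv module reads back intact (a sketch; the output and 
> part-file paths are placeholders):
> {code}
> df.write.option('escape', '"').csv('testout2.csv')
> with open('testout2.csv/part-....csv') as f:
>     print(next(csv.reader(f)))
> # ['a 2.5" drive', 'another column']
> {code}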
> While it's possible to work with CSV files in an RFC-compatible manner by 
> setting these options explicitly, it would be useful if Spark had sensible 
> defaults that conform to the above-mentioned RFC (as well as W3C 
> recommendations). I realise this would be a breaking change, so if accepted 
> it would probably need to be preceded by a deprecation warning before moving 
> to the new default.


