[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199825#comment-16199825 ]

Hyukjin Kwon edited comment on SPARK-22236 at 10/11/17 5:47 AM:
----------------------------------------------------------------

Yup, what is said ^ looks all correct. I think it'd be better to match all the 
defaults to Univocity's in the (far) future, if possible and if there were no 
cost at all to changing them (at least partly because it has sometimes been 
difficult to prepare a reproducer and report a bug to Univocity, and I'd 
carefully guess the Univocity author feels similarly, judging from a few 
comments).

They happen to have been set differently in the first place. At a quick look, 
the root cause is that {{escape}} defaults to {{\}} in Spark, whereas the 
Univocity parser defaults to {{"}}.
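
For reference, a minimal sketch of the option-level workaround (assuming an 
active {{SparkSession}} named {{spark}} and the {{testfile.csv}} from the 
description below): explicitly setting {{escape}} to {{"}} matches the 
Univocity default and RFC 4180.

{code}
# Spark's CSV source defaults to escape='\'; Univocity defaults to escape='"'.
# Overriding the option aligns the two (and RFC 4180):
df = spark.read.option('escape', '"').csv('testfile.csv')
{code}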


> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue:
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1,2])
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> # read the file correctly by overriding the escape character
> df = spark.read.option('escape', '"').csv('testfile.csv')
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
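> For completeness, a minimal sketch of the write-side workaround (reusing 
> {{df}} from above; {{testout2.csv}} is a hypothetical output path): setting 
> {{escape}} on the writer as well should yield RFC 4180-compliant output.
> {code}
> # write with RFC 4180-style quoting by overriding the default escape character
> df.write.option('escape', '"').csv('testout2.csv')
> # reading a part file back with Python's csv module should now round-trip:
> # ['a 2.5" drive', 'another column']
> # ['a "quoted" string', '"quoted"']
> {code}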
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change, so if accepted it would probably need to be introduced with a warning 
> first, before moving to the new default.


