[ 
https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Moritz updated SPARK-20155:
--------------------------------
    Description: 
According to:
https://tools.ietf.org/html/rfc4180#section-2

7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

This currently works as is, but the following does not:

 "aaa","b""b,b","ccc"

while  "aaa","b\"b,b","ccc" does get parsed.

I assume this happens because quotes are currently parsed in pairs, and that 
somehow ends up unquoting the delimiter.

Edit: So future readers don't have to dive into the comments: a workaround (as 
of Spark 2.0) is to explicitly declare the escape character to be a double 
quote: (read.csv.option("escape","\""))

I argue that this should be the default setting (or at least the default 
setting should be compatible with the RFC). Related work on how to properly 
escape an escape character that ambiguously escapes a quote would be a good 
vehicle for this change.
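For comparison, Python's standard-library csv module (not Spark) follows RFC 4180 by default and handles a doubled quote followed by a delimiter correctly; a minimal sketch of the expected behavior:

```python
import csv
import io

# RFC 4180 input: the doubled quote "" inside the second field is an
# escaped quote, and the comma after it belongs to the field value,
# not to the record structure.
line = '"aaa","b""b,b","ccc"\n'

# csv.reader defaults to doublequote=True, i.e. the RFC 4180 convention
# of escaping a quote by doubling it (no backslash escape character).
rows = list(csv.reader(io.StringIO(line)))
print(rows)  # [['aaa', 'b"b,b', 'ccc']]
```

This is the parse Spark should produce for the failing input above once the escape character defaults to a double quote.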

  was:
According to:
https://tools.ietf.org/html/rfc4180#section-2

7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

This currently works as is, but the following does not:

 "aaa","b""b,b","ccc"

while  "aaa","b\"b,b","ccc" does get parsed.

I assume this happens because quotes are currently parsed in pairs, and that 
somehow ends up unquoting the delimiter.

Edit: So future readers don't have to dive into the comments: a workaround (as 
of Spark 2.0) is to explicitly declare the escape character to be a double 
quote: (read.csv.option("escape","\""))


> CSV-files with quoted quotes can't be parsed, if delimiter follows quoted 
> quote
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-20155
>                 URL: https://issues.apache.org/jira/browse/SPARK-20155
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.0.0
>            Reporter: Rick Moritz
>
> According to:
> https://tools.ietf.org/html/rfc4180#section-2
> 7.  If double-quotes are used to enclose fields, then a double-quote
>        appearing inside a field must be escaped by preceding it with
>        another double quote.  For example:
>        "aaa","b""bb","ccc"
> This currently works as is, but the following does not:
>  "aaa","b""b,b","ccc"
> while  "aaa","b\"b,b","ccc" does get parsed.
> I assume this happens because quotes are currently parsed in pairs, and 
> that somehow ends up unquoting the delimiter.
> Edit: So future readers don't have to dive into the comments: a workaround 
> (as of Spark 2.0) is to explicitly declare the escape character to be a 
> double quote: (read.csv.option("escape","\""))
> I argue that this should be the default setting (or at least the default 
> setting should be compatible with the RFC). Related work on how to properly 
> escape an escape character that ambiguously escapes a quote would be a good 
> vehicle for this change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
