[ https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491170#comment-17491170 ]
Marnix van den Broek commented on SPARK-38167:
----------------------------------------------

After getting some help from the community navigating the Spark codebase and testing the same example in the univocity CSV parser, I can confirm this is actually a bug in the univocity CSV parser. I filed a bug report with them and will update this issue with the status as soon as I know more.

> CSV parsing error when using escape='"'
> ----------------------------------------
>
>                 Key: SPARK-38167
>                 URL: https://issues.apache.org/jira/browse/SPARK-38167
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.2.1
>         Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 cluster.
>            Reporter: Marnix van den Broek
>            Priority: Major
>              Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
> # reading a comma-separated, double-quote-quoted CSV file using the csv reader options _escape='"'_ and {_}header=True{_},
> # with a row containing a quoted empty field,
> # followed by a quoted field starting with a comma and followed by one or more characters,
> selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
> {code:java}
> col1,col2
> "",",a"
> {code}
> using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>
> +----+----+
> |col1|col2|
> +----+----+
> |null| ,a|
> +----+----+
> {code}
> Spark does yield this result, so far so good.
> However, when I select col2 from the dataframe, Spark yields an incorrect result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>
> +----+
> |col2|
> +----+
> | a"|
> +----+
> {code}
> If you run this example with more columns in the file, and more commas in the field, e.g. ",,,,,,,a", the problem compounds, as Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency between the two methods surprised me, as it implies the parsing is evaluated differently for each. I expect the bug to be located in the quote-balancing and un-escaping methods of the CSV parser, but I can't find where that code is located in the code base. I'd be happy to take a look at it if anyone can point me where it is.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
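As a cross-check on the expected behavior described above, here is a minimal sketch (mine, not part of the original report) that parses the same two-line file with Python's standard-library csv module, which follows RFC 4180-style quoting: commas inside a quoted field are literal, so the second row should always split into exactly two fields, one empty and one equal to ",a".

```python
import csv
import io

# The reproduction file from the report: a quoted empty field followed by
# a quoted field that begins with a comma.
data = 'col1,col2\n"",",a"\n'

# An RFC 4180-style parser treats commas inside quotes as data, never as
# delimiters, so each row yields exactly two fields.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['col1', 'col2'], ['', ',a']]
```

This matches the result the reporter expects from Spark's .show() and illustrates why the shifted values seen after .select('col2') point to a quote-handling bug in the underlying parser rather than a malformed input file.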