[ https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491170#comment-17491170 ]
Marnix van den Broek commented on SPARK-38167:
----------------------------------------------

After getting some help from the community navigating the Spark codebase and testing the same example in the univocity CSV parser, I can confirm this is actually a bug in the univocity CSV parser. I filed a bug report with them and will update this issue with the status as soon as I know more.

> CSV parsing error when using escape='"'
> ----------------------------------------
>
>                 Key: SPARK-38167
>                 URL: https://issues.apache.org/jira/browse/SPARK-38167
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.2.1
>         Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 cluster.
>            Reporter: Marnix van den Broek
>            Priority: Major
>              Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
> # reading a comma-separated, double-quote-quoted CSV file using the csv reader options _escape='"'_ and {_}header=True{_},
> # with a row containing a quoted empty field,
> # followed by a quoted field starting with a comma and followed by one or more characters,
> selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
> {code:java}
> col1,col2
> "",",a"
> {code}
> using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>
> +----+----+
> |col1|col2|
> +----+----+
> |null| ,a|
> +----+----+
> {code}
> Spark does yield this result, so far so good.
> However, when I select col2 from the dataframe, Spark yields an incorrect result:
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>
> +----+
> |col2|
> +----+
> | a"|
> +----+
> {code}
> If you run this example with more columns in the file, and more commas in the field, e.g. ",,,,,,,a", the problem compounds, as Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency between the two methods surprised me, as it implies the parsing is evaluated differently for each. I expect the bug to be located in the quote-balancing and un-escaping methods of the CSV parser, but I can't find where that code is located in the code base. I'd be happy to take a look at it if anyone can point me where it is.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
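As a cross-check on the expected behavior described above, here is a minimal sketch (mine, not part of the original report) that parses the same two-line file with Python's standard-library csv module, which follows RFC 4180-style quoting: commas inside a quoted field are literal, so the second row should always split into exactly two fields, one empty and one equal to ",a".

```python
import csv
import io

# The reproduction file from the report: a quoted empty field followed by
# a quoted field that begins with a comma.
data = 'col1,col2\n"",",a"\n'

# An RFC 4180-style parser treats commas inside quotes as data, never as
# delimiters, so each row yields exactly two fields.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['col1', 'col2'], ['', ',a']]
```

This matches the result the reporter expects from Spark's .show() and illustrates why the shifted values seen after .select('col2') point to a quote-handling bug in the underlying parser rather than a malformed input file.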