[jira] [Updated] (SPARK-38167) CSV parsing error when using escape='"'
[ https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marnix van den Broek updated SPARK-38167:
-----------------------------------------
    Description: 
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
# reading a comma-separated, double-quote-quoted CSV file using the csv reader options _escape='"'_ and {_}header=True{_},
# with a row containing a quoted empty field,
# followed by a quoted field starting with a comma and followed by one or more characters,

selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{code:java}
col1,col2
"",",a"
{code}
using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
{code:java}
spark.read.csv(path, escape='"', header=True).show()
+----+----+
|col1|col2|
+----+----+
|null|  ,a|
+----+----+
{code}
Spark does yield this result, so far so good. However, when I select col2 from the dataframe, Spark yields an incorrect result:
{code:java}
spark.read.csv(path, escape='"', header=True).select('col2').show()
+----+
|col2|
+----+
|  a"|
+----+
{code}
If you run this example with more columns in the file and more commas in the field, e.g. ",,,a", the problem compounds, as Spark shifts many values to the right, causing unexpected and incorrect results.

The inconsistency between the two methods surprised me, as it implies the parsing is evaluated differently for each. I expect the bug to be located in the quote-balancing and un-escaping methods of the CSV parser, but I can't find where that code is located in the code base. I'd be happy to take a look at it if anyone can point me to it.


was:
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
# reading a comma-separated, double-quote-quoted CSV file using the csv reader options _escape='"'_ and {_}header=True{_},
# with a row containing a quoted empty field,
# followed by a quoted field starting with a comma and followed by one or more characters,

selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.

{*}In detail{*}:

When I instruct Spark to read this CSV file:
{quote}col1,col2
{{"",",a"}}
{quote}
using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
{quote}spark.read.csv(path, escape='"', header=True).show()
|*col1*|*col2*|
|null|,a|
{quote}
Spark does yield this result, so far so good. However, when I select col2 from the dataframe, Spark yields an incorrect result:
{quote}spark.read.csv(path, escape='"', header=True).select('col2').show()
|*col2*|
|a"|
{quote}
If you run this example with more columns in the file and more commas in the field, e.g. ",,,a", the problem compounds, as Spark shifts many values to the right, causing unexpected and incorrect results.

The inconsistency between the two methods surprised me, as it implies the parsing is evaluated differently for each. I expect the bug to be located in the quote-balancing and un-escaping methods of the CSV parser, but I can't find where that code is located in the code base. I'd be happy to take a look at it if anyone can point me to it.
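For anyone who wants to try this locally, here is a minimal, self-contained reproduction sketch of the steps above. The local SparkSession setup and the temporary file path are illustrative assumptions, not part of the report itself.

{code:python}
# Reproduction sketch for SPARK-38167 (assumes a local PySpark installation;
# the SparkSession config and temp path below are illustrative, not from the report).
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("spark-38167-repro").getOrCreate()

# Write the two-line CSV file from the report to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "repro.csv")
with open(path, "w") as f:
    f.write('col1,col2\n"",",a"\n')

df = spark.read.csv(path, escape='"', header=True)

df.show()                 # expected and observed: col1 = null, col2 = ",a"
df.select("col2").show()  # per the report, this instead prints 'a"' (values shifted right)
{code}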
> CSV parsing error when using escape='"'
> ----------------------------------------
>
>              Key: SPARK-38167
>              URL: https://issues.apache.org/jira/browse/SPARK-38167
>          Project: Spark
>       Issue Type: Bug
>       Components: PySpark, Spark Core
> Affects Versions: 3.2.1
>      Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 cluster.
>         Reporter: Marnix van den Broek
>         Priority: Major
>           Labels: correctness, csv, csvparser, data-integrity
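As a reference point (this comparison is mine, not from the report), Python's standard-library csv module, whose default dialect treats a doubled quote inside a quoted field the same way the escape='"' option does, parses the row into exactly the values the reporter expects:

{code:python}
# Sanity check with Python's stdlib csv parser (illustration only, not part of the report).
import csv
import io

data = 'col1,col2\n"",",a"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['col1', 'col2'], ['', ',a']] -> col2 is ',a', matching the expected Spark output
{code}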