[jira] [Updated] (SPARK-38167) CSV parsing error when using escape='"'

2022-02-09 Thread Marnix van den Broek (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marnix van den Broek updated SPARK-38167:
-
Description: 
hi all,

When reading CSV files with Spark, I ran into a parsing bug.

{*}The summary{*}:

When
 # reading a comma-separated, double-quote quoted CSV file using the CSV reader 
options _escape='"'_ and {_}header=True{_},
 # with a row containing a quoted empty field,
 # followed by a quoted field starting with a comma and one or more further 
characters,

selecting columns from the dataframe at or after the field described in 3) 
gives incorrect and inconsistent results.

{*}In detail{*}:

When I instruct Spark to read this CSV file:

 
{code:java}
col1,col2
"",",a"
{code}
 

using the CSV reader options escape='"' (unnecessary for the example, necessary 
for the files I'm processing) and header=True, I expect the following result:

 
{code:java}
spark.read.csv(path, escape='"', header=True).show()

+----+----+
|col1|col2|
+----+----+
|null|  ,a|
+----+----+
{code}
 
Spark does yield this result, so far so good. However, when I select col2 from 
the dataframe, Spark returns an incorrect result:

 
{code:java}
spark.read.csv(path, escape='"', header=True).select('col2').show()

+----+
|col2|
+----+
|  a"|
+----+
{code}
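
For completeness, here is a self-contained sketch of the repro. The temp-file 
path and the SparkSession setup are illustrative assumptions on my part, not 
taken from the job where I hit this:

{code:python}
# Minimal end-to-end repro sketch. The path and SparkSession setup are
# assumptions for illustration; any local path works in single-node mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/tmp/spark-38167-example.csv"  # hypothetical location
with open(path, "w") as f:
    f.write('col1,col2\n"",",a"\n')

df = spark.read.csv(path, escape='"', header=True)
df.show()                 # full read:     |null|  ,a|  (correct)
df.select('col2').show()  # column select: |  a"|       (incorrect)
{code}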
 
If you run this example with more columns in the file and more commas in the 
field, e.g. ",,,a", the problem compounds: Spark shifts many values to 
the right, causing unexpected and incorrect results. The inconsistency between 
the two calls surprised me, as it implies the row is parsed differently 
depending on which columns are selected.
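
A hedged sketch of such a wider case is below (again with an assumed temp path; 
the exact shifted output varies, so the comments only describe the behaviour 
reported above):

{code:python}
# Wider example: more columns, and a quoted field with several leading commas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

wide_path = "/tmp/spark-38167-wide.csv"  # hypothetical location
with open(wide_path, "w") as f:
    f.write('col1,col2,col3,col4\n"",",,,a","b","c"\n')

wide = spark.read.csv(wide_path, escape='"', header=True)
wide.show()                         # full read displayed correctly in the simple case above
wide.select('col2', 'col4').show()  # selected columns: values shift right, per this report
{code}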

I expect the bug to sit in the quote-balancing and un-escaping logic of the 
CSV parser, but I can't find where that code lives in the code base. I'd be 
happy to take a look at it if anyone can point me to it.


> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
