[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

Martin Rueckl (Jira) Fri, 02 Feb 2024 04:17:05 -0800


     [ https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Martin Rueckl updated SPARK-46959:
----------------------------------
    Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
        .option("header","true")
        .option("sep",";")
        .option("encoding","ISO-8859-1")
        .option("lineSep","\r\n")
        .option("nullValue","")
        .option("quote",'"')
        .option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00|<null>|still ok|
|30|300,00|also ok|"|
|40|400,00|<null>|"|

 

 

As one can see, the quoted empty fields in the last column (d) are read as a 
literal " character instead of null, whereas the same quoted empty fields in 
column c are correctly read as null.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (the default is "\") does not seem to trigger 
this bug.
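
For comparison, the expected result can be sketched outside Spark. The snippet below is only an illustration, assuming Python's standard csv module as a stand-in parser (it is not Spark's CSV reader and shares no code with it). The helper name parse_expected is hypothetical. It parses the same data with ";" as separator and '"' as quote character, then maps empty strings to None to mimic nullValue="".

```python
import csv
import io

# The sample data from this report, with \r\n line endings as in lineSep.
CSV_TEXT = (
    '"a";"b";"c";"d"\r\n'
    '10;100,00;"Some;String";"ok"\r\n'
    '20;200,00;"";"still ok"\r\n'
    '30;300,00;"also ok";""\r\n'
    '40;400,00;"";""\r\n'
)


def parse_expected(text):
    """Parse the sample with the stdlib csv module (hypothetical helper).

    Empty strings are mapped to None to mimic nullValue="".
    """
    # newline="" keeps the \r\n endings intact so csv.reader handles them.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=";", quotechar='"'
    )
    header = next(reader)
    rows = [[None if value == "" else value for value in row] for row in reader]
    return header, rows


header, rows = parse_expected(CSV_TEXT)
print(header)
for row in rows:
    print(row)
```

With this stand-in parser, the quoted empty fields in columns c and d both come out as None, which is the behaviour I would expect from the Spark reader regardless of column position.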

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
        .option("header","true")
        .option("sep",";")
        .option("encoding","ISO-8859-1")
        .option("lineSep","\r\n")
        .option("nullValue","")
        .option("quote",'"')
        .option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00|<null>|still ok|
|30|300,00|also ok|"|
|40|400,00|<null>|"|
| | | | |

 

 

As one can see, the quoted empty fields in the last column (d) are read as a 
literal " character instead of null, whereas the same quoted empty fields in 
column c are correctly read as null.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (the default is "\") does not seem to trigger 
this bug.


> CSV reader reads data inconsistently depending on column position
> -----------------------------------------------------------------
>
>                 Key: SPARK-46959
>                 URL: https://issues.apache.org/jira/browse/SPARK-46959
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Martin Rueckl
>            Priority: Critical



