[ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
----------------------------------
    Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
        .option("header","true")
        .option("sep",";")
        .option("encoding","ISO-8859-1")
        .option("lineSep","\r\n")
        .option("nullValue","")
        .option("quote",'"')
        .option("escape","") {code}
results in the following inconsistent dataframe

 

As one can see, the quoted empty fields in the last column (d) are not read as 
null, whereas the same quoted empty fields in column c are.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to trigger 
this bug.
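
For comparison, here is a minimal sketch of the outcome the report appears to expect, namely that a quoted empty field becomes null regardless of which column it sits in. It uses Python's standard csv module, not Spark's CSV reader, so it says nothing about where the Spark bug lies. The names SAMPLE and read_expected are hypothetical, and mapping every empty field to None is an assumption about what nullValue="" is meant to do.

```python
import csv
import io

# The sample from the report, joined with the "\r\n" separator given via lineSep.
# A trailing separator after the last row is an assumption.
SAMPLE = (
    '"a";"b";"c";"d"\r\n'
    '10;100,00;"Some;String";"ok"\r\n'
    '20;200,00;"";"still ok"\r\n'
    '30;300,00;"also ok";""\r\n'
    '40;400,00;"";""\r\n'
)


def read_expected(text, null_value=""):
    """Parse the sample and map null_value to None in every column alike."""
    # newline="" keeps the "\r\n" endings intact so csv.reader handles them itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";", quotechar='"')
    header = next(reader)
    rows = [
        [None if field == null_value else field for field in record]
        for record in reader
    ]
    return header, rows


header, rows = read_expected(SAMPLE)
for row in rows:
    print(dict(zip(header, row)))
```

In this sketch the last row comes out with None in both c and d, which is the column-independent behaviour the report is asking for.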

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
        .option("header","true")
        .option("sep",";")
        .option("encoding","ISO-8859-1")
        .option("lineSep","\r\n")
        .option("nullValue","")
        .option("quote",'"')
        .option("escape","") {code}
results in the following inconsistent dataframe
!image-2024-02-02-13-05-26-203.png|width=352,height=120!

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -----------------------------------------------------------------
>
>                 Key: SPARK-46959
>                 URL: https://issues.apache.org/jira/browse/SPARK-46959
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Martin Rueckl
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
