Martin Rueckl created SPARK-46959:
-------------------------------------

             Summary: CSV reader reads data inconsistently depending on column position
                 Key: SPARK-46959
                 URL: https://issues.apache.org/jira/browse/SPARK-46959
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.4.1
            Reporter: Martin Rueckl


Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
        .option("header","true")
        .option("sep",";")
        .option("encoding","ISO-8859-1")
        .option("lineSep","\r\n")
        .option("nullValue","")
        .option("quote",'"')
        .option("escape","") {code}
results in the following inconsistent dataframe:
!image-2024-02-02-13-05-26-203.png|width=352,height=120!

As the image shows, the quoted empty fields in the last column (d) are not 
read as null, whereas the quoted empty fields in column c are.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to trigger 
the bug.
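
To make the comparison easy to repeat, here is a minimal PySpark sketch that writes the sample above to a file and reads it twice, once with the reported options and once with "escape" left at its default. The helper names (`write_sample`, `options_without_escape`, `compare`), the local session settings, and the trailing line terminator after the last row are assumptions for illustration, not part of the original report; the outcome on any given Spark version is not asserted here.

```python
import os
import tempfile

# Sample rows copied from the report above.
SAMPLE_ROWS = [
    '"a";"b";"c";"d"',
    '10;100,00;"Some;String";"ok"',
    '20;200,00;"";"still ok"',
    '30;300,00;"also ok";""',
    '40;400,00;"";""',
]

# Reader options exactly as listed in the report.
REPORTED_OPTIONS = {
    "header": "true",
    "sep": ";",
    "encoding": "ISO-8859-1",
    "lineSep": "\r\n",
    "nullValue": "",
    "quote": '"',
    "escape": "",
}


def write_sample(path):
    """Write the sample CSV with CRLF line endings in ISO-8859-1.

    Assumption: the file ends with a line terminator after the last row.
    """
    data = "\r\n".join(SAMPLE_ROWS) + "\r\n"
    # newline="" stops Python from translating the explicit "\r\n".
    with open(path, "w", encoding="ISO-8859-1", newline="") as f:
        f.write(data)
    return path


def options_without_escape():
    """Same options, but with "escape" left at Spark's default."""
    return {k: v for k, v in REPORTED_OPTIONS.items() if k != "escape"}


def compare(path):
    """Read the file with both option sets and print the results."""
    # Imported lazily; assumes pyspark is installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("SPARK-46959")
        .getOrCreate()
    )
    try:
        cases = [
            ("escape set to empty string", REPORTED_OPTIONS),
            ("escape left at default", options_without_escape()),
        ]
        for label, opts in cases:
            print(label)
            spark.read.options(**opts).csv(path).show()
    finally:
        spark.stop()
```

With pyspark available, `compare(write_sample(os.path.join(tempfile.mkdtemp(), "spark_46959.csv")))` prints both dataframes so columns c and d can be checked side by side.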



