[
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Rueckl updated SPARK-46959:
----------------------------------
Description:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";""
{code}
with these options
{code:java}
spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("encoding", "ISO-8859-1")
  .option("lineSep", "\r\n")
  .option("nullValue", "")
  .option("quote", "\"")
  .option("escape", "")
{code}
results in the following inconsistent dataframe
As one can see, the quoted empty fields in the last column (d) are not read as
null, whereas those in column c are.
If I recall correctly, this only happens when the "escape" option is set to an
empty string. Leaving it unset (it defaults to "\") does not seem to trigger
this bug.
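For anyone trying to reproduce this, here is a minimal sketch in Python. It is not taken from the original report: the helper names (`write_sample_csv`, `read_with_reported_options`, `null_counts`) and the file path are hypothetical, and the trailing line break at the end of the file is an assumption, since the report does not say whether the original file had one. The reader function expects an existing SparkSession to be passed in.

```python
from pathlib import Path

# Sample rows copied from the report, joined with CRLF to match lineSep.
# The trailing CRLF is an assumption; the report does not state it.
CSV_TEXT = "\r\n".join([
    '"a";"b";"c";"d"',
    '10;100,00;"Some;String";"ok"',
    '20;200,00;"";"still ok"',
    '30;300,00;"also ok";""',
    '40;400,00;"";""',
]) + "\r\n"


def write_sample_csv(path):
    """Write the sample as ISO-8859-1 bytes so the CRLF endings are kept as-is."""
    Path(path).write_bytes(CSV_TEXT.encode("ISO-8859-1"))
    return path


def read_with_reported_options(spark, path):
    """Read the file with the options listed in the report.

    `spark` is an existing SparkSession supplied by the caller.
    """
    return (
        spark.read
        .option("header", "true")
        .option("sep", ";")
        .option("encoding", "ISO-8859-1")
        .option("lineSep", "\r\n")
        .option("nullValue", "")
        .option("quote", '"')
        .option("escape", "")
        .csv(str(path))
    )


def null_counts(df):
    """Return a dict mapping each column name to its number of null values."""
    exprs = [
        f"sum(case when `{c}` is null then 1 else 0 end) as `{c}`"
        for c in df.columns
    ]
    return df.selectExpr(*exprs).first().asDict()
```

Calling `null_counts(read_with_reported_options(spark, write_sample_csv("sample.csv")))` once with the `escape` option as above and once with that line removed should show whether columns c and d are treated differently on a given Spark version.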
was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";""
{code}
with these options
{code:java}
spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("encoding", "ISO-8859-1")
  .option("lineSep", "\r\n")
  .option("nullValue", "")
  .option("quote", "\"")
  .option("escape", "")
{code}
results in the following inconsistent dataframe
!image-2024-02-02-13-05-26-203.png|width=352,height=120!
As one can see, the quoted empty fields in the last column (d) are not read as
null, whereas those in column c are.
If I recall correctly, this only happens when the "escape" option is set to an
empty string. Leaving it unset (it defaults to "\") does not seem to trigger
this bug.
> CSV reader reads data inconsistently depending on column position
> -----------------------------------------------------------------
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Martin Rueckl
> Priority: Critical