[ https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Rueckl updated SPARK-46959:
----------------------------------
Description:

Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
  .option("header","true")
  .option("sep",";")
  .option("encoding","ISO-8859-1")
  .option("lineSep","\r\n")
  .option("nullValue","")
  .option("quote",'"')
  .option("escape","") {code}
results in the following inconsistent dataframe
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00|<null>|still ok|
|30|300,00|also ok|"|
|40|400,00|<null>|"|
As one can see, the quoted empty fields of the last column are not correctly read as null; instead they contain a single double quote. Parsing works correctly for column c. If I recall correctly, this only happens when the "escape" option is set to an empty string; leaving it unset (it defaults to "\") does not seem to trigger the bug. I observed this on Databricks Spark runtime 13.2 (which I believe corresponds to Spark 3.4.1).

was: (previous description, identical except for the final sentence about the Databricks runtime)
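For comparison, a baseline parse of the same input with Python's standard-library csv module treats the quoted empty fields in columns c and d identically (both come back as empty strings, independent of column position). This is a minimal sketch of the expected, position-independent behaviour, not Spark code:

```python
import csv
import io

# Same CSV content as in the report, with ";" separators and CRLF line endings.
data = (
    '"a";"b";"c";"d"\r\n'
    '10;100,00;"Some;String";"ok"\r\n'
    '20;200,00;"";"still ok"\r\n'
    '30;300,00;"also ok";""\r\n'
    '40;400,00;"";""\r\n'
)

# Parse with the same separator and quote character used in the Spark options.
rows = list(csv.reader(io.StringIO(data), delimiter=";", quotechar='"'))

# Quoted empty fields parse as "" in every column, including the last one;
# no stray double quote appears regardless of column position.
for row in rows[1:]:
    print(row)
```

Under the reported options (`nullValue` set to `""`), Spark should likewise map every quoted empty field to null, whether it appears in column c or column d.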
> CSV reader reads data inconsistently depending on column position
> -----------------------------------------------------------------
>
>                 Key: SPARK-46959
>                 URL: https://issues.apache.org/jira/browse/SPARK-46959
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Martin Rueckl
>            Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010)