[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei Guo updated SPARK-37604:
----------------------------
Description: 
The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] via issue [SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR [10766|https://github.com/apache/spark/pull/10766].

According to the features description in the databricks spark-csv README, the nullValue option is designed as follows:
{noformat}
When reading files the API accepts several options:
nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame

When writing files the API accepts several options:
nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string.
{noformat}
For null values, the nullValue parameter can be set for both reading and writing in CSVOptions:
{code:scala}
// For writing, convert: null (DataFrame) => nullValue (CSV)
// For reading, convert: nullValue or ,, (CSV) => null (DataFrame)
{code}
For example, suppose a column contains null values and nullValue is set to the string "NULL":
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved CSV file looks like:
{noformat}
Tesla,NULL
{noformat}
If we then read this CSV file with nullValue set to the string "NULL":
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we get a DataFrame whose data matches the original:
||make||comment||
|Tesla|null|
{color:#57d9a3}*We can successfully recover the original DataFrame.*{color}

Since Spark 2.4, for empty strings, there are the options emptyValueInRead (for reading) and emptyValueInWrite (for writing) that can be set in CSVOptions:
{code:scala}
// For writing, convert: "" (DataFrame) => emptyValueInWrite (CSV)
// For reading, convert: "" (CSV) => emptyValueInRead (DataFrame)
{code}
I think the read handling is not suitable: "" and emptyValueInWrite values are not converted back to "" (real empty strings); instead we get emptyValueInRead's configured value. It should be as follows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (CSV) => "" (DataFrame)
{code}
For example, suppose a column contains empty strings and emptyValueInWrite is set to the string "EMPTY":
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved CSV file looks like:
{noformat}
Tesla,EMPTY
{noformat}
If we then read this CSV file with emptyValue (emptyValueInRead) set to the string "EMPTY":
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get a DataFrame whose data is:
||make||comment||
|Tesla|EMPTY|
but the expected DataFrame should be:
||make||comment||
|Tesla| |
{color:#de350b}*We cannot recover the original DataFrame.*{color}
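The round-trip asymmetry described above can be sketched with a small, hypothetical Python simulation of the per-field conversions (the function names are illustrative, not Spark APIs; the logic follows the conversion rules quoted above):

```python
# Hypothetical simulation of the CSVOptions conversions described in this issue.
# Not Spark code: write_field / read_field_* are illustrative names only.

def write_field(value, null_value="NULL", empty_value_in_write="EMPTY"):
    """Writing: null -> nullValue, "" -> emptyValueInWrite."""
    if value is None:
        return null_value
    if value == "":
        return empty_value_in_write
    return value

def read_field_current(token, null_value="NULL", empty_value_in_read="EMPTY"):
    """Current read behavior: nullValue -> null, but "" -> emptyValueInRead,
    so the written emptyValueInWrite marker passes through unchanged."""
    if token == null_value:
        return None
    if token == "":
        return empty_value_in_read
    return token

def read_field_proposed(token, null_value="NULL", empty_value_in_read="EMPTY"):
    """Proposed read behavior: "" or emptyValueInRead -> "" (a real empty string)."""
    if token == null_value:
        return None
    if token in ("", empty_value_in_read):
        return ""
    return token

# Null values round-trip successfully:
assert read_field_current(write_field(None)) is None
# Empty strings do not round-trip today: we get the marker back, not "":
assert read_field_current(write_field("")) == "EMPTY"
# With the proposed read handling, the original empty string is recovered:
assert read_field_proposed(write_field("")) == ""
```

This makes the symmetry argument concrete: nullValue is honored on both sides of the round trip, while emptyValueInRead, as currently described, is never matched against incoming tokens.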
> The option emptyValueInRead (in CSVOptions) is suggested to be designed so
> that any fields matching this string will be set as empty values "" when
> reading
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37604
>                 URL: https://issues.apache.org/jira/browse/SPARK-37604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Wei Guo
>            Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org