[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043 ]
Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM:
-----------------------------------------------------------

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column value is null, it is replaced with {color:#00875a}options.nullValue{color}, whose default is "" (the empty string):
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
    if (!row.isNullAt(i)) {
      values(i) = valueConverters(i).apply(row, i)
    } else {
      values(i) = options.nullValue
    }
    i += 1
  }
  values
}
{code}
So, by the time the value reaches the {color:#0747a6}AbstractWriter{color} class of the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), an element that was originally null has already been changed to "" by UnivocityGenerator. It therefore does not satisfy the condition (element == null), falls through to the isEmpty check, and is finally written as emptyValue, whose default is "\"\"":
{code:java}
protected String getStringValue(Object element) {
    usingNullOrEmptyValue = false;
    if (element == null) {
        usingNullOrEmptyValue = true;
        return nullValue;
    }
    String string = String.valueOf(element);
    if (string.isEmpty()) {
        usingNullOrEmptyValue = true;
        return emptyValue;
    }
    return string;
}
{code}
[~hyukjin.kwon] Should we fix the null handling (the isNullAt branch) in {color:#0747a6}UnivocityGenerator{color}?

> Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing (for null values)
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37575
>                 URL: https://issues.apache.org/jira/browse/SPARK-37575
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Guo Wei
>            Priority: Major
>
> As mentioned in the SQL migration guide ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files.
> For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat}
>
> But actually, both empty strings and null values are saved as quoted empty strings "", rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> import spark.implicits._  // needed for toDF outside spark-shell
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
> actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2:
> line3: ""
> {noformat}
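
The interaction described in the comment can be reproduced without Spark or univocity-parsers on the classpath. The following is only a minimal sketch: the class CsvNullSketch and its methods generatorStep, writerStep, current and passNullThrough are hypothetical names that model the two quoted snippets, and the constants assume the default values stated in the comment (nullValue = "", emptyValue = "\"\""). The passNullThrough variant is one conceivable direction for a fix, not necessarily what Spark does or should do.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the two quoted steps; this is not Spark or univocity code. */
final class CsvNullSketch {
    // Assumed defaults, as stated in the comment above.
    static final String NULL_VALUE = "";
    static final String EMPTY_VALUE = "\"\"";

    /** Step 1, modeled on convertRow: a null cell becomes nullValue before the writer sees it. */
    static String generatorStep(String cell) {
        return cell != null ? cell : NULL_VALUE;
    }

    /** Step 2, modeled on getStringValue: null maps to nullValue, an empty string to emptyValue. */
    static String writerStep(Object element) {
        if (element == null) {
            return NULL_VALUE;
        }
        String s = String.valueOf(element);
        return s.isEmpty() ? EMPTY_VALUE : s;
    }

    /** Both steps chained, as in the reported behavior. */
    static String current(String cell) {
        return writerStep(generatorStep(cell));
    }

    /** Hypothetical alternative: hand the null to the writer unchanged. */
    static String passNullThrough(String cell) {
        return writerStep(cell);
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("spark", null, "");
        for (String cell : column) {
            System.out.println("current=[" + current(cell)
                    + "] passNullThrough=[" + passNullThrough(cell) + "]");
        }
    }
}
```

With these assumed defaults, current maps both null and the empty string to a quoted empty string, which matches the actual result in the report, while passNullThrough writes nothing for null and "" for the empty string, which matches the expected result.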