[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ] 

Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM:
-----------------------------------------------------------

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column value is null, 
it is replaced with {color:#00875a}options.nullValue{color}, whose default 
value is "" (the empty string).

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
    if (!row.isNullAt(i)) {
      values(i) = valueConverters(i).apply(row, i)
    } else {
      // A null column is replaced by options.nullValue ("" by default),
      // so the writer never receives an actual null.
      values(i) = options.nullValue
    }
    i += 1
  }
  values
}
{code}
 

So, in the {color:#0747a6}AbstractWriter{color} class of the 
{color:#0747a6}univocity-parsers{color} library (a Spark dependency), an 
element that was originally null has already been changed to "" by 
UnivocityGenerator. It therefore does not satisfy the condition 
(element == null), falls through to the isEmpty() check, and is finally 
written as emptyValue, whose default value is "\"\"".

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   // Only an actual null is written as nullValue.
   if (element == null) {
      usingNullOrEmptyValue = true;
      return nullValue;
   }
   String string = String.valueOf(element);
   // An empty string, including one substituted for null upstream,
   // is written as emptyValue.
   if (string.isEmpty()) {
      usingNullOrEmptyValue = true;
      return emptyValue;
   }
   return string;
}
{code}
[~hyukjin.kwon] Should we change the null handling (the isNullAt branch) in 
{color:#0747a6}UnivocityGenerator{color}?
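
The interaction of the two snippets above can be reproduced without Spark or univocity-parsers. The following is only a minimal sketch: the class NullVsEmptyDemo and its methods are hypothetical names, the default values are the ones reported in this comment, and the writer-side nullValue is assumed to be the empty string. The passThrough variant illustrates one possible direction (handing the null to the writer unchanged); it is not an actual Spark patch.

```java
import java.util.Arrays;

// Standalone model of the two steps quoted above.
// It does not call Spark or univocity-parsers; all names are hypothetical.
final class NullVsEmptyDemo {
    // Defaults as reported in the comment (assumed, not read from the libraries).
    static final String SPARK_NULL_VALUE = "";        // options.nullValue
    static final String WRITER_NULL_VALUE = "";       // assumption: writer-side nullValue is empty
    static final String WRITER_EMPTY_VALUE = "\"\"";  // emptyValue

    // Mirrors convertRow: a null cell is replaced by options.nullValue.
    static String convertCell(String cell) {
        return cell != null ? cell : SPARK_NULL_VALUE;
    }

    // Mirrors getStringValue: null -> nullValue, empty string -> emptyValue.
    static String getStringValue(Object element) {
        if (element == null) {
            return WRITER_NULL_VALUE;
        }
        String string = String.valueOf(element);
        return string.isEmpty() ? WRITER_EMPTY_VALUE : string;
    }

    // Current flow: the substitution happens before the writer sees the value.
    static String current(String cell) {
        return getStringValue(convertCell(cell));
    }

    // Hypothetical alternative: the null reaches the writer unchanged.
    static String passThrough(String cell) {
        return getStringValue(cell);
    }

    public static void main(String[] args) {
        for (String cell : Arrays.asList("spark", null, "")) {
            System.out.println("current=[" + current(cell) + "] passThrough=[" + passThrough(cell) + "]");
        }
    }
}
```

Under these assumptions, running main prints current=[spark] passThrough=[spark], then current=[""] passThrough=[], then current=[""] passThrough=[""]. In the current flow the null row and the empty-string row are indistinguishable, which matches the actual result in the issue description, while the pass-through variant matches the expected result.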

 

 



> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37575
>                 URL: https://issues.apache.org/jira/browse/SPARK-37575
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Guo Wei
>            Priority: Major
>
> As mentioned in the SQL migration 
> guide ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> Code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
> Actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> Expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}


