[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

Guo Wei (Jira) Wed, 08 Dec 2021 00:41:24 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455043#comment-17455043
 ]


Guo Wei edited comment on SPARK-37575 at 12/8/21, 8:40 AM:
-----------------------------------------------------------

I think I may have found the root cause by debugging the Spark source code.

In {color:#0747a6}UnivocityGenerator{color}, when a column value is null, it is 
replaced with {color:#00875a}options.nullValue{color}, whose default value is "" 
(an empty string):

 
{code:java}
private def convertRow(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
    if (!row.isNullAt(i)) {
      values(i) = valueConverters(i).apply(row, i)
    } else {
      values(i) = options.nullValue
    }
    i += 1
  }
  values
} {code}
 

So, in the {color:#0747a6}univocity-parsers{color} library (a Spark dependency), 
the {color:#0747a6}AbstractWriter{color} class receives an element that was 
originally null but has already been changed to "" by UnivocityGenerator. It 
therefore fails the condition (element == null), matches string.isEmpty() 
instead, and is written as emptyValue, whose default value is "\"\"":

 
{code:java}
protected String getStringValue(Object element) {
   usingNullOrEmptyValue = false;
   if (element == null) {
      usingNullOrEmptyValue = true;
      return nullValue;
   }
   String string = String.valueOf(element);
   if (string.isEmpty()) {
      usingNullOrEmptyValue = true;
      return emptyValue;
   }
   return string;
} {code}
[~hyukjin.kwon] Should we change how nulls are handled (the isNullAt branch) in 
{color:#0747a6}UnivocityGenerator{color}?
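
To make the interaction between the two snippets easier to follow, here is a minimal standalone sketch in plain Java. It does not use Spark or univocity-parsers. The class name NullVsEmptyCsvSketch and the methods substituteNulls and render are hypothetical, and the three default values are assumptions taken from the defaults described in this comment ("" for the null value, "\"\"" for the empty value). getStringValue is a simplified copy of the logic quoted from AbstractWriter.

```java
import java.util.Arrays;

public class NullVsEmptyCsvSketch {
    // Assumed defaults, as described in the comment above.
    static final String SPARK_NULL_VALUE = "";       // options.nullValue
    static final String WRITER_NULL_VALUE = "";      // writer-side nullValue
    static final String WRITER_EMPTY_VALUE = "\"\""; // writer-side emptyValue

    // Stand-in for the null branch of convertRow: nulls become SPARK_NULL_VALUE.
    static String[] substituteNulls(String[] row) {
        String[] values = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            values[i] = (row[i] != null) ? row[i] : SPARK_NULL_VALUE;
        }
        return values;
    }

    // Simplified copy of the getStringValue logic (without the usingNullOrEmptyValue flag).
    static String getStringValue(Object element) {
        if (element == null) {
            return WRITER_NULL_VALUE;
        }
        String string = String.valueOf(element);
        if (string.isEmpty()) {
            return WRITER_EMPTY_VALUE;
        }
        return string;
    }

    // Renders each field, optionally applying the null substitution first.
    static String[] render(String[] row, boolean substituteFirst) {
        String[] input = substituteFirst ? substituteNulls(row) : row;
        String[] out = new String[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = getStringValue(input[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] row = {"spark", null, ""};
        // With substitution, null has already become "" and takes the isEmpty branch.
        System.out.println(Arrays.toString(render(row, true)));
        // Without substitution, null reaches the (element == null) branch.
        System.out.println(Arrays.toString(render(row, false)));
    }
}
```

In this sketch, the row ("spark", null, "") renders the null field as a quoted empty string when the substitution runs first, and as nothing when the null is passed through unchanged. Whether passing nulls through is the right fix for UnivocityGenerator is exactly the open question here.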

 



> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37575
>                 URL: https://issues.apache.org/jira/browse/SPARK-37575
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Guo Wei
>            Priority: Major
>
> As mentioned in the SQL migration guide 
> ([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "", rather than "" for empty strings and nothing for null values.
> Code:
> {code:java}
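> // Assumes a SparkSession named spark, as in spark-shell (where this import is automatic).
> import spark.implicits._
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")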
> {code}
> Actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
