[jira] [Created] (SPARK-37604) The parameter emptyValueInRead in CSVOptions is not designed as supposed to be

Guo Wei (Jira) Thu, 09 Dec 2021 19:59:09 -0800

Guo Wei created SPARK-37604:
-------------------------------

             Summary: The parameter emptyValueInRead in CSVOptions is not designed as supposed to be
                 Key: SPARK-37604
                 URL: https://issues.apache.org/jira/browse/SPARK-37604
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Guo Wei



For null values, the parameter nullValue in CSVOptions can be set for both reading and writing:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)
writerSettings.setNullValue(nullValue) 

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
settings.setNullValue(nullValue)
{code}
For example, suppose a column contains null values and nullValue is set to the string "NULL" when writing:
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this CSV file with nullValue set to the string "NULL":
{code:scala}
spark.read.option("nullValue", "NULL").csv(path)
{code}
we get a DataFrame with the following data:
||make||comment||
|Tesla|null|

{color:#57d9a3}*The original DataFrame is recovered successfully.*{color}

 

Since Spark 2.4, CSVOptions also has two parameters for empty strings: emptyValueInRead for reading and emptyValueInWrite for writing:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)
writerSettings.setEmptyValue(emptyValueInWrite)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe)
settings.setEmptyValue(emptyValueInRead) {code}
I think the write handling is suitable, but the read handling is supposed to work as follows:
{code:scala}
// in asParserSettings: "" or emptyValueInWrite (csv) =>""(dataframe)
settings.setEmptyValue(emptyValueInRead) {code}
 

For example, suppose a column contains empty strings and emptyValueInWrite is set to the string "EMPTY":
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this CSV file with emptyValueInRead set to the string "EMPTY":
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path)
{code}
we get a DataFrame with the following data:
||make||comment||
|Tesla|EMPTY|

but the expected DataFrame should contain an empty string in the comment column:
||make||comment||
|Tesla| |

{color:#de350b}*The original DataFrame cannot be recovered.*{color}
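
The asymmetry between the two read behaviours can be sketched in plain Scala, without Spark. This is only a toy model of a single string field, written from the conversion comments in this report. The object EmptyValueRoundTrip and its methods write, readAsReported and readAsProposed are hypothetical names and are not part of Spark or of the CSV parser it uses.

```scala
// Hypothetical model of the CSV round trip for one string field.
// None of these names exist in Spark; they only mirror the comments in the report.
object EmptyValueRoundTrip {
  // Write side: "" (DataFrame) => emptyValueInWrite (CSV)
  def write(value: String, emptyValueInWrite: String): String =
    if (value.isEmpty) emptyValueInWrite else value

  // Read side as reported: "" (CSV) => emptyValueInRead (DataFrame).
  // Any other token, including the marker itself, passes through unchanged.
  def readAsReported(token: String, emptyValueInRead: String): String =
    if (token.isEmpty) emptyValueInRead else token

  // Read side as proposed: "" or the marker (CSV) => "" (DataFrame)
  def readAsProposed(token: String, emptyValue: String): String =
    if (token.isEmpty || token == emptyValue) "" else token

  def main(args: Array[String]): Unit = {
    val marker = "EMPTY"
    val written = write("", marker)
    println(s"written:  '$written'")
    println(s"reported: '${readAsReported(written, marker)}'")
    println(s"proposed: '${readAsProposed(written, marker)}'")
  }
}
```

In this model the written token is the marker "EMPTY". The reported read mapping returns the marker unchanged, while the proposed mapping returns the empty string, which is the only one of the two that inverts the write step.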



