[ 
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated SPARK-56429:
-----------------------------
    Description: 
nullValue and emptyValue are options that Spark accepts when reading CSV (they 
also apply when writing CSV, but I'd like to limit the following discussion to 
"reading a CSV" only).

They have similar explanations in the guide 
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html|https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
{quote}Sets the string representation of a xxx value
{quote}
But the behavior differs, or rather the "direction" in which the specified 
value is used differs:
 * nullValue: if a cell in the CSV matches the given value, either bare or 
wrapped in double quotation marks, it is read as (replaced by) null in the 
dataframe. The specified value is the pattern to match.
 * emptyValue: if the cell contains "", as in `col1,"",col3`, it is read as 
(replaced by) the specified value in the dataframe. The specified value is the 
replacement to substitute in.
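
To make the two directions concrete, here is a toy Python model of the rules in 
the list above. It is only a sketch of the behavior as I described it, not 
Spark's parser: read_cell and read_line are hypothetical helpers, the naive 
comma split and the order of the two checks are assumptions, and "NA" and 
"EMPTY" are arbitrary sample values.

```python
def read_cell(raw, null_value, empty_value):
    """Toy model of one CSV cell on read; NOT Spark's actual parser."""
    quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    text = raw[1:-1] if quoted else raw
    # emptyValue: a quoted empty cell ("") is replaced BY the given value.
    if quoted and text == "":
        return empty_value
    # nullValue: a cell that MATCHES the given value (quoted or not) becomes null.
    if text == null_value:
        return None
    return text


def read_line(line, null_value, empty_value):
    # Naive split: assumes no commas or escaped quotes inside cells.
    return [read_cell(cell, null_value, empty_value) for cell in line.split(",")]


row = read_line('col1,NA,"NA","",col5', null_value="NA", empty_value="EMPTY")
print(row)  # ['col1', None, None, 'EMPTY', 'col5']
```

In this model nullValue is an input pattern that disappears from the result, 
while emptyValue is an output value that appears in the result.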

Using the same wording, "Sets the string representation of xxx", to explain 
both options can be misleading. I had used nullValue before and assumed that 
emptyValue works the same way. I only realized the difference after reading 
the code and unit tests.

How about improving the doc to explain how each option uses the specified 
value?

  was:
nullValue and emptyValue are options that Spark accepts when reading CSV (they 
also apply when writing CSV, but I'd like to limit the following discussion to 
"reading a CSV" only).

They have similar explanations in the guide 
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html|https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
{quote}Sets the string representation of a xxx value
{quote}
But the behavior differs, or rather the "direction" in which the specified 
value is used differs:
 * nullValue: if a cell in the CSV matches the given value, either bare or 
wrapped in double quotation marks, it is read as (replaced by) null in the 
dataframe. The specified value is the pattern to match.
 * emptyValue: if the cell contains "", as in `col1,"",col3`, it is read as 
(replaced by) the specified value in the dataframe. The specified value is the 
replacement to substitute in.

Using the same wording, "Sets the string representation of xxx", to explain 
both options can be misleading. I had used nullValue before and assumed that 
emptyValue works in a similar way.

> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
>                 Key: SPARK-56429
>                 URL: https://issues.apache.org/jira/browse/SPARK-56429
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Xiang Li
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
