[ 
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Li updated SPARK-56429:
-----------------------------
    Description: 
nullValue and emptyValue are options that Spark applies when reading CSV (they 
can also be applied when writing CSV, but I'd like to limit the following 
discussion to reading only).

They have similar explanations in the guide 
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html|https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
{quote}Sets the string representation of a xxx value
{quote}
But their behavior differs in the "direction" in which the specified value is 
used:
 * nullValue: if a cell in the CSV matches the given value, or the given value 
enclosed in double quotation marks, it is read as (replaced by) null in the 
DataFrame. The specified value is the pattern to match.
 * emptyValue: if a cell contains "", as in `col1,"",col3`, it is read as 
(replaced by) the specified value in the DataFrame. The specified value is the 
replacement, not the pattern to match.
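
To make the two directions concrete, here is a toy model in plain Python. It is 
only a sketch of the behavior described above, not Spark's parser: read_row is 
a hypothetical helper, it handles only simple rows (no embedded commas or 
escaped quotes), and what it does in cases this description does not cover 
(unquoted empty cells, or a cell that both rules could apply to) is an 
arbitrary choice made for the sketch.

```python
def read_row(line, null_value, empty_value):
    """Toy model of nullValue/emptyValue on read; NOT Spark's implementation."""
    result = []
    for raw in line.split(","):
        # A cell counts as quoted when it is wrapped in double quotation marks.
        quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
        text = raw[1:-1] if quoted else raw
        if quoted and text == "":
            # emptyValue direction: the option is the replacement written out.
            result.append(empty_value)
        elif text == null_value:
            # nullValue direction: the option is the pattern matched in the cell
            # (quoted or unquoted), and the result is null (None here).
            result.append(None)
        else:
            # Anything else, including an unquoted empty cell, passes through
            # unchanged in this sketch (an assumption, not Spark's rule).
            result.append(text)
    return result


print(read_row('a,NA,"NA","",d', null_value="NA", empty_value="EMPTY"))
# prints ['a', None, None, 'EMPTY', 'd']
```

With nullValue the option string is looked for in the input; with emptyValue 
the option string appears in the output.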

Using the same wording, "Sets the string representation of xxx", to explain 
both could be misleading. I had used nullValue before and assumed that 
emptyValue worked in a similar way.

 

 

> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
>                 Key: SPARK-56429
>                 URL: https://issues.apache.org/jira/browse/SPARK-56429
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Xiang Li
>            Priority: Minor



