[jira] [Updated] (SPARK-56429) Explain the differences between nullValue and emptyValue when reading CSV

Xiang Li (Jira) Thu, 23 Apr 2026 08:27:56 -0700


     [ https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Xiang Li updated SPARK-56429:
-----------------------------
    Description: 
This is a proposal for, at minimum, a documentation change.

"nullValue" and "emptyValue" are options that Spark accepts when reading CSV 
(they also apply when writing CSV, but I'd like to limit the following 
discussion to reading only).

They have similar explanations (as does nanValue) in the guide at 
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html|https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
{quote}Sets the string representation of a xxx value
{quote}
But they behave differently: the "direction" in which the specified value is 
used is reversed:
 * nullValue: if a cell in the CSV matches the given value, either bare or 
wrapped in double quotation marks, it is read as null in the resulting 
DataFrame. The specified value is the pattern to match against.
 * emptyValue: if a cell contains "", as in {_}col1,"",col3{_}, it is read as 
the specified value in the resulting DataFrame. The specified value is the 
replacement, not the pattern.
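
The two directions can be illustrated with a small toy model in plain Python. This is only a sketch of the behavior described in the bullets above, not Spark's implementation: the helper names `read_cell` and `read_line` are hypothetical, the comma split is naive, and the sketch does not model how the two options interact when their values overlap.

```python
from typing import List, Optional


def read_cell(raw: str, null_value: str, empty_value: str) -> Optional[str]:
    """Toy model of reading one CSV cell (hypothetical helper, not Spark code)."""
    if raw == '""':
        # emptyValue direction: a quoted empty cell is replaced BY the option value.
        return empty_value
    # Strip one pair of surrounding double quotes, if present.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        raw = raw[1:-1]
    # nullValue direction: the option value is the pattern matched AGAINST.
    if raw == null_value:
        return None
    return raw


def read_line(line: str, null_value: str = "NA",
              empty_value: str = "EMPTY") -> List[Optional[str]]:
    # Naive split: assumes no commas or escaped quotes inside cells.
    return [read_cell(cell, null_value, empty_value) for cell in line.split(",")]


row = read_line('a,NA,"NA","",b')
print(row)  # ['a', None, None, 'EMPTY', 'b']
```

In Spark itself the corresponding settings would be passed to the CSV reader as `.option("nullValue", "NA")` and `.option("emptyValue", "EMPTY")`.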

Using the same wording, "Sets the string representation of xxx", for both 
options can be misleading. I had used nullValue before and hastily assumed 
from the doc that emptyValue works the same way. I only realized the mistake 
after reading the code and the unit tests.

How about improving the doc to explain how the specified value is used in each 
case? I could draft a PR for it.


> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
>                 Key: SPARK-56429
>                 URL: https://issues.apache.org/jira/browse/SPARK-56429
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Xiang Li
>            Priority: Minor
>              Labels: pull-request-available


