[
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiang Li updated SPARK-56429:
-----------------------------
Description:
nullValue and emptyValue are options that Spark applies when reading CSV (they
also apply when writing CSV, but I'd like to limit the following discussion to
reading only).
They have similar explanations in the guide
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
{quote}Sets the string representation of a xxx value
{quote}
But the behavior is different: the "direction" in which the specified value is
used is reversed:
* nullValue: if a cell in the CSV matches the given value, or the given value
wrapped in double quotation marks, it is read as (replaced by) null in the
dataframe. So the specified value is the pattern to match.
* emptyValue: if the cell contains "", as in `col1,"",col3`, it is read as
(replaced by) the specified value in the dataframe. So the specified value is
the replacement, not the pattern.
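To make the two directions concrete, here is a toy model in plain Python of the
behavior as I describe it above. It is only a sketch, not Spark's parser: the
helper names `read_cell` and `read_line` are made up for illustration, and the
naive split assumes cells contain no commas or escaped quotes.

```python
from typing import List, Optional


def read_cell(raw: str, null_value: str = "", empty_value: str = "") -> Optional[str]:
    """Toy model of nullValue / emptyValue on read (hypothetical helper, not Spark code)."""
    quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    text = raw[1:-1] if quoted else raw
    # emptyValue: a quoted empty string ("") is REPLACED BY the given value.
    if quoted and text == "":
        return empty_value
    # nullValue: the cell text, quoted or not, is MATCHED AGAINST the given value.
    if text == null_value:
        return None
    return text


def read_line(line: str, null_value: str = "", empty_value: str = "") -> List[Optional[str]]:
    # Naive split: assumes no commas or escaped quotes inside cells.
    return [read_cell(cell, null_value, empty_value) for cell in line.split(",")]


row = read_line('a,NA,"NA",""', null_value="NA", empty_value="EMPTY")
print(row)  # ['a', None, None, 'EMPTY']
```

In this model "NA" is a pattern looked for in the input, while "EMPTY" is
a value that only ever appears in the output.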
Using the same wording, "Sets the string representation of xxx", to explain
both options could be misleading. I had used nullValue before and assumed that
emptyValue works in the same way.
> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
> Key: SPARK-56429
> URL: https://issues.apache.org/jira/browse/SPARK-56429
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Xiang Li
> Priority: Minor
>
>