[
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiang Li updated SPARK-56429:
-----------------------------
Description:
This is a proposal for, at minimum, a documentation change.
"nullValue" and "emptyValue" are options that Spark accepts when reading CSV
(they also apply when writing CSV, but I'd like to limit the following
discussion to reading only).
The guide describes them with nearly identical wording (nanValue as well):
[https://spark.apache.org/docs/latest/sql-data-sources-csv.html]
{quote}Sets the string representation of a xxx value
{quote}
But they behave differently. The "direction" in which the specified value is
used is reversed:
* nullValue: if a cell in the CSV matches the given value, or the given value
enclosed in double quotation marks, it is read as (replaced by) null in the
resulting DataFrame. The specified value is the pattern to match against.
* emptyValue: if a cell contains "", as in {_}col1,"",col3{_}, it is read as
(replaced by) the specified value in the resulting DataFrame. The specified
value is what gets substituted in.
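The two directions can be illustrated with a small toy model in plain Python.
This is only a sketch of the behavior described in the two bullets above, not
Spark's implementation: the helper names read_cell and read_row, the naive
comma split, and the sample values "NA" and "EMPTY" are all assumptions made
for illustration. It does not try to model how the two options interact when
they are set to the same string.

```python
from typing import List, Optional


def read_cell(raw: str, null_value: str, empty_value: str) -> Optional[str]:
    """Toy model of the read-side behavior described above (hypothetical helper)."""
    quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    text = raw[1:-1] if quoted else raw
    # emptyValue: a quoted empty string ("") is replaced BY the option value.
    if quoted and text == "":
        return empty_value
    # nullValue: a cell that MATCHES the option value, quoted or not, becomes null.
    if text == null_value:
        return None
    return text


def read_row(line: str, null_value: str, empty_value: str) -> List[Optional[str]]:
    # Naive split: assumes no commas or escaped quotes inside cells.
    return [read_cell(cell, null_value, empty_value) for cell in line.split(",")]


row = read_row('a,NA,"NA","",b', null_value="NA", empty_value="EMPTY")
print(row)  # ['a', None, None, 'EMPTY', 'b']
```

In this model, "NA" is something looked for in the input, while "EMPTY" is
something written into the output, which is the asymmetry the current wording
hides.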
Using the same wording, "Sets the string representation of xxx", to explain
both can be misleading. I had used nullValue before and hastily assumed from
the doc that emptyValue works the same way. I only discovered the difference
when I read the code and unit tests.
How about improving the doc to explain how the specified value is used by each
option? I could draft a PR for it.
> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
> Key: SPARK-56429
> URL: https://issues.apache.org/jira/browse/SPARK-56429
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Xiang Li
> Priority: Minor
> Labels: pull-request-available