[
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56429:
-----------------------------------
Labels: pull-request-available (was: )
> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
> Key: SPARK-56429
> URL: https://issues.apache.org/jira/browse/SPARK-56429
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Xiang Li
> Priority: Minor
> Labels: pull-request-available
>
> This is a proposal for a documentation change, at least.
> "nullValue" and "emptyValue" are options that Spark applies when reading CSV
> (they can also be applied when writing CSV, but I'd like to limit the
> following discussion to reading CSV only).
> They have similar explanations (as does nanValue) in the guide
> [https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
> {quote}Sets the string representation of a xxx value
> {quote}
> But their behaviors differ: the "direction" in which the specified value is
> used is different.
> * nullValue: if a cell in the CSV matches the given value, or the given value
> wrapped in double quotation marks, it is read as (replaced by) null in the
> resulting DataFrame. The specified value is the pattern to match against.
> * emptyValue: if a cell contains "", as in {_}col1,"",col3{_}, it is
> read as (replaced by) the specified value in the resulting DataFrame. So
> the specified value is the replacement, not the pattern.
> Using the same wording, "Sets the string representation of xxx", to explain
> both could be misleading. I had used nullValue before and hastily assumed
> from the doc that emptyValue works the same way. I only realized the
> difference after reading the code and unit tests.
>
> How about improving the doc to explain how each option uses the specified
> value? I could draft a PR for it.
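
To make the two "directions" in the description above concrete, here is a minimal toy model in plain Python. It is not Spark's implementation: `read_cell` and `read_line` are hypothetical helpers, the comma split is naive, and checking the quoted-empty case before the null match is an assumption of this sketch, since the issue text does not say which option wins when both could apply.

```python
from typing import List, Optional


def read_cell(raw: str, null_value: str, empty_value: str) -> Optional[str]:
    """Toy model of the two options as the issue describes them (not Spark code)."""
    # emptyValue: a quoted empty cell ("") is replaced BY the configured value.
    # Checking this first is an assumption of this sketch.
    if raw == '""':
        return empty_value
    # nullValue: a cell that MATCHES the configured value, bare or wrapped in
    # double quotes, becomes null (None here).
    if raw == null_value or raw == f'"{null_value}"':
        return None
    # Any other cell: strip one pair of surrounding quotes and keep the text.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    return raw


def read_line(line: str, null_value: str, empty_value: str) -> List[Optional[str]]:
    # Naive split: assumes no commas or escaped quotes inside cells.
    return [read_cell(cell, null_value, empty_value) for cell in line.split(",")]


row = read_line('a,"",NA,"NA",b', null_value="NA", empty_value="EMPTY")
print(row)  # ['a', 'EMPTY', None, None, 'b']
```

With the Spark CSV reader, the same two settings would be passed as `.option("nullValue", "NA")` and `.option("emptyValue", "EMPTY")`. The model only mirrors the behavior as stated in the description and should be checked against an actual Spark run before relying on it.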