[ 
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56429:
-----------------------------------
    Labels: pull-request-available  (was: )

> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
>                 Key: SPARK-56429
>                 URL: https://issues.apache.org/jira/browse/SPARK-56429
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Xiang Li
>            Priority: Minor
>              Labels: pull-request-available
>
> This is a proposal for, at the least, a doc change.
> "nullValue" and "emptyValue" are options Spark uses when reading CSV (they 
> also apply when writing CSV, but I'd like to limit the following 
> discussion to reading only).
> They have similar explanations (as does nanValue) in the guide at 
> https://spark.apache.org/docs/latest/sql-data-sources-csv.html:
> {quote}Sets the string representation of a xxx value
> {quote}
> But their behaviors differ: the "direction" in which the specified 
> value is used is not the same:
>  * nullValue: if a cell in the CSV matches the given value, or the given value 
> enclosed in double quotation marks, it is read as (replaced by) null in the 
> resulting DataFrame. The specified value is the pattern to match against.
>  * emptyValue: if a cell contains "", as in {_}col1,"",col3{_}, it is 
> read as (replaced by) the specified value in the resulting DataFrame. So 
> the specified value is the replacement that ends up in the DataFrame.
> Could it be misleading to use the same wording, "Sets the string 
> representation of xxx", to explain both? I had used nullValue before and 
> hastily assumed from the doc that emptyValue works the same way. 
> I did not realize my mistake until I read the code and the unit tests.
>  
> How about improving the doc to explain how the specified value is used 
> in each case? I could draft a PR for it.
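
The two "directions" described in the bullets above can be illustrated with a toy model in plain Python. This is only a sketch of the behavior as the description states it, not Spark's implementation: the function names `read_cell` and `read_line`, the sample values "NA" and "EMPTY", and the naive comma split are all assumptions made for illustration. Unquoted empty cells and commas inside quotes are deliberately not modeled.

```python
from typing import List, Optional


def read_cell(raw: str, *, null_value: str, empty_value: str) -> Optional[str]:
    """Toy model of the two rules from the description, for one raw CSV cell.

    Hypothetical helper for illustration; not part of Spark.
    """
    # emptyValue rule: a quoted empty string ("") is replaced BY the option
    # value. The option is the output, not a pattern to match.
    if raw == '""':
        return empty_value

    # Strip one pair of surrounding double quotes, if present.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        parsed = raw[1:-1]
    else:
        parsed = raw

    # nullValue rule: a cell that MATCHES the option value (quoted or not)
    # is read as null. The option is the pattern to match against.
    if parsed == null_value:
        return None
    return parsed


def read_line(line: str, *, null_value: str, empty_value: str) -> List[Optional[str]]:
    # Naive split: assumes no commas inside quoted cells.
    return [
        read_cell(cell, null_value=null_value, empty_value=empty_value)
        for cell in line.split(",")
    ]


row = read_line('a,"",NA,"NA",b', null_value="NA", empty_value="EMPTY")
print(row)
```

In this model, "NA" disappears from the output (it is matched and turned into null), whereas "EMPTY" appears in the output (it is written in place of ""), which is the asymmetry the single doc sentence hides.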



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
