[ 
https://issues.apache.org/jira/browse/SPARK-56429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077168#comment-18077168
 ] 

Anupam Yadav commented on SPARK-56429:
--------------------------------------

[~mmolimar], [~maxgekk], [~gurwls223]: could you please take a look at

[https://github.com/apache/spark/pull/55405]

Please let me know if it needs any changes. Thanks!

> Explain the differences between nullValue and emptyValue when reading CSV
> -------------------------------------------------------------------------
>
>                 Key: SPARK-56429
>                 URL: https://issues.apache.org/jira/browse/SPARK-56429
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Xiang Li
>            Priority: Minor
>              Labels: pull-request-available
>
> This is a proposal for, at a minimum, a documentation change.
> "nullValue" and "emptyValue" are options Spark uses when reading CSV (they 
> also apply when writing CSV, but I'd like to limit the following 
> discussion to reading only).
> They have similar explanations (as does nanValue) in the guide at 
> [https://spark.apache.org/docs/latest/sql-data-sources-csv.html]:
> {quote}Sets the string representation of a xxx value
> {quote}
> But they behave differently: the "direction" in which the specified 
> value is used is different:
>  * nullValue: if a cell in the CSV matches the given value, or the given value 
> enclosed in double quotation marks, it is read as (replaced by) null in the 
> resulting DataFrame. The specified value is the pattern to match against.
>  * emptyValue: if a cell contains "", as in {_}col1,"",col3{_}, it is 
> read as (replaced by) the specified value in the resulting DataFrame. The 
> specified value is the replacement, not the pattern.
> Using the same wording, "Sets the string representation of a xxx value", 
> to explain both could be misleading. I had used nullValue before and 
> hastily assumed from the doc that emptyValue works the same way. 
> I only found out otherwise after reading the code and unit tests.
> How about improving the doc to spell out how the specified value is used 
> in each case? I could draft a PR for it.
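
The two "directions" described in the quoted issue can be illustrated with a small toy model. This is only a sketch of the behavior as the reporter describes it, not Spark's implementation: `read_cell` is a hypothetical helper, the `null_value` and `empty_value` parameters merely mirror the option names, and the model assumes simple cells with no embedded commas or escaped quotes. Applying the emptyValue replacement before the nullValue match is also an assumption of this sketch.

```python
from typing import Optional


def read_cell(raw: str, *, null_value: str, empty_value: str) -> Optional[str]:
    """Toy model of one CSV cell being read (hypothetical, not Spark code)."""
    # Detect a cell wrapped in double quotes and strip the quotes.
    quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    content = raw[1:-1] if quoted else raw

    # emptyValue is a *replacement*: a quoted empty cell ("") becomes
    # the configured value.
    if quoted and content == "":
        content = empty_value

    # nullValue is a *pattern*: a cell whose content matches it
    # (quoted or not) is read as null, modeled here as None.
    if content == null_value:
        return None
    return content


# One row, split naively on commas (assumption: no embedded commas).
row = [read_cell(c, null_value="NA", empty_value="EMPTY")
       for c in 'col1,"",NA'.split(",")]
print(row)
```

In this model the string passed as `null_value` is looked for in the input, while the string passed as `empty_value` only ever appears in the output, which is the asymmetry the issue asks the documentation to spell out.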



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

