[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594046#comment-15594046
 ] 

Felix Cheung commented on SPARK-17916:
--------------------------------------

So here's what happens.

First, the R read.csv documentation clearly states that it treats an 
empty/blank string the same as NA under the following condition: "Blank fields 
are also considered to be missing values in logical, integer, numeric and 
complex fields."

Second, in this example in R, the 2nd column is inferred as "logical" instead 
of "character" (i.e. string), as one might expect:
{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
1    1   NA
2    2   NA
> str(df)
'data.frame':   2 obs. of  2 variables:
 $ col1: int  1 2
 $ col2: logi  NA NA
{code}

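And that is why the blank string is turned into NA: "-" matches na.strings and 
the only other field is blank, so the column has no non-missing values, R 
infers it as logical, and a blank field in a logical column counts as missing.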

Whereas if the data.frame has a character/factor column instead, the blank 
field is retained as blank:
{code}
> d <- "col1,col2
+ 1,\"###\"
+ 2,\"\"
+ 3,\"this is a string\""
> df <- read.csv(text=d, quote="\"", na.strings=c("###"))
> df
  col1             col2
1    1             <NA>
2    2
3    3 this is a string
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 2 levels "","this is a string": NA 1 2
{code}
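
To make the rule in the two examples easier to follow, here is a rough sketch 
in Python. It is only a toy model of the behavior shown above, not R's actual 
implementation; the function name convert_column and the type labels are made 
up for illustration, None stands in for NA, and "character" stands in for 
character/factor.

```python
from typing import List, Optional, Tuple


def _is_int(s: str) -> bool:
    """Return True if the field parses as an integer."""
    try:
        int(s)
        return True
    except ValueError:
        return False


def convert_column(
    fields: List[str], na_strings: List[str]
) -> Tuple[str, List[Optional[object]]]:
    """Toy model of the column conversion described above (hypothetical helper)."""
    # Fields that match na_strings become missing (None stands in for NA).
    values: List[Optional[str]] = [None if f in na_strings else f for f in fields]
    # Fields that are neither missing nor blank decide the column type.
    non_blank = [v for v in values if v is not None and v.strip() != ""]

    # Nothing but NA and blanks: the column is treated as logical,
    # and blanks in a logical column count as missing.
    if not non_blank:
        return "logical", [None] * len(values)

    # All remaining fields are integers: blanks count as missing here too.
    if all(_is_int(v) for v in non_blank):
        return "integer", [
            None if v is None or v.strip() == "" else int(v) for v in values
        ]

    # Otherwise the column is text, and a blank field is kept as "".
    return "character", list(values)


print(convert_column(["-", ""], ["-"]))
print(convert_column(["###", "", "this is a string"], ["###"]))
```

The first call mirrors the "-" example (everything ends up missing), and the 
second mirrors the "###" example (the blank field survives as an empty string).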

IMO this behavior makes sense.

> CSV data source treats empty string as null no matter what nullValue option is
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17916
>                 URL: https://issues.apache.org/jira/browse/SPARK-17916
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



