[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594046#comment-15594046 ]
Felix Cheung commented on SPARK-17916:
--------------------------------------

So here's what happens. First, R's read.csv clearly documents that it treats an empty/blank string the same as NA under the following condition: "Blank fields are also considered to be missing values in logical, integer, numeric and complex fields."

Second, in this example in R, the 2nd column is read as "logical" instead of "character" (i.e. string) as one might expect:

{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
1    1   NA
2    2   NA
> str(df)
'data.frame':   2 obs. of  2 variables:
 $ col1: int  1 2
 $ col2: logi NA NA
{code}

And that is why the blank string is turned into NA. Whereas if the data.frame has a character/factor column instead, the blank field is retained as blank:

{code}
> d <- "col1,col2
+ 1,\"###\"
+ 2,\"\"
+ 3,\"this is a string\""
> df <- read.csv(text=d, quote="\"", na.strings=c("###"))
> df
  col1             col2
1    1             <NA>
2    2                 
3    3 this is a string
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 2 levels "","this is a string": NA 1 2
{code}

IMO this behavior makes sense.

> CSV data source treats empty string as null no matter what nullValue option is
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17916
>                 URL: https://issues.apache.org/jira/browse/SPARK-17916
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
>
> When a user configures {{nullValue}} in the CSV data source, in addition to those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
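For reference, a minimal Scala sketch that reproduces the reported Spark behavior with the sample data above. Assumptions: Spark 2.0.x, run in spark-shell where {{spark}} is the existing SparkSession; the temp file path is illustrative only.

{code}
// Reproduction sketch for SPARK-17916 (assumes spark-shell, Spark 2.0.x).
import java.nio.file.Files

// Sample data from the issue description, written to a temp file for the CSV reader.
val csv = Seq("col1,col2", "1,\"-\"", "2,\"\"").mkString("\n")
val path = Files.createTempDirectory("spark-17916").resolve("data.csv")
Files.write(path, csv.getBytes)

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")   // only "-" is configured to become null
  .load(path.toString)

df.show()
// As reported, col2 is null in both rows: the empty string in row 2 is also
// converted to null, regardless of the nullValue setting.
{code}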