[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460111#comment-17460111
 ] 

Max Gekk commented on SPARK-37604:
--

> but "EMPTY" strings in csv files can not be parsed as empty columns

The use case when an empty string is interpreted as non-empty like "EMPTY" is 
not clear to me. [~Wayne Guo] Could you describe the case when it is needed.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091
 ] 

Wei Guo commented on SPARK-37604:
-

[~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your 
free time, I'd like to hear your thoughts on this.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460084#comment-17460084
 ] 

Wei Guo commented on SPARK-37604:
-

Maybe this issue is not a notable bug or promotion, but, for users' common 
usage, they prefer to be able to convert these emptyValue strings in csv files 
into ""(empty strings) again after writing out empty strings as emptyValue 
strings rather than current behaviors.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For