[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ] Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
the written csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than the string "EMPTY". !empty_test.png|width=701,height=286!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, the round trip is irreversible. FYI [~maxgekk]
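The asymmetry described in this comment can be sketched in plain Scala. This is a hypothetical illustration, not Spark's actual writer/parser code; the object and method names are invented for the sketch:

```scala
// Sketch of the emptyValue round trip described in the comment above.
// writeField models the CSV writer's emptyValue substitution; the two
// readField variants model the current and the requested reader behavior.
object EmptyValueRoundTrip {
  val emptyValue: String = "EMPTY"

  // Writing: Spark substitutes emptyValue for empty-string fields.
  def writeField(v: String): String =
    if (v == "") emptyValue else v

  // Reading (current behavior): the marker string comes back as-is,
  // so "" -> "EMPTY" -> "EMPTY" and the round trip is lossy.
  def readFieldCurrent(s: String): String = s

  // Reading (behavior requested in this issue): fields matching
  // emptyValue are mapped back to "", making the round trip reversible.
  def readFieldProposed(s: String): String =
    if (s == emptyValue) "" else s
}
```

With readFieldProposed, writing and reading with the same emptyValue option would be symmetric, matching how nullValue already behaves.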
> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
> ----------------------------------
>
>                 Key: SPARK-37604
>                 URL: https://issues.apache.org/jira/browse/SPARK-37604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Wei Guo
>            Priority: Major
>         Attachments: empty_test.png
>
> The csv data format was imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766].
> {*}For the nullValue option{*}, according to the features described in the spark-csv readme file, it is designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
> {code}
> The saved csv file is:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is:
> ||make||comment||
> |Tesla|null|
> We can see that null columns in the dataframe are saved as "NULL" strings in csv files and {color:#00875a}*"NULL" strings in csv files are parsed back as null columns*{color} in the dataframe.
> That is:
> {noformat}
> When writing, convert null (in dataframe) to nullValue (in csv)
> When reading, convert nullValue or nothing (in csv) to null (in dataframe)
> {noformat}
> But actually, the nullValue option in the dependency univocity's {*}_CommonSettings_{*} is designed as:
> {noformat}
> when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.
> {noformat}
> {*}There is a difference when reading{*}. In univocity, empty content is converted to nullValue strings. But in Spark, we finally convert empty content or nullValue strings to null in the *_UnivocityParser_* method *_nullSafeDatum_*:
> {code:scala}
> private def nullSafeDatum(
>     datum: String,
>     name: String,
>     nullable: Boolean,
>     options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
>     if (!nullable) {
>       throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
>     }
>     null
>   } else {
>     converter.apply(datum)
>   }
> }
> {code}
>
> From now on, let's talk about emptyValue.
> {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behaviors for emptyValue as univocity, that is:
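The null-handling logic of nullSafeDatum above, and the analogous emptyValue handling this issue asks for, can be sketched together in plain Scala. This is a hypothetical illustration with invented names (it drops Spark's CSVOptions, nullability check, and converter for brevity), not a patch to UnivocityParser:

```scala
// Sketch combining Spark's existing nullValue mapping with the proposed
// emptyValue mapping. None models a null column; Some("") models an
// empty-string column recovered from the emptyValue marker.
object EmptySafeDatum {
  def emptySafeDatum(
      datum: String,
      nullValue: String,
      emptyValue: String): Option[String] = {
    if (datum == null || datum == nullValue) {
      None // existing behavior: nullValue (or missing content) becomes null
    } else if (datum == emptyValue) {
      Some("") // proposed behavior: the emptyValue marker becomes ""
    } else {
      Some(datum)
    }
  }
}
```

The first branch mirrors what nullSafeDatum already does for nullValue; the second branch is the symmetric treatment of emptyValueInRead that this issue proposes.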