[jira] [Commented] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868981#comment-17868981
 ] 

Wei Guo commented on SPARK-49016:
-

I made a PR for this issue: https://github.com/apache/spark/pull/47506

> Spark DataSet.isEmpty behaviour is different on CSV than JSON
> -
>
> Key: SPARK-49016
> URL: https://issues.apache.org/jira/browse/SPARK-49016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1, 3.4.3
>Reporter: Marius Butan
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-07-26-15-50-10-280.png, 
> image-2024-07-26-15-50-24-308.png
>
>
> Spark DataSet.isEmpty behaviour is different on CSV than JSON:
>  * CSV → dataSet.isEmpty returns a result for any query
>  * JSON → dataSet.isEmpty throws an error when the only filter is _corrupt_record is null:
> !image-2024-07-26-15-50-10-280.png!
> Tested versions: Spark 3.4.3, Spark 3.5.1
> Expected behaviour: throw an error on both file types, or return the correct value on both
>  
> To demonstrate the behaviour I added a unit test:
>  
> test.csv
> {code:java}
> first,second,third{code}
> test.json
> {code:java}
> {"first": "first", "second": "second", "third": "third"}{code}
> Code:
> {noformat}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.junit.jupiter.api.AfterEach;
> import org.junit.jupiter.api.BeforeEach;
> import org.junit.jupiter.api.Test;
>
> public class SparkIsEmptyTest {
>
>     private SparkSession sparkSession;
>
>     @BeforeEach
>     void setUp() {
>         sparkSession = getSpark();
>     }
>
>     @AfterEach
>     void after() {
>         sparkSession.close();
>     }
>
>     @Test
>     void testDatasetIsEmptyForCsv() {
>         var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForJson() {
>         var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForJsonAnd1Eq1() {
>         var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and 1=1");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForCsvAnd1Eq1() {
>         var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and 1=1");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForJsonAndOtherCondition() {
>         var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and first='first'");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForCsvAndOtherCondition() {
>         var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and first='first'");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForJsonAggregation() {
>         var dataSet = runJsonQuery("select count(1) from tempView where _corrupt_record is null");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForCsvAggregation() {
>         var dataSet = runCsvQuery("select count(1) from tempView where _corrupt_record is null");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForJsonAggregationGroupBy() {
>         var dataSet = runJsonQuery("select count(1), first from tempView where _corrupt_record is null group by first");
>         assert !dataSet.isEmpty();
>     }
>
>     @Test
>     void testDatasetIsEmptyForCsvAggregationGroupBy() {
>         var dataSet = runCsvQuery("select count(1), first from tempView where _corrupt_record is null group by first");
>         assert !dataSet.isEmpty();
>     }
>
>     private SparkSession getSpark() {
>         return SparkSession.builder()
>                 .master("local")
>                 .appName("spark-dataset-isEmpty-issue")
>                 .config("spark.ui.enabled", "false")
>                 .getOrCreate();
>     }
>
>     private Dataset<Row> runJsonQuery(String query) {
>         Dataset<Row> dataset = sparkSession.read()
>                 .schema("first STRING,second String, third STRING, _corrupt_record STRING")
>                 .option("columnNameOfCorruptRecord", "_corrupt_record")
>                 .json("test.json");
> 
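> // (The quoted report is truncated here in the archive. The lines below are an assumed
> // completion of the two helper methods, added for illustration and not taken from the report.)
> //
> //         dataset.createOrReplaceTempView("tempView");
> //         return sparkSession.sql(query);
> //     }
> //
> //     private Dataset<Row> runCsvQuery(String query) {
> //         Dataset<Row> dataset = sparkSession.read()
> //                 .schema("first STRING,second String, third STRING, _corrupt_record STRING")
> //                 .option("columnNameOfCorruptRecord", "_corrupt_record")
> //                 .csv("test.csv");
> //         dataset.createOrReplaceTempView("tempView");
> //         return sparkSession.sql(query);
> //     }
> // }
> {noformat}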

[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In Spark, the mask function shows unexpected behavior when it is applied to a string that contains an invalid UTF-8 character or a wide character.

Example: masking a string that contains the wide character {{}} with `*`
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
can return ** instead of *. It looks like mask treats {{}} as 2 characters.

Example: using the wide character {{}} as the mask character produces garbled output
{code:sql}
select mask("ABC", "");
{code}
The result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:sql}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`; it looks like Spark treats it as the four characters `\`, `x`, `E`, `D`.

It looks like mask can only handle BMP characters (that is, 16-bit characters) and cannot guarantee the result for invalid UTF-8 characters or wide characters.

My question is: *is this a limitation / issue of the mask function, or is mask by design only meant to handle BMP characters?*

If it is a limitation of the mask function, could Spark document it in the mask function documentation or comments?

 

  was:
In the spark the mask function when apply with a string contains invalid 
character or wide character would cause unexpected behavior.

Example to use `*` mask a stirng contains wide-character {{}}
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem
{code:sql}
select mask("ABC", "");
{code}
result is `???`.

Example to mask a string contains a invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.

My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> 
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In Spark, the mask function shows unexpected behavior when it is applied to a string that contains an invalid UTF-8 character or a wide character.
> Example: masking a string that contains the wide character {{}} with `*`
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> can return ** instead of *. It looks like mask treats {{}} as 2 characters.
> Example: using the wide character {{}} as the mask character produces garbled output
> {code:sql}
> select mask("ABC", "");
> {code}
> The result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:sql}
> select mask("\xED");
> {code}
> The result is `xXX` instead of `\xED`; it looks like Spark treats it as the four characters `\`, `x`, `E`, `D`.
> It looks like mask can only handle BMP characters (that is, 16-bit characters) and cannot guarantee the result for invalid UTF-8 characters or wide characters.
> My question is: *is this a limitation / issue of the mask function, or is mask by design only meant to handle BMP characters?*
> If it is a limitation of the mask function, could Spark document it in the mask function documentation or comments?
>  
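For context, a minimal Java sketch of why a single wide character can show up as two characters (U+1F600 is used here only as a stand-in non-BMP character, since the actual character from the report did not survive this archive):

{code:java}
public class WideCharDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs two UTF-16 code units (a surrogate pair).
        String wide = new String(Character.toChars(0x1F600));
        System.out.println(wide.length());                          // 2 (UTF-16 code units)
        System.out.println(wide.codePointCount(0, wide.length()));  // 1 (user-visible character)
        // If mask() replaces characters per UTF-16 code unit rather than per code point,
        // one wide character is masked as two '*' characters, matching the report above.
    }
}
{code}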



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In Spark, the mask function shows unexpected behavior when it is applied to a string that contains an invalid UTF-8 character or a wide character.

Example: masking a string that contains the wide character {{}} with `*`
{code:sql}
select mask("", "Y", "y", "n", "*");
{code}
can return ** instead of *. It looks like mask treats {{}} as 2 characters.

Example: using the wide character {{}} as the mask character produces garbled output
{code:sql}
select mask("ABC", "");
{code}
The result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:sql}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`; it looks like Spark treats it as the four characters `\`, `x`, `E`, `D`.

It looks like mask can only handle BMP characters (that is, 16-bit characters) and cannot guarantee the result for invalid UTF-8 characters or wide characters.

My question is: *is this a limitation / issue of the mask function, or is mask by design only meant to handle BMP characters?*

If it is a limitation of the mask function, could Spark document it in the mask function documentation or comments?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{}}


{code:sql}
select mask("", "Y", "y", "n", "*");
{code}


could cause result is ** instead of *. Looks spark mask treat {{}} as 2 
characters.

Example to use wide-character {{}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> 
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In Spark, the mask function shows unexpected behavior when it is applied to a string that contains an invalid UTF-8 character or a wide character.
> Example: masking a string that contains the wide character {{}} with `*`
> {code:sql}
> select mask("", "Y", "y", "n", "*");
> {code}
> can return ** instead of *. It looks like mask treats {{}} as 2 characters.
> Example: using the wide character {{}} as the mask character produces garbled output
> {code:sql}
> select mask("ABC", "");
> {code}
> The result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:sql}
> select mask("\xED");
> {code}
> The result is `xXX` instead of `\xED`; it looks like Spark treats it as the four characters `\`, `x`, `E`, `D`.
> It looks like mask can only handle BMP characters (that is, 16-bit characters) and cannot guarantee the result for invalid UTF-8 characters or wide characters.
> My question is: *is this a limitation / issue of the mask function, or is mask by design only meant to handle BMP characters?*
> If it is a limitation of the mask function, could Spark document it in the mask function documentation or comments?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48926) Use the `checkError` method to optimize the exception check logic related to `UNRESOLVED_COLUMN` error classes

2024-07-17 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48926:

Issue Type: Improvement  (was: Bug)

> Use the `checkError` method to optimize the exception check logic related to 
> `UNRESOLVED_COLUMN` error classes
> --
>
> Key: SPARK-48926
> URL: https://issues.apache.org/jira/browse/SPARK-48926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48926) Use the `checkError` method to optimize the exception check logic related to `UNRESOLVED_COLUMN` error classes

2024-07-17 Thread Wei Guo (Jira)
Wei Guo created SPARK-48926:
---

 Summary: Use the `checkError` method to optimize the exception 
check logic related to `UNRESOLVED_COLUMN` error classes
 Key: SPARK-48926
 URL: https://issues.apache.org/jira/browse/SPARK-48926
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48915) Add inequality (!=, <, <=, >, >=) predicates for correlation in GeneratedSubquerySuite

2024-07-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17866743#comment-17866743
 ] 

Wei Guo commented on SPARK-48915:
-

I have made a PR for this issue.

> Add inequality (!=, <, <=, >, >=) predicates for correlation in 
> GeneratedSubquerySuite
> --
>
> Key: SPARK-48915
> URL: https://issues.apache.org/jira/browse/SPARK-48915
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Priority: Major
>  Labels: pull-request-available
>
> {{GeneratedSubquerySuite}} is a test suite that generates SQL with variations 
> of subqueries. Currently, the operators supported are Joins, Set Operations, 
> Aggregate (with/without group by) and Limit. Implementing inequality (!=, <, 
> <=, >, >=) predicates will increase coverage by 1 additional axis, and should 
> be simple.
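
To make the new axis concrete, here is a hypothetical example of a subquery correlated through an inequality predicate (the table and column names are made up and assume already-registered temp views; they are not taken from GeneratedSubquerySuite):

{code:java}
// Illustrative only: assumes an existing SparkSession `spark` and temp views t1/t2.
spark.sql(
    "SELECT t1.id, t1.amount FROM t1 " +
    "WHERE t1.amount < (SELECT max(t2.amount) FROM t2 WHERE t2.region != t1.region)"
).show();
{code}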



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48893) Add some examples for linearRegression built-in functions

2024-07-14 Thread Wei Guo (Jira)
Wei Guo created SPARK-48893:
---

 Summary: Add some examples for linearRegression built-in functions
 Key: SPARK-48893
 URL: https://issues.apache.org/jira/browse/SPARK-48893
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48882) Assign names to streaming output mode related error classes

2024-07-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48882:

Summary: Assign names to streaming output mode related error classes  (was: 
Assign streaming output mode related error classes)

> Assign names to streaming output mode related error classes
> ---
>
> Key: SPARK-48882
> URL: https://issues.apache.org/jira/browse/SPARK-48882
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48882) Assign streaming output mode related error classes

2024-07-12 Thread Wei Guo (Jira)
Wei Guo created SPARK-48882:
---

 Summary: Assign streaming output mode related error classes
 Key: SPARK-48882
 URL: https://issues.apache.org/jira/browse/SPARK-48882
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48858) Remove deprecated `setDaemon` method call of `Thread` in `log_communication.py`

2024-07-10 Thread Wei Guo (Jira)
Wei Guo created SPARK-48858:
---

 Summary: Remove deprecated `setDaemon` method call of `Thread` in 
`log_communication.py`
 Key: SPARK-48858
 URL: https://issues.apache.org/jira/browse/SPARK-48858
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48848) Set the upper bound version of sphinxcontrib-* in dev/requirements.txt with sphinx==4.5.0

2024-07-09 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48848:

Summary: Set the upper bound version of sphinxcontrib-* in 
dev/requirements.txt with sphinx==4.5.0  (was: Pin 'sphinxcontrib-*' in 
`dev/requirements.txt` with `sphinx==4.5.0`)

> Set the upper bound version of sphinxcontrib-* in dev/requirements.txt with 
> sphinx==4.5.0
> -
>
> Key: SPARK-48848
> URL: https://issues.apache.org/jira/browse/SPARK-48848
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48848) Pin 'sphinxcontrib-*' in `dev/requirements.txt` with `sphinx==4.5.0`

2024-07-09 Thread Wei Guo (Jira)
Wei Guo created SPARK-48848:
---

 Summary: Pin 'sphinxcontrib-*' in `dev/requirements.txt` with 
`sphinx==4.5.0`
 Key: SPARK-48848
 URL: https://issues.apache.org/jira/browse/SPARK-48848
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48846) Fix the incorrect namings and missing params in func docs in `builtin.py`

2024-07-09 Thread Wei Guo (Jira)
Wei Guo created SPARK-48846:
---

 Summary: Fix the incorrect namings and missing params in func docs 
in `builtin.py`
 Key: SPARK-48846
 URL: https://issues.apache.org/jira/browse/SPARK-48846
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48826) Upgrade `fasterxml.jackson` to 2.17.2

2024-07-06 Thread Wei Guo (Jira)
Wei Guo created SPARK-48826:
---

 Summary: Upgrade `fasterxml.jackson` to 2.17.2
 Key: SPARK-48826
 URL: https://issues.apache.org/jira/browse/SPARK-48826
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48814) Upgrade tink to 1.14.0

2024-07-04 Thread Wei Guo (Jira)
Wei Guo created SPARK-48814:
---

 Summary: Upgrade tink to 1.14.0
 Key: SPARK-48814
 URL: https://issues.apache.org/jira/browse/SPARK-48814
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48812) Add some test suites for mariadb jdbc connector

2024-07-04 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48812:

Component/s: Connect

> Add some test suites for mariadb jdbc connector
> ---
>
> Key: SPARK-48812
> URL: https://issues.apache.org/jira/browse/SPARK-48812
> Project: Spark
>  Issue Type: Test
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48812) Add some test suites for mariadb jdbc connector

2024-07-04 Thread Wei Guo (Jira)
Wei Guo created SPARK-48812:
---

 Summary: Add some test suites for mariadb jdbc connector
 Key: SPARK-48812
 URL: https://issues.apache.org/jira/browse/SPARK-48812
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48795) Upgrade mysql-connector-j to 9.0.0

2024-07-03 Thread Wei Guo (Jira)
Wei Guo created SPARK-48795:
---

 Summary: Upgrade mysql-connector-j to 9.0.0
 Key: SPARK-48795
 URL: https://issues.apache.org/jira/browse/SPARK-48795
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Tests
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48738) Correct since version for built-in func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user`, `session_user`, `char_length`, `character_length`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since version for built-in func alias `random`, 
`position`, `mod`, `cardinality`, `current_schema`, `user`, `session_user`, 
`char_length`, `character_length`  (was: Correct since version for built-in 
func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` 
and `session_user`)

> Correct since version for built-in func alias `random`, `position`, `mod`, 
> `cardinality`, `current_schema`, `user`, `session_user`, `char_length`, 
> `character_length`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48738) Correct since version for built-in func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since version for built-in func alias `random`, 
`position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`  
(was: Correct since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`)

> Correct since version for built-in func alias `random`, `position`, `mod`, 
> `cardinality`, `current_schema`, `user` and `session_user`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48738) Correct since for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`  (was: Update since 
for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, 
`user` and `session_user`)

> Correct since for method alias `random`, `position`, `mod`, `cardinality`, 
> `current_schema`, `user` and `session_user`
> --
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48738) Update since for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Update since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`  (was: Update since 
for `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and 
`session_user`)

> Update since for method alias `random`, `position`, `mod`, `cardinality`, 
> `current_schema`, `user` and `session_user`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48738) Update since for `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)
Wei Guo created SPARK-48738:
---

 Summary: Update since for `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`
 Key: SPARK-48738
 URL: https://issues.apache.org/jira/browse/SPARK-48738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-48719) Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL

2024-06-27 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48719 ]


Wei Guo deleted comment on SPARK-48719:
-

was (Author: wayne guo):
I made a [PR|https://github.com/apache/spark/pull/47105] for this.

> Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>
> When calculating slope and intercept using the regr_slope & regr_intercept
> aggregates (using the Java API):
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +--++
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +--++
> |1.4545454545454546| 0.09090909090909083|
> +--++ {code}
> The correct answer should be 2.0 and -1.0 obviously.
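> (Worked check, using population statistics as regr_slope does: after dropping tuples where either side is NULL, the remaining (y, x) pairs are (1, 1), (2, 3), (3, 5), so avg(y) = 2, avg(x) = 3, covar_pop(x, y) = 4/3 and var_pop(y) = 2/3, giving regr_slope(x, y) = (4/3) / (2/3) = 2.0 and regr_intercept(x, y) = avg(x) - 2.0 * avg(y) = -1.0.)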
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> .. {code}
> CovPopulation filters out tuples whose right *OR* left value is NULL,
> but VariancePop only filters out rows where the right expression is NULL.
> This causes a wrong result when some tuples have a NULL left value (and a
> non-NULL right value).
> {*}The same applies to RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
> right, Literal.create(null, right.dataType))) 
> .{code}
> *same fix to RegrIntercept*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48732) Cleanup deprecated api usage related to JdbcDialect.compileAggregate

2024-06-26 Thread Wei Guo (Jira)
Wei Guo created SPARK-48732:
---

 Summary: Cleanup deprecated api usage related to 
JdbcDialect.compileAggregate
 Key: SPARK-48732
 URL: https://issues.apache.org/jira/browse/SPARK-48732
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48719) Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL

2024-06-26 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17860200#comment-17860200
 ] 

Wei Guo commented on SPARK-48719:
-

I made a [PR|https://github.com/apache/spark/pull/47105] for this.

> Wrong Result in regr_slope_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>
> When calculating slope and intercept using the regr_slope & regr_intercept
> aggregates (using the Java API):
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +--++
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +--++
> |1.4545454545454546| 0.09090909090909083|
> +--++ {code}
> The correct answer should be 2.0 and -1.0 obviously.
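> (Worked check, using population statistics as regr_slope does: after dropping tuples where either side is NULL, the remaining (y, x) pairs are (1, 1), (2, 3), (3, 5), so avg(y) = 2, avg(x) = 3, covar_pop(x, y) = 4/3 and var_pop(y) = 2/3, giving regr_slope(x, y) = (4/3) / (2/3) = 2.0 and regr_intercept(x, y) = avg(x) - 2.0 * avg(y) = -1.0.)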
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> .. {code}
> CovPopulation filters out tuples whose right *OR* left value is NULL,
> but VariancePop only filters out rows where the right expression is NULL.
> This causes a wrong result when some tuples have a NULL left value (and a
> non-NULL right value).
> {*}The same applies to RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
> right, Literal.create(null, right.dataType))) 
> .{code}
> *same fix to RegrIntercept*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48724:

Description: 
The code is as follows:
{code:scala}
withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> sqlConf) {
  withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
    // ...
  }
}
{code}
The inner withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") overwrites the outer configuration, making it impossible to test the situation where sqlConf is true.

  was:
The code as belows:

 


> Fix incorrect conf settings of ignoreCorruptFiles related tests case in 
> ParquetQuerySuite
> -
>
> Key: SPARK-48724
> URL: https://issues.apache.org/jira/browse/SPARK-48724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> The code is as follows:
> {code:scala}
> withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> sqlConf) {
>   withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
>     // ...
>   }
> }
> {code}
> The inner withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") overwrites the outer
> configuration, making it impossible to test the situation where sqlConf is true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48724:

Description: 
The code as belows:

 

> Fix incorrect conf settings of ignoreCorruptFiles related tests case in 
> ParquetQuerySuite
> -
>
> Key: SPARK-48724
> URL: https://issues.apache.org/jira/browse/SPARK-48724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> The code as belows:
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)
Wei Guo created SPARK-48724:
---

 Summary: Fix incorrect conf settings of ignoreCorruptFiles related 
tests case in ParquetQuerySuite
 Key: SPARK-48724
 URL: https://issues.apache.org/jira/browse/SPARK-48724
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

2024-06-25 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39901 ]


Wei Guo deleted comment on SPARK-39901:
-

was (Author: wayne guo):
The `ignoreCorruptFiles` features in SQL(spark.sql.files.ignoreCorruptFiles) 
and RDD(spark.files.ignoreCorruptFiles) scenarios need to be included both. 

> Reconsider design of ignoreCorruptFiles feature
> ---
>
> Key: SPARK-39901
> URL: https://issues.apache.org/jira/browse/SPARK-39901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I'm filing this ticket as a followup to the discussion at 
> [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] 
> regarding the `ignoreCorruptFiles` feature: the current implementation is
> biased towards considering a broad range of IOExceptions to be corruption, but
> this is likely overly broad and might mis-identify transient errors as
> corruption (causing non-corrupt data to be erroneously discarded).
> SPARK-39389 fixes one instance of that problem, but we are still vulnerable 
> to similar issues because of the overall design of this feature.
> I think we should reconsider the design of this feature: maybe we should 
> switch the default behavior so that only an explicit allowlist of known 
> corruption exceptions can cause files to be skipped. This could be done 
> through involvement of other parts of the code, e.g. rewrapping exceptions 
> into a `CorruptFileException` so higher layers can positively identify 
> corruption.
> Any changes to behavior here could potentially impact users jobs, so we'd 
> need to think carefully about when we want to change (in a 3.x release? 4.x?) 
> and how we want to provide escape hatches (e.g. configs to revert back to old 
> behavior). 
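
As a rough illustration of the allowlist-plus-rewrapping idea sketched above (all class and method names here are hypothetical, not Spark APIs):

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.zip.ZipException;

// Hypothetical sketch: wrap only positively identified corruption errors so that
// higher layers can skip exactly those files and let transient IO errors propagate.
final class CorruptFileException extends RuntimeException {
    CorruptFileException(String path, Throwable cause) {
        super("Corrupt file: " + path, cause);
    }
}

final class CorruptionClassifier {
    // Explicit allowlist of exception types known to indicate corruption.
    private static final List<Class<? extends IOException>> KNOWN_CORRUPTION =
            List.of(ZipException.class, EOFException.class);

    interface IoSupplier<T> {
        T get() throws IOException;
    }

    static <T> T readOrRewrap(String path, IoSupplier<T> reader) {
        try {
            return reader.get();
        } catch (IOException e) {
            if (KNOWN_CORRUPTION.stream().anyMatch(c -> c.isInstance(e))) {
                throw new CorruptFileException(path, e); // skippable by an ignoreCorruptFiles layer
            }
            throw new UncheckedIOException(e);           // transient errors still fail the task
        }
    }
}
{code}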



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48691) Upgrade `scalatest` related dependencies to the 3.2.18 series

2024-06-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48691:

Summary: Upgrade `scalatest` related dependencies to the 3.2.18 series  
(was: Upgrade `mockito` to 5.12.0)

> Upgrade `scalatest` related dependencies to the 3.2.18 series
> -
>
> Key: SPARK-48691
> URL: https://issues.apache.org/jira/browse/SPARK-48691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 2:33 PM:
--

This is controlled by the option `maxStringLen`, whose default value is
20,000,000. If you set this option to a large enough value when reading, you get
the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I ran a test with a string of length 20,000,010, which confirms this:

!image-2024-06-22-15-33-38-833.png!
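
A minimal Java sketch of the same workaround (assumed, not from the original comment; it presumes an existing SparkSession {{spark}} and a JSON file at {{path}}, and the limit value is illustrative, only needing to exceed the longest string in the data):

{code:java}
// Raise the JSON string-length limit when reading; 100000000 is an illustrative value.
// Requires: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read()
        .option("maxStringLen", 100000000)
        .json(path);
df.printSchema();
{code}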

 


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string,
> Spark will incorrectly turn it into a corrupted record even though the format is
> correct. Here is a minimal example with PySpark:
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 7:36 AM:
--

This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!

 


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string,
> Spark will incorrectly turn it into a corrupted record even though the format is
> correct. Here is a minimal example with PySpark:
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48689:

Attachment: image-2024-06-22-15-33-38-833.png

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string,
> Spark will incorrectly turn it into a corrupted record even though the format is
> correct. Here is a minimal example with PySpark:
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo commented on SPARK-48689:
-

This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string,
> Spark will incorrectly turn it into a corrupted record even though the format is
> correct. Here is a minimal example with PySpark:
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
>
>     # Print the schema
>     print(df)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48671) Add test cases for Hex.hex

2024-06-20 Thread Wei Guo (Jira)
Wei Guo created SPARK-48671:
---

 Summary: Add test cases for Hex.hex
 Key: SPARK-48671
 URL: https://issues.apache.org/jira/browse/SPARK-48671
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo commented on SPARK-48660:
-

I am working on this, and thank you for the recommendation, [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo edited comment on SPARK-48660 at 6/19/24 4:18 AM:
--

I am working on this, and thank you for the recommendation, [~LuciferYang].


was (Author: wayne guo):
I am working on this and thank your for recommendation [~yangjie01] .

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48661) Upgrade RoaringBitmap to 1.1.0

2024-06-18 Thread Wei Guo (Jira)
Wei Guo created SPARK-48661:
---

 Summary: Upgrade RoaringBitmap to 1.1.0
 Key: SPARK-48661
 URL: https://issues.apache.org/jira/browse/SPARK-48661
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error  (was:  
Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error
> -
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217   (was:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
LEGACY_ERROR_TEMP[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:



>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48635) Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)
Wei Guo created SPARK-48635:
---

 Summary:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 
 Key: SPARK-48635
 URL: https://issues.apache.org/jira/browse/SPARK-48635
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48614:

Description: (was: There are some deprecated classes and methods in 
commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream)

> Cleanup deprecated api usage related to kafka-clients
> -
>
> Key: SPARK-48614
> URL: https://issues.apache.org/jira/browse/SPARK-48614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)
Wei Guo created SPARK-48614:
---

 Summary: Cleanup deprecated api usage related to kafka-clients
 Key: SPARK-48614
 URL: https://issues.apache.org/jira/browse/SPARK-48614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
Assignee: Wei Guo
 Fix For: 4.0.0


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)
Wei Guo created SPARK-48604:
---

 Summary: Replace deprecated classes and methods of arrow-vector 
called in Spark
 Key: SPARK-48604
 URL: https://issues.apache.org/jira/browse/SPARK-48604
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48604:

Description: 
There are some deprecated classes and methods in arrow-vector called in Spark, 
we need to replace them:
 * ArrowType.Decimal(precision, scale)

  was:
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream


> Replace deprecated classes and methods of arrow-vector called in Spark
> --
>
> Key: SPARK-48604
> URL: https://issues.apache.org/jira/browse/SPARK-48604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in arrow-vector called in 
> Spark; we need to replace them:
>  * ArrowType.Decimal(precision, scale)
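For illustration, a minimal sketch of the replacement call, assuming Arrow's three-argument constructor that takes an explicit bit width (the precision and scale values are arbitrary examples, not taken from the Spark code):
{code:java}
import org.apache.arrow.vector.types.pojo.ArrowType

// Deprecated two-argument form:
//   new ArrowType.Decimal(38, 18)
// Replacement: pass the bit width explicitly (128 bits matches the old default).
val decimalType = new ArrowType.Decimal(38, 18, 128)
{code}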



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of commons-io called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of commons-io called in 
Spark  (was: Replace deprecated classes and methods of `commons-io` called in 
Spark)

> Replace deprecated classes and methods of commons-io called in Spark
> 
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark; 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  *   `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  * `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of `commons-io` called in 
Spark  (was: Replace deprecated `FileUtils#writeStringToFile` )

> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> Method `writeStringToFile(final File file, final String data)` in class 
> `FileUtils` is deprecated, use `writeStringToFile(final File file, final 
> String data, final Charset charset)` instead in UDFXPathUtilSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`

2024-06-11 Thread Wei Guo (Jira)
Wei Guo created SPARK-48583:
---

 Summary: Replace deprecated `FileUtils#writeStringToFile` 
 Key: SPARK-48583
 URL: https://issues.apache.org/jira/browse/SPARK-48583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.
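A minimal sketch of the replacement call (the file path and content below are illustrative only, not taken from UDFXPathUtilSuite):
{code:java}
import java.io.File
import java.nio.charset.StandardCharsets

import org.apache.commons.io.FileUtils

val file = new File("/tmp/example.xml")  // hypothetical path for illustration
// Deprecated: FileUtils.writeStringToFile(file, "<a>text</a>")
FileUtils.writeStringToFile(file, "<a>text</a>", StandardCharsets.UTF_8)
{code}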



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48581) Upgrade dropwizard metrics to 4.2.26

2024-06-10 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48581:

Summary: Upgrade dropwizard metrics to 4.2.26  (was: Upgrade dropwizard 
metrics 4.2.26)

> Upgrade dropwizard metrics to 4.2.26
> 
>
> Key: SPARK-48581
> URL: https://issues.apache.org/jira/browse/SPARK-48581
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48581) Upgrade dropwizard metrics 4.2.26

2024-06-10 Thread Wei Guo (Jira)
Wei Guo created SPARK-48581:
---

 Summary: Upgrade dropwizard metrics 4.2.26
 Key: SPARK-48581
 URL: https://issues.apache.org/jira/browse/SPARK-48581
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48539) Upgrade docker-java to 3.3.6

2024-06-05 Thread Wei Guo (Jira)
Wei Guo created SPARK-48539:
---

 Summary: Upgrade docker-java to 3.3.6
 Key: SPARK-48539
 URL: https://issues.apache.org/jira/browse/SPARK-48539
 Project: Spark
  Issue Type: Improvement
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Wei Guo
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850238#comment-17850238
 ] 

Wei Guo commented on SPARK-47259:
-

Update `_LEGACY_ERROR_TEMP_32[08-14]` to `_LEGACY_ERROR_TEMP_32[09-14]`, because `_LEGACY_ERROR_TEMP_3208` is not related to interval errors.

> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-47259:

Description: 
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]

  was:
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[08-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]


> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
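As a rough illustration of the test pattern described above (the triggering query, error class name, and parameters below are placeholders, not the final values; the real ones depend on the name chosen in error-classes.json):
{code:java}
// Placeholder sketch of a checkError()-based test in a Spark SQL test suite.
val e = intercept[org.apache.spark.sql.AnalysisException] {
  sql("SELECT ...")   // placeholder query that triggers one of the interval errors
}
checkError(
  exception = e,
  errorClass = "SOME_INTERVAL_ERROR",   // placeholder, to be replaced by the chosen name
  parameters = Map("input" -> "..."))   // placeholder parameters
{code}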



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13

2023-02-12 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687573#comment-17687573
 ] 

Wei Guo commented on SPARK-40678:
-

Fixed by PR 38154 https://github.com/apache/spark/pull/38154

> JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
> 
>
> Key: SPARK-40678
> URL: https://issues.apache.org/jira/browse/SPARK-40678
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.2.0
>Reporter: Cédric Chantepie
>Priority: Major
>
> In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly 
> supported with JSON; e.g.
> {noformat}
> import org.apache.spark.sql.SparkSession
> case class KeyValue(key: String, value: Array[Byte])
> val spark = 
> SparkSession.builder().master("local[1]").appName("test").getOrCreate()
> import spark.implicits._
> val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF()
> df.foreach(r => println(r.json))
> {noformat}
> Expected:
> {noformat}
> [{foo, bar}]
> {noformat}
> Encountered:
> {noformat}
> java.lang.IllegalArgumentException: Failed to convert value 
> ArraySeq([foo,[B@dcdb68f]) (class of class 
> scala.collection.mutable.ArraySeq$ofRef}) with the type of 
> ArrayType(Seq(StructField(key,StringType,false), 
> StructField(value,BinaryType,false)),true) to JSON.
>   at org.apache.spark.sql.Row.toJson$1(Row.scala:604)
>   at org.apache.spark.sql.Row.jsonValue(Row.scala:613)
>   at org.apache.spark.sql.Row.jsonValue$(Row.scala:552)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166)
>   at org.apache.spark.sql.Row.json(Row.scala:535)
>   at org.apache.spark.sql.Row.json$(Row.scala:535)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166)
> {noformat}
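A possible workaround sketch, not taken from the ticket: serialize on the DataFrame side with the built-in {{to_json}} function instead of calling {{Row.json}} (this assumes the default column name {{value}} produced by {{toDF()}}):
{code:java}
import org.apache.spark.sql.functions.{col, to_json}

// Convert the array-of-struct column to a JSON string column before collecting,
// so the JSON conversion happens in the query instead of via Row.json.
val jsonDf = df.select(to_json(col("value")).as("json"))
jsonDf.collect().foreach(r => println(r.getString(0)))
{code}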



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39348) Create table in overwrite mode fails when interrupted

2023-02-09 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686390#comment-17686390
 ] 

Wei Guo commented on SPARK-39348:
-

After PR [https://github.com/apache/spark/pull/26559], this legacy option has been removed.
 * Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting {{spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation}} to {{true}} restores the previous behavior. This option will be removed in Spark 3.0.
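For reference, this is how that flag was set on Spark 2.4.x (as noted, it no longer exists in 3.0 and later):
{code:java}
// Spark 2.4.x only; the option was removed in Spark 3.0.
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
{code}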

> Create table in overwrite mode fails when interrupted
> -
>
> Key: SPARK-39348
> URL: https://issues.apache.org/jira/browse/SPARK-39348
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1
>Reporter: Max
>Priority: Major
>
> When you attempt to rerun an Apache Spark write operation by cancelling the 
> currently running job, the following error occurs:
> {code:java}
> Error: org.apache.spark.sql.AnalysisException: Cannot create the managed 
> table('`testdb`.` testtable`').
> The associated location 
> ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already 
> exists.;{code}
> This problem can occur if:
>  * The cluster is terminated while a write operation is in progress.
>  * A temporary network issue occurs.
>  * The job is interrupted.
> You can reproduce the problem by following these steps:
> 1. Create a DataFrame:
> {code:java}
> val df = spark.range(1000){code}
> 2. Write the DataFrame to a location in overwrite mode:
> {code:java}
> df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable"){code}
> 3. Cancel the command while it is executing.
> 4. Re-run the {{write}} command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 2.9.0. The 
upgrade also brought in a new univocity-parsers feature: values of the first column 
that start with the comment character are quoted. This was a breaking change for 
downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set the comment option to '\u0000' to keep the behavior as 
before, because of the newly added `isCommentSet` check logic, shown as follows:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|default behavior: the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|default behavior: the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior as flows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a liitle bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 


> Pass the comment option 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior as flows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a liitle bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.


> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 
> 2.9.0. The upgrade also brought in a new univocity-parsers feature: values of 
> the first column that start with the comment character are quoted. This was a 
> breaking change for downstream users that handle a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicitly in CSV dataSource  (was: Pass the comment option through to 
univocity if users set it explicity in CSV dataSource)

> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 
> 2.9.0. The upgrade also brought in a new univocity-parsers feature: values of 
> the first column that start with the comment character are quoted. This was a 
> breaking change for downstream users that handle a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set the comment option to '\u0000' to keep the behavior as 
> before, because of the newly added `isCommentSet` check logic, shown as follows:
> {code:java}
> val isCommentSet = this.comment != '\u0000'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.
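One possible shape of the change described above (a sketch with hypothetical helper names; not the actual patch):
{code:java}
// Minimal illustration of the proposed check (hypothetical helper names, not Spark code):
// an explicitly supplied "comment" option is honored even when it equals the default.
def isCommentSetExplicitly(parameters: Map[String, String]): Boolean =
  parameters.contains("comment")

// versus the current value-based check, which cannot tell an explicit '\u0000' apart:
def isCommentSetByValue(comment: Char): Boolean =
  comment != '\u0000'
{code}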



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to Univocity if users set it 
explicitly in CSV dataSource.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 
> 2.9.0. The upgrade also brought in a new univocity-parsers feature: values of 
> the first column that start with the comment character are quoted. This was a 
> breaking change for downstream users that handle a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, 
because of the newly added `isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in the CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to Univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, at 
least until univocity-parsers releases a new version, because of the 
`isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  xx
  if (isCommentSet) {
format.setComment(comment)
  }
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, at 
least until univocity-parsers releases a new version, because of the 
`isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
}
 {code}

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   xx
>   if (isCommentSet) {
> format.setComment(comment)
>   }
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config that restores the legacy behavior 
until univocity-parsers releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicity in CSV dataSource  (was: Add a legacy config for restoring writer's 
comment option behavior in CSV dataSource)

> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Fix Version/s: 3.5.0
   (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Affects Version/s: 3.3.0
   3.2.0
   3.1.0
   3.4.0

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Target Version/s: 3.5.0  (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-10-083.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parsers, but it seems it will take a long time for it to be merged.
 
For Spark, it's better to add a legacy config that restores the legacy behavior 
until univocity-parsers releases a new version.
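As a rough illustration of what such a legacy switch might look like from the 
user side, here is a minimal sketch; the config key is purely hypothetical, and 
the issue was later retitled in favor of passing the comment option through 
instead:
{code:scala}
// Hypothetical legacy flag, shown only to illustrate the proposal; it is not an
// actual Spark configuration key.
spark.conf.set("spark.sql.legacy.csv.quoteCommentedFirstColumn", "false")

// Writing the same data would then produce the pre-3.0 output again.
Seq(("#abc", 1)).toDF("value", "id").write.csv("/tmp/comment_test")
{code}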

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is shown as:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-01-596.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)
Wei Guo created SPARK-42335:
---

 Summary: Add a legacy config for restoring writer's comment option 
behavior in CSV dataSource
 Key: SPARK-42335
 URL: https://issues.apache.org/jira/browse/SPARK-42335
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which is 
used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0 in order to fix some 
bugs. The upgrade also brought in a new univocity-parsers feature that quotes 
values of the first column when they start with the comment character. This was a 
breaking change for downstream users that handle a whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files was:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parsers, but it seems it will take a long time for it to be merged.
 
For Spark, it's better to add a legacy config that restores the legacy behavior 
until univocity-parsers releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-02-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Description: 
When a binary column is written into CSV files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei/Desktop/binary_csv")
{code}
The CSV file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary column is saved as a table with the CSV file format, the 
table can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it's better to make binary an unsupported data type in the CSV format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).
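A minimal sketch of the kind of up-front check this proposes; it is not the 
actual Spark patch, and the method name and error message are illustrative:
{code:scala}
import org.apache.spark.sql.types.{BinaryType, StructType}

// Reject BinaryType before reading/writing CSV instead of silently producing
// object.toString() output; names and message are illustrative only.
def verifyCsvSchema(schema: StructType): Unit = {
  schema.fields.foreach { field =>
    field.dataType match {
      case BinaryType =>
        throw new UnsupportedOperationException(
          s"CSV data source does not support binary data type (column ${field.name}).")
      case _ => // other supported types pass through
    }
  }
}
{code}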

  was:
When a binary colunm is written into csv files, actual content of this colunm 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}

The csv file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
can't be read back successfully.

{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
binaryDataTable").show()
{code}

!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it' better to change binary to unsupported dataType in csv format, 
both for datasource v1(CSVFileFormat) and v2(CSVTable).


> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = Seq(Array[Byte](1,2)).toDF
> df.write.csv("/Users/guowei/Desktop/binary_csv")
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, Array[Byte](1,2))).toDF
> df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
> binaryDataTable").show()
> {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-01-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-42252:
---

 Summary: Deprecate spark.shuffle.unsafe.file.output.buffer and add 
a new config
 Key: SPARK-42252
 URL: https://issues.apache.org/jira/browse/SPARK-42252
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


After SPARK-28209 and PR 
[25007|https://github.com/apache/spark/pull/25007], a new shuffle writer API was 
introduced. All shuffle writers (BypassMergeSortShuffleWriter, 
SortShuffleWriter, UnsafeShuffleWriter) are now based on 
LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
spark.shuffle.unsafe.file.output.buffer is used in LocalDiskShuffleMapOutputWriter, 
but it was originally only used in UnsafeShuffleWriter.
 
It's better to rename it to something more suitable.
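For illustration, here is a minimal sketch of the user-facing side of such a 
rename; the new key name below is an assumption, and the deprecated key is 
assumed to remain usable as an alias during a transition period:
{code:scala}
import org.apache.spark.SparkConf

// Assumed new key (illustrative only); the deprecated key would be kept as an
// alias so existing jobs keep working.
val conf = new SparkConf()
  .set("spark.shuffle.localDisk.file.output.buffer", "64k")
// Equivalent legacy setting that the new config would supersede:
// .set("spark.shuffle.unsafe.file.output.buffer", "64k")
{code}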



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681978#comment-17681978
 ] 

Wei Guo commented on SPARK-42237:
-

a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42237 ]


Wei Guo deleted comment on SPARK-42237:
-

was (Author: wayne guo):
a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Description: 
When a binary column is written into CSV files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}
The CSV file's content is as follows:
!image-2023-01-30-17-21-09-212.png|width=141,height=29!
Meanwhile, if a binary column is saved as a table with the CSV file format, the 
table can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to make binary an unsupported data type in the CSV format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).

  was:
When a binary colunm is written into csv files, actual content of this colunm 
is {*}object.toString(){*}, which is meaningless. 
{code:java}
val df = 
Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
{code}
The csv file's content is as follows:
!image-2023-01-30-17-18-16-372.png|width=104,height=21!
Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
can't be read back successfully.
{code:java}
val df = Seq((1, 
Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
 * from binaryDataTable").show() {code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it' better to change binary to unsupported dataType in csv format, 
both for datasource v1(CSVFileFormat) and v2(CSVTable).


> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Attachment: image-2023-01-30-17-21-09-212.png

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless. 
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-18-16-372.png|width=104,height=21!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-42237:
---

 Summary: change binary to unsupported dataType in csv format
 Key: SPARK-42237
 URL: https://issues.apache.org/jira/browse/SPARK-42237
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.1, 2.4.8
Reporter: Wei Guo
 Fix For: 3.4.0


When a binary column is written into CSV files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}
The CSV file's content is as follows:
!image-2023-01-30-17-18-16-372.png|width=104,height=21!
Meanwhile, if a binary column is saved as a table with the CSV file format, the 
table can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to make binary an unsupported data type in the CSV format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

2022-07-28 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572502#comment-17572502
 ] 

Wei Guo commented on SPARK-39901:
-

The `ignoreCorruptFiles` feature needs to be covered in both the SQL 
(spark.sql.files.ignoreCorruptFiles) and RDD (spark.files.ignoreCorruptFiles) 
scenarios. 

> Reconsider design of ignoreCorruptFiles feature
> ---
>
> Key: SPARK-39901
> URL: https://issues.apache.org/jira/browse/SPARK-39901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I'm filing this ticket as a followup to the discussion at 
> [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] 
> regarding the `ignoreCorruptFiles` feature: the current implementation is 
> based towards considering a broad range of IOExceptions to be corruption, but 
> this is likely overly-broad and might mis-identify transient errors as 
> corruption (causing non-corrupt data to be erroneously discarded).
> SPARK-39389 fixes one instance of that problem, but we are still vulnerable 
> to similar issues because of the overall design of this feature.
> I think we should reconsider the design of this feature: maybe we should 
> switch the default behavior so that only an explicit allowlist of known 
> corruption exceptions can cause files to be skipped. This could be done 
> through involvement of other parts of the code, e.g. rewrapping exceptions 
> into a `CorruptFileException` so higher layers can positively identify 
> corruption.
> Any changes to behavior here could potentially impact users jobs, so we'd 
> need to think carefully about when we want to change (in a 3.x release? 4.x?) 
> and how we want to provide escape hatches (e.g. configs to revert back to old 
> behavior). 
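As a rough sketch of the allowlist idea described above (not Spark code; the 
exception types are merely examples of errors that clearly indicate corruption):
{code:scala}
import java.io.{EOFException, IOException}
import java.util.zip.ZipException

// Only exceptions positively known to indicate corruption are rewrapped, so that
// higher layers can identify and skip the file; other IOExceptions, which may be
// transient, propagate unchanged and are never treated as corruption.
class CorruptFileException(path: String, cause: Throwable)
  extends IOException(s"Corrupt file detected: $path", cause)

def readWithCorruptionCheck[T](path: String)(read: => T): T =
  try read catch {
    case e: ZipException => throw new CorruptFileException(path, e)
    case e: EOFException => throw new CorruptFileException(path, e)
  }
{code}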



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings

2022-01-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37575:

Description: 
As mentioned in the SQL migration 
guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
{noformat}
Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 
2.3 and earlier, empty strings are equal to null values and do not reflect to 
any characters in saved CSV files. For example, the row of "a", null, "", 1 was 
written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore 
the previous behavior, set the CSV option emptyValue to empty (not quoted) 
string.{noformat}
But actually, both empty strings and null values are saved as quoted empty 
strings "", instead of "" for empty strings and nothing for null values.

code:
{code:java}
val data = List("spark", null, "").toDF("name")
data.coalesce(1).write.csv("spark_csv_test")
{code}
 actual result:
{noformat}
line1: spark
line2: ""
line3: ""{noformat}
expected result:
{noformat}
line1: spark
line2: 
line3: ""
{noformat}
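To make the expected distinction concrete, here is a minimal sketch (not Spark's 
actual writer code) of the per-field conversion described above, using the 
default nullValue and emptyValue write options:
{code:scala}
// With default settings, a null field should produce nothing, while an empty
// string should produce the quoted empty string "".
def toCsvField(value: String, nullValue: String = "", emptyValue: String = "\"\""): String =
  value match {
    case null => nullValue   // null -> (nothing)
    case ""   => emptyValue  // ""   -> ""
    case v    => v
  }

// toCsvField("spark") == "spark", toCsvField(null) == "", toCsvField("") == "\"\""
{code}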

  was:
As mentioned in sql migration 
guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
{noformat}
Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 
2.3 and earlier, empty strings are equal to null values and do not reflect to 
any characters in saved CSV files. For example, the row of "a", null, "", 1 was 
written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore 
the previous behavior, set the CSV option emptyValue to empty (not quoted) 
string.{noformat}
 

But actually, both empty strings and null values are saved as quoted empty 
Strings "" rather than "" (for empty strings) and nothing(for null values)。

code:
{code:java}
val data = List("spark", null, "").toDF("name")
data.coalesce(1).write.csv("spark_csv_test")
{code}
 actual result:
{noformat}
line1: spark
line2: ""
line3: ""{noformat}
expected result:
{noformat}
line1: spark
line2: 
line3: ""
{noformat}


> null values should be saved as nothing rather than quoted empty Strings "" 
> with default settings
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
> Fix For: 3.3.0
>
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing(for null values)。
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo resolved SPARK-37604.
-
Resolution: Not A Problem

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format was imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] in issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766].
> {*}For the nullValue option{*}, according to the features described in the 
> spark-csv readme file, it is designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the underlying univocity component's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing (no content) 
> is converted to the nullValue string. But in Spark, we finally convert both 
> nothing and nullValue strings to null in the *_UnivocityParser_* 
> *_nullSafeDatum_* method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now on, let's talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we convert nothing or nullValue strings to null in 
> the dataframe, but for emptyValue, we only try to convert "\"\"" (quoted 
> empty strings) to emptyValue strings rather than converting both "\"\"" 
> (quoted empty strings) and emptyValue strings to "" (empty) in the dataframe.
> I think it's better that if we 

[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461525#comment-17461525
 ] 

Wei Guo commented on SPARK-37604:
-

Well, I think your explanation is clear and reasonable, and it convinced me. 
So I'll close this issue and the related PR. Thank you!


[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461393#comment-17461393
 ] 

Wei Guo commented on SPARK-37604:
-

Following Hyukjin Kwon's consideration in the related PR, if we are worried 
about making a breaking change, we can add a new option to support it.
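
A rough sketch of what such an opt-in could look like on the read path; the 
option name below is made up purely for illustration and does not exist in 
Spark:
{code:scala}
// Hypothetical only: "emptyValueAsEmptyStringInRead" is NOT a real Spark option.
// It just illustrates the kind of opt-in flag discussed above, which would map
// fields matching emptyValue back to "" when reading instead of keeping them.
spark.read
  .option("emptyValue", "EMPTY")
  .option("emptyValueAsEmptyStringInRead", "true") // made-up option name
  .csv("/tmp/empty_value_csv")                     // illustrative path
  .show()
{code}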


[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461390#comment-17461390
 ] 

Wei Guo commented on SPARK-37604:
-

In short, for null values, we can save nulls in the dataframe as "NULL" strings 
in csv files and read "NULL" strings back as nulls with the same nullValue 
option ("NULL"). But for empty values, if we save empty strings in the 
dataframe as "EMPTY" strings in csv files, we cannot read "EMPTY" strings back 
as empty values with the same emptyValue option ("EMPTY"); we finally get 
"EMPTY" strings. [~maxgekk]
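
A minimal round-trip sketch of that asymmetry, assuming spark.implicits._ is in 
scope and an illustrative path; the read-back behavior noted in the comments is 
the one described in this issue:
{code:scala}
// Sketch of the nullValue/emptyValue asymmetry described above.
val path = "/tmp/null_vs_empty_csv"  // illustrative path
val df = Seq(("Tesla", null: String), ("Ford", "")).toDF("make", "comment")

df.write
  .option("nullValue", "NULL")    // nulls are written as the string NULL
  .option("emptyValue", "EMPTY")  // empty strings are written as the string EMPTY
  .csv(path)

spark.read
  .schema(df.schema)
  .option("nullValue", "NULL")    // "NULL" is read back as null -> round-trips
  .option("emptyValue", "EMPTY")  // "EMPTY" is NOT read back as "" -> stays "EMPTY"
  .csv(path)
  .show()
{code}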


[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Issue Type: Improvement  (was: Bug)


[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png|width=701,height=286!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: (was: image-2021-12-16-01-57-55-864.png)


[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:04 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.


[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible.


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

