[jira] [Commented] (SPARK-49520) ArrayRemove() Function Need Remove NULL Value

2024-09-05 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879735#comment-17879735
 ] 

Wei Guo commented on SPARK-49520:
-

[~Shadowell] Oh, yes, I made a mistake. 

arrays_overlap(a1, a2) - Returns true if a1 contains at least a non-null 
element present also in a2. If the arrays have no common element and they are 
both non-empty and {color:red}either of them contains a null element null is 
returned{color}, false otherwise.


So it's better to use array_compact to remove the nulls first.
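
For example, a minimal sketch of that approach (illustrative only, not taken from the 
original report):
{code:sql}
-- Trim the nulls from the first array before taking the intersection
SELECT size(array_intersect(array_compact(array(1, 2, 3, null)), array(null)));
-- expected: 0, since no non-null element is shared
{code}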

> ArrayRemove() Function Need Remove NULL Value
> -
>
> Key: SPARK-49520
> URL: https://issues.apache.org/jira/browse/SPARK-49520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: *Spark Version: 3.2.1*
>Reporter: Feng Jie
>Priority: Major
>  Labels: Function, SQL
> Fix For: 3.2.1, 3.3.0, 3.2.2, 3.2.3, 3.2.4, 3.3.3, 3.4.2, 3.3.2, 
> 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.5.2, 3.4.3, 3.4.4, 3.5.3
>
> Attachments: image-2024-09-06-09-52-45-339.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I want to calculate the intersection of two arrays like this: 
> {noformat}
> select   
> case when intersect_size > 0 then 1 else 0 end as is_include
> from ( 
> select 
> size(array_intersect(array_a, array_b)) as intersect_size 
> from table_a
> )
> {noformat}
>  
> But the NULL will affect the output:
> {code:java}
> SELECT size(array_intersect(array(1, 2, 3, null), array(null)))
> Output: 1 {code}
> So I want to remove the NULL in the first array by using {*}array_remove{*}:
> {code:java}
> SELECT array_remove(array(1, 2, 3, null, 3), null) 
> Output: null{code}
> I want to add extra logic to the *array_remove* function to remove NULL. Shall I 
> overload the function (maybe named: array_remove(array_a, array_b, 
> isIgnoreNull)) or just fix the original function?
>  






[jira] [Commented] (SPARK-49350) FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated result

2024-08-26 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876886#comment-17876886
 ] 

Wei Guo commented on SPARK-49350:
-

[~bersprockets] [~NathanKan] Yes, with the latest code, this problem has been 
solved. I checked with Spark 4.0.0 on the master branch.

> FoldablePropagation rule and ConstantFolding rule leads to wrong aggregated 
> result
> --
>
> Key: SPARK-49350
> URL: https://issues.apache.org/jira/browse/SPARK-49350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GANHONGNAN
>Priority: Blocker
>
> {code:java}
> SELECT  cast(-1 AS BIGINT) AS ele1
> FROM(
>SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele=-1 {code}
> This query returns an empty result. However, the following query returns 1.  
> This result seems wrong.
> {code:java}
> SELECT  count(DISTINCT ele1)
> FROM(
> SELECT  cast(-1 as bigint) as ele1
> FROM(
> SELECT  array(1, 5, 3, 123, 255, 546, 64, 23) AS t
>
> ) LATERAL VIEW explode(t) tmp AS ele
> WHERE   ele = -1
> ) {code}
> From the plan change log, I found that it is the FoldablePropagation rule and 
> the ConstantFolding rule that optimize the Aggregate expression to 
> `Aggregate [cast(count(distinct -1) as string) AS count(DISTINCT ele)#7]`.
>  
> Is this result right?  Does it need to be fixed? 
>  
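
For reference, a hedged sanity check (not from the original report): the inner query 
produces no rows, so the outer distinct count should be 0, as with any explicitly 
empty input:
{code:sql}
-- COUNT(DISTINCT ...) over an empty input is 0, so the query above should also return 0
SELECT count(DISTINCT ele1)
FROM (SELECT cast(-1 AS BIGINT) AS ele1 FROM range(0));
{code}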






[jira] [Updated] (SPARK-49314) Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11

2024-08-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49314:

Summary: Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 
12.8.1.jre11  (was: Upgrade `h2` to 2.3.232)

> Upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11
> ---
>
> Key: SPARK-49314
> URL: https://issues.apache.org/jira/browse/SPARK-49314
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-49124) Upgrade tink to 1.14.1

2024-08-06 Thread Wei Guo (Jira)
Wei Guo created SPARK-49124:
---

 Summary: Upgrade tink to 1.14.1
 Key: SPARK-49124
 URL: https://issues.apache.org/jira/browse/SPARK-49124
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-49097) Add Python3 environment detection for the `build_error_docs` method in `build_api_decs.rb`

2024-08-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49097:

Summary: Add Python3 environment detection for the `build_error_docs` 
method in `build_api_decs.rb`  (was: Add Python3 environment detection for the 
`build_orror_docs` method in `build_api_decs.rb`)

> Add Python3 environment detection for the `build_error_docs` method in 
> `build_api_decs.rb`
> --
>
> Key: SPARK-49097
> URL: https://issues.apache.org/jira/browse/SPARK-49097
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-49097) Add Python3 environment detection for the `build_orror_docs` method in `build_api_decs.rb`

2024-08-02 Thread Wei Guo (Jira)
Wei Guo created SPARK-49097:
---

 Summary: Add Python3 environment detection for the 
`build_orror_docs` method in `build_api_decs.rb`
 Key: SPARK-49097
 URL: https://issues.apache.org/jira/browse/SPARK-49097
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-49095) Update DecimalType compatible logic of Avro datasource to avoid loss of decimal precision

2024-08-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49095:

Summary: Update DecimalType compatible logic of Avro datasource to avoid 
loss of decimal precision  (was: Update DecimalType compatible logic to avoid 
loss of decimal precision)

> Update DecimalType compatible logic of Avro datasource to avoid loss of 
> decimal precision
> -
>
> Key: SPARK-49095
> URL: https://issues.apache.org/jira/browse/SPARK-49095
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Updated] (SPARK-49095) Update DecimalType compatible logic of Avro data source to avoid loss of decimal precision

2024-08-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49095:

Summary: Update DecimalType compatible logic of Avro data source to avoid 
loss of decimal precision  (was: Update DecimalType compatible logic of Avro 
datas ource to avoid loss of decimal precision)

> Update DecimalType compatible logic of Avro data source to avoid loss of 
> decimal precision
> --
>
> Key: SPARK-49095
> URL: https://issues.apache.org/jira/browse/SPARK-49095
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Updated] (SPARK-49095) Update DecimalType compatible logic of Avro datas ource to avoid loss of decimal precision

2024-08-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49095:

Summary: Update DecimalType compatible logic of Avro datas ource to avoid 
loss of decimal precision  (was: Update DecimalType compatible logic of Avro 
datasource to avoid loss of decimal precision)

> Update DecimalType compatible logic of Avro datas ource to avoid loss of 
> decimal precision
> --
>
> Key: SPARK-49095
> URL: https://issues.apache.org/jira/browse/SPARK-49095
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Created] (SPARK-49095) Update DecimalType compatible logic to avoid loss of decimal precision

2024-08-02 Thread Wei Guo (Jira)
Wei Guo created SPARK-49095:
---

 Summary: Update DecimalType compatible logic to avoid loss of 
decimal precision
 Key: SPARK-49095
 URL: https://issues.apache.org/jira/browse/SPARK-49095
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-49081) Add data source options docs of Protobuf

2024-08-01 Thread Wei Guo (Jira)
Wei Guo created SPARK-49081:
---

 Summary: Add data source options docs of Protobuf 
 Key: SPARK-49081
 URL: https://issues.apache.org/jira/browse/SPARK-49081
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-49072) Fix abnormal display of text content which contains two $ in one line but non-formula in docs

2024-07-31 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49072:

Summary: Fix abnormal display of text content which contains two $ in one 
line but non-formula in docs  (was: Fix abnormal display of text content which 
contains two $ in one line but not non-formula in docs)

> Fix abnormal display of text content which contains two $ in one line but 
> non-formula in docs
> -
>
> Key: SPARK-49072
> URL: https://issues.apache.org/jira/browse/SPARK-49072
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-49072) Fix abnormal display of text content which contains two $ in one line but not non-formula in docs

2024-07-31 Thread Wei Guo (Jira)
Wei Guo created SPARK-49072:
---

 Summary: Fix abnormal display of text content which contains two $ 
in one line but not non-formula in docs
 Key: SPARK-49072
 URL: https://issues.apache.org/jira/browse/SPARK-49072
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-49062) Migrate xml to File Data Source V2

2024-07-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-49062:
---

 Summary: Migrate xml to File Data Source V2
 Key: SPARK-49062
 URL: https://issues.apache.org/jira/browse/SPARK-49062
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-49062) Migrate XML to File Data Source V2

2024-07-30 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-49062:

Summary: Migrate XML to File Data Source V2  (was: Migrate xml to File Data 
Source V2)

> Migrate XML to File Data Source V2
> --
>
> Key: SPARK-49062
> URL: https://issues.apache.org/jira/browse/SPARK-49062
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Commented] (SPARK-49042) CodeGenerator: Error calculating stats of compiled class. java.lang.UnsupportedOperationException: empty.max

2024-07-30 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869671#comment-17869671
 ] 

Wei Guo commented on SPARK-49042:
-

[~arnaud.nauwynck]  Can you provide some code to construct a dataset to 
reproduce this warning log?

> CodeGenerator: Error calculating stats of compiled class. 
> java.lang.UnsupportedOperationException: empty.max
> 
>
> Key: SPARK-49042
> URL: https://issues.apache.org/jira/browse/SPARK-49042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.5.1
>Reporter: Arnaud Nauwynck
>Priority: Major
>
> CodeGenerator (here triggered via "dataset.count()") generates WARN logs for some 
> datasets.
> The thrown exception is caught, an error log is printed, and the code statistics 
> are WRONG because it increments "(-1, -1)" instead of the real values.
> Here is the error log:
> {noformat}
> WARN CodeGenerator: Error calculating stats of compiled class.
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce.max(TraversableOnce.scala:234)
>   at scala.collection.TraversableOnce.max$(TraversableOnce.scala:232)
>   at scala.collection.AbstractTraversable.max(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.$anonfun$updateAndGetCompilationStats$1(CodeGenerator.scala:1470)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1451)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1405)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1501)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1498)
>   at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1352)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:387)
>   at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagat

[jira] [Commented] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868981#comment-17868981
 ] 

Wei Guo commented on SPARK-49016:
-

I made a PR for this issue https://github.com/apache/spark/pull/47506

> Spark DataSet.isEmpty behaviour is different on CSV than JSON
> -
>
> Key: SPARK-49016
> URL: https://issues.apache.org/jira/browse/SPARK-49016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1, 3.4.3
>Reporter: Marius Butan
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-07-26-15-50-10-280.png, 
> image-2024-07-26-15-50-24-308.png
>
>
> Spark DataSet.isEmpty behaviour is different on CSV than on JSON:
>  * CSV → dataSet.isEmpty returns a result for any query
>  * JSON → dataSet.isEmpty throws an error when the filter is only 
> _corrupt_record is null:
> !image-2024-07-26-15-50-10-280.png!
> Tested versions: Spark 3.4.3, Spark 3.5.1
> Expected behaviour: throw the error on both file types or return the correct value
>  
> In order to demonstrate the behaviour, I added a unit test
>  
> test.csv
> {code:java}
> first,second,third{code}
> test.json
> {code:java}
> {"first": "first", "second": "second", "third": "third"}{code}
> Code:
> {noformat}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.junit.jupiter.api.AfterEach;
> import org.junit.jupiter.api.BeforeEach;
> import org.junit.jupiter.api.Test;
> public class SparkIsEmptyTest {
> private SparkSession sparkSession;
> @BeforeEach
> void setUp() {
> sparkSession = getSpark();
> }
> @AfterEach
> void after() {
> sparkSession.close();
> }
> @Test
> void testDatasetIsEmptyForCsv() {
> var dataSet = runCsvQuery("select first, second, third, 
> _corrupt_record from tempView where _corrupt_record is null");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForJson() {
> var dataSet = runJsonQuery("select first, second, third, 
> _corrupt_record from tempView where _corrupt_record is null");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForJsonAnd1Eq1() {
> var dataSet = runJsonQuery(
> "select first, second, third, _corrupt_record from tempView 
> where _corrupt_record is null and 1=1");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForCsvAnd1Eq1() {
> var dataSet = runCsvQuery(
> "select first, second, third, _corrupt_record from tempView 
> where _corrupt_record is null and 1=1");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForJsonAndOtherCondition() {
>var dataSet = runJsonQuery("select first, second, third, 
> _corrupt_record from tempView where _corrupt_record is null and 
> first='first'");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForCsvAndOtherCondition() {
> var dataSet = runCsvQuery("select first, second, third, 
> _corrupt_record from tempView where _corrupt_record is null and 
> first='first'");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForJsonAggregation() {
> var dataSet = runJsonQuery("select count(1) from tempView where 
> _corrupt_record is null");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForCsvAggregation() {
> var dataSet = runCsvQuery("select count(1) from tempView where 
> _corrupt_record is null");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForJsonAggregationGroupBy() {
> var dataSet = runJsonQuery("select count(1) , first from tempView 
> where _corrupt_record is null group by first");
> assert !dataSet.isEmpty();
> }
> @Test
> void testDatasetIsEmptyForCsvAggregationGroupBy() {
> var dataSet = runJsonQuery("select count(1) , first from tempView 
> where _corrupt_record is null group by first");
> assert !dataSet.isEmpty();
> }
> private SparkSession getSpark() {
> return SparkSession.builder()
> .master("local")
> .appName("spark-dataset-isEmpty-issue")
> .config("spark.ui.enabled", "false")
> .getOrCreate();
> }
> private Dataset runJsonQuery(String query) {
> Dataset dataset = sparkSession.read()
> .schema("first STRING,second String, third STRING, 
> _corrupt_record STRING")
> .option("columnNameOfCorruptRecord", "_corrupt_record")
> .json("test.json");
>   

[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In Spark, the mask function produces unexpected behavior when applied to a 
string that contains an invalid character or a wide character.

Example: using `*` to mask a string containing the wide character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
the result can be ** instead of *. It looks like mask treats {{🙂}} as 2 
characters.

Example: using the wide character {{🙂}} as the masking character produces 
garbled output
{code:sql}
select mask("ABC", "🙂");
{code}
the result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
the result is `xXX` instead of `\xED`; it looks like Spark treats it as the four 
characters `\`, `x`, `E`, `D`.

It looks like mask can only handle BMP characters (that is, 16 bits) and can't 
guarantee the result for invalid UTF-8 characters and wide characters when masking.

My question is: *is this a limitation/issue of the Spark mask function, or is 
mask by design only meant to handle BMP characters?*

If it is a limitation of the mask function, could Spark address this in the 
mask function documentation or comments?

 

  was:
In the spark the mask function when apply with a string contains invalid 
character or wide character would cause unexpected behavior.

Example to use `*` mask a stirng contains wide-character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
could cause result is ** instead of *. Looks spark mask treat {{🙂}} as 2 
characters.

Example to use wide-character {{🙂}} do mask would cause wrong garbled code 
problem
{code:sql}
select mask("ABC", "🙂");
{code}
result is `???`.

Example to mask a string contains a invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.

My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> 
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In Spark, the mask function produces unexpected behavior when applied to a 
> string that contains an invalid character or a wide character.
> Example: using `*` to mask a string containing the wide character {{🙂}}
> {code:sql}
> select mask("🙂", "Y", "y", "n", "*");
> {code}
> the result can be ** instead of *. It looks like mask treats {{🙂}} as 2 
> characters.
> Example: using the wide character {{🙂}} as the masking character produces 
> garbled output
> {code:sql}
> select mask("ABC", "🙂");
> {code}
> the result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> the result is `xXX` instead of `\xED`; it looks like Spark treats it as the four 
> characters `\`, `x`, `E`, `D`.
> It looks like mask can only handle BMP characters (that is, 16 bits) and can't 
> guarantee the result for invalid UTF-8 characters and wide characters when 
> masking.
> My question is: *is this a limitation/issue of the Spark mask function, or is 
> mask by design only meant to handle BMP characters?*
> If it is a limitation of the mask function, could Spark address this in the 
> mask function documentation or comments?
>  
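
For contrast, a hedged reference (the documented example for mask, not from this 
report) showing that plain BMP input masks as expected with the default masking 
characters:
{code:sql}
SELECT mask('abcd-EFGH-8765-4321');
-- xxxx-XXXX-nnnn-nnnn  (lowercase -> x, uppercase -> X, digits -> n, others kept)
{code}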






[jira] [Updated] (SPARK-48973) Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character

2024-07-22 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48973:

Description: 
In the spark the mask function when apply with a string contains invalid 
character or wide character would cause unexpected behavior.

Example to use `*` mask a stirng contains wide-character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
could cause result is ** instead of *. Looks spark mask treat {{🙂}} as 2 
characters.

Example to use wide-character {{🙂}} do mask would cause wrong garbled code 
problem
{code:sql}
select mask("ABC", "🙂");
{code}
result is `???`.

Example to mask a string contains a invalid UTF-8 character
{code:java}
select mask("\xED");
{code}
result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.

My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 

  was:
In the spark the mask function when apply with a stirng contains invalid 
character or wide character would cause unexpected behavior.


Example to use `*` mask a stirng contains wide-character {{🙂}}


{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}


could cause result is ** instead of *. Looks spark mask treat {{🙂}} as 2 
characters.

Example to use wide-character {{🙂}} do mask would cause wrong garbled code 
problem


{code:sql}
select mask("ABC", "🙂");
{code}

result is `???`.

Example to mask a string contains a invalid UTF-8 character

{code:java}
select mask("\xED");
{code}

result is `xXX` instead of `\xED`, looks spark treat it as four character `\`, 
`x`, `E`, `D`.

Looks spark mask can only handle BMP character (that is 16 bits) and can't 
guarantee result for invalid UTC-8 character and wide-character when doing mask.


My question here is *does that the limitation / issue of spark mask function or 
spark mask by design only handle for BMP character ?*

If it is a limitation of mask function, could spark address this part in mask 
function document or comments ?

 


> Unexpected behavior using spark mask function handle string contains invalid 
> UTF-8 or wide character
> 
>
> Key: SPARK-48973
> URL: https://issues.apache.org/jira/browse/SPARK-48973
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
> Environment: Ubuntu 22.04
>Reporter: Yangyang Gao
>Priority: Major
>
> In the spark the mask function when apply with a string contains invalid 
> character or wide character would cause unexpected behavior.
> Example to use `*` mask a stirng contains wide-character {{🙂}}
> {code:sql}
> select mask("🙂", "Y", "y", "n", "*");
> {code}
> could cause result is ** instead of *. Looks spark mask treat {{🙂}} as 2 
> characters.
> Example to use wide-character {{🙂}} do mask would cause wrong garbled code 
> problem
> {code:sql}
> select mask("ABC", "🙂");
> {code}
> result is `???`.
> Example to mask a string contains a invalid UTF-8 character
> {code:java}
> select mask("\xED");
> {code}
> result is `xXX` instead of `\xED`, looks spark treat it as four character 
> `\`, `x`, `E`, `D`.
> Looks spark mask can only handle BMP character (that is 16 bits) and can't 
> guarantee result for invalid UTC-8 character and wide-character when doing 
> mask.
> My question here is *does that the limitation / issue of spark mask function 
> or spark mask by design only handle for BMP character ?*
> If it is a limitation of mask function, could spark address this part in mask 
> function document or comments ?
>  






[jira] [Updated] (SPARK-48926) Use the `checkError` method to optimize the exception check logic related to `UNRESOLVED_COLUMN` error classes

2024-07-17 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48926:

Issue Type: Improvement  (was: Bug)

> Use the `checkError` method to optimize the exception check logic related to 
> `UNRESOLVED_COLUMN` error classes
> --
>
> Key: SPARK-48926
> URL: https://issues.apache.org/jira/browse/SPARK-48926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48926) Use the `checkError` method to optimize the exception check logic related to `UNRESOLVED_COLUMN` error classes

2024-07-17 Thread Wei Guo (Jira)
Wei Guo created SPARK-48926:
---

 Summary: Use the `checkError` method to optimize the exception 
check logic related to `UNRESOLVED_COLUMN` error classes
 Key: SPARK-48926
 URL: https://issues.apache.org/jira/browse/SPARK-48926
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Commented] (SPARK-48915) Add inequality (!=, <, <=, >, >=) predicates for correlation in GeneratedSubquerySuite

2024-07-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866743#comment-17866743
 ] 

Wei Guo commented on SPARK-48915:
-

I have made a PR for this issue.

> Add inequality (!=, <, <=, >, >=) predicates for correlation in 
> GeneratedSubquerySuite
> --
>
> Key: SPARK-48915
> URL: https://issues.apache.org/jira/browse/SPARK-48915
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nick Young
>Priority: Major
>  Labels: pull-request-available
>
> {{GeneratedSubquerySuite}} is a test suite that generates SQL with variations 
> of subqueries. Currently, the operators supported are Joins, Set Operations, 
> Aggregate (with/without group by) and Limit. Implementing inequality (!=, <, 
> <=, >, >=) predicates will increase coverage by 1 additional axis, and should 
> be simple.
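
For illustration, a hedged sketch of the kind of correlated inequality predicate such 
generated queries could exercise (table and column names are hypothetical, not from 
the suite):
{code:sql}
-- Hypothetical tables t1(a, c) and t2(b, c)
SELECT t1.a
FROM t1
WHERE t1.a < (SELECT max(t2.b) FROM t2 WHERE t2.c >= t1.c);
{code}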






[jira] [Created] (SPARK-48893) Add some examples for linearRegression built-in functions

2024-07-14 Thread Wei Guo (Jira)
Wei Guo created SPARK-48893:
---

 Summary: Add some examples for linearRegression built-in functions
 Key: SPARK-48893
 URL: https://issues.apache.org/jira/browse/SPARK-48893
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48882) Assign names to streaming output mode related error classes

2024-07-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48882:

Summary: Assign names to streaming output mode related error classes  (was: 
Assign streaming output mode related error classes)

> Assign names to streaming output mode related error classes
> ---
>
> Key: SPARK-48882
> URL: https://issues.apache.org/jira/browse/SPARK-48882
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Created] (SPARK-48882) Assign streaming output mode related error classes

2024-07-12 Thread Wei Guo (Jira)
Wei Guo created SPARK-48882:
---

 Summary: Assign streaming output mode related error classes
 Key: SPARK-48882
 URL: https://issues.apache.org/jira/browse/SPARK-48882
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48858) Remove deprecated `setDaemon` method call of `Thread` in `log_communication.py`

2024-07-10 Thread Wei Guo (Jira)
Wei Guo created SPARK-48858:
---

 Summary: Remove deprecated `setDaemon` method call of `Thread` in 
`log_communication.py`
 Key: SPARK-48858
 URL: https://issues.apache.org/jira/browse/SPARK-48858
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48848) Set the upper bound version of sphinxcontrib-* in dev/requirements.txt with sphinx==4.5.0

2024-07-09 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48848:

Summary: Set the upper bound version of sphinxcontrib-* in 
dev/requirements.txt with sphinx==4.5.0  (was: Pin 'sphinxcontrib-*' in 
`dev/requirements.txt` with `sphinx==4.5.0`)

> Set the upper bound version of sphinxcontrib-* in dev/requirements.txt with 
> sphinx==4.5.0
> -
>
> Key: SPARK-48848
> URL: https://issues.apache.org/jira/browse/SPARK-48848
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48848) Pin 'sphinxcontrib-*' in `dev/requirements.txt` with `sphinx==4.5.0`

2024-07-09 Thread Wei Guo (Jira)
Wei Guo created SPARK-48848:
---

 Summary: Pin 'sphinxcontrib-*' in `dev/requirements.txt` with 
`sphinx==4.5.0`
 Key: SPARK-48848
 URL: https://issues.apache.org/jira/browse/SPARK-48848
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48846) Fix the incorrect namings and missing params in func docs in `builtin.py`

2024-07-09 Thread Wei Guo (Jira)
Wei Guo created SPARK-48846:
---

 Summary: Fix the incorrect namings and missing params in func docs 
in `builtin.py`
 Key: SPARK-48846
 URL: https://issues.apache.org/jira/browse/SPARK-48846
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48826) Upgrade `fasterxml.jackson` to 2.17.2

2024-07-06 Thread Wei Guo (Jira)
Wei Guo created SPARK-48826:
---

 Summary: Upgrade `fasterxml.jackson` to 2.17.2
 Key: SPARK-48826
 URL: https://issues.apache.org/jira/browse/SPARK-48826
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48814) Upgrade tink to 1.14.0

2024-07-04 Thread Wei Guo (Jira)
Wei Guo created SPARK-48814:
---

 Summary: Upgrade tink to 1.14.0
 Key: SPARK-48814
 URL: https://issues.apache.org/jira/browse/SPARK-48814
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48812) Add some test suites for mariadb jdbc connector

2024-07-04 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48812:

Component/s: Connect

> Add some test suites for mariadb jdbc connector
> ---
>
> Key: SPARK-48812
> URL: https://issues.apache.org/jira/browse/SPARK-48812
> Project: Spark
>  Issue Type: Test
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Created] (SPARK-48812) Add some test suites for mariadb jdbc connector

2024-07-04 Thread Wei Guo (Jira)
Wei Guo created SPARK-48812:
---

 Summary: Add some test suites for mariadb jdbc connector
 Key: SPARK-48812
 URL: https://issues.apache.org/jira/browse/SPARK-48812
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48795) Upgrade mysql-connector-j to 9.0.0

2024-07-03 Thread Wei Guo (Jira)
Wei Guo created SPARK-48795:
---

 Summary: Upgrade mysql-connector-j to 9.0.0
 Key: SPARK-48795
 URL: https://issues.apache.org/jira/browse/SPARK-48795
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Tests
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48738) Correct since version for built-in func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user`, `session_user`, `char_length`, `character_length`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since version for built-in func alias `random`, 
`position`, `mod`, `cardinality`, `current_schema`, `user`, `session_user`, 
`char_length`, `character_length`  (was: Correct since version for built-in 
func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` 
and `session_user`)

> Correct since version for built-in func alias `random`, `position`, `mod`, 
> `cardinality`, `current_schema`, `user`, `session_user`, `char_length`, 
> `character_length`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Updated] (SPARK-48738) Correct since version for built-in func alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since version for built-in func alias `random`, 
`position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`  
(was: Correct since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`)

> Correct since version for built-in func alias `random`, `position`, `mod`, 
> `cardinality`, `current_schema`, `user` and `session_user`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Updated] (SPARK-48738) Correct since for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Correct since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`  (was: Update since 
for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, 
`user` and `session_user`)

> Correct since for method alias `random`, `position`, `mod`, `cardinality`, 
> `current_schema`, `user` and `session_user`
> --
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Updated] (SPARK-48738) Update since for method alias `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48738:

Summary: Update since for method alias `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`  (was: Update since 
for `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and 
`session_user`)

> Update since for method alias `random`, `position`, `mod`, `cardinality`, 
> `current_schema`, `user` and `session_user`
> -
>
> Key: SPARK-48738
> URL: https://issues.apache.org/jira/browse/SPARK-48738
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Created] (SPARK-48738) Update since for `random`, `position`, `mod`, `cardinality`, `current_schema`, `user` and `session_user`

2024-06-27 Thread Wei Guo (Jira)
Wei Guo created SPARK-48738:
---

 Summary: Update since for `random`, `position`, `mod`, 
`cardinality`, `current_schema`, `user` and `session_user`
 Key: SPARK-48738
 URL: https://issues.apache.org/jira/browse/SPARK-48738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] (SPARK-48719) Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL

2024-06-27 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48719 ]


Wei Guo deleted comment on SPARK-48719:
-

was (Author: wayne guo):
I made a [PR|https://github.com/apache/spark/pull/47105] for this.

> Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>
> When calculating slope and intercept using the regr_slope & regr_intercept 
> aggregates:
> (using the Java API)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +------------------+--------------------+
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +------------------+--------------------+
> |1.4545454545454546| 0.09090909090909083|
> +------------------+--------------------+ {code}
> The correct answer should be 2.0 and -1.0 obviously.
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> .. {code}
> CovPopulation will filter out tuples whose right *OR* left is NULL,
> but VariancePop will only filter out a null right expression.
> This will cause a wrong result when some of the tuples' left is null (and right 
> is not null).
> {*}The same applies to RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
> right, Literal.create(null, right.dataType))) 
> .{code}
> *same fix to RegrIntercept*






[jira] [Created] (SPARK-48732) Cleanup deprecated api usage related to JdbcDialect.compileAggregate

2024-06-26 Thread Wei Guo (Jira)
Wei Guo created SPARK-48732:
---

 Summary: Cleanup deprecated api usage related to 
JdbcDialect.compileAggregate
 Key: SPARK-48732
 URL: https://issues.apache.org/jira/browse/SPARK-48732
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Commented] (SPARK-48719) Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL

2024-06-26 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860200#comment-17860200
 ] 

Wei Guo commented on SPARK-48719:
-

I made a [PR|https://github.com/apache/spark/pull/47105] for this.

> Wrong Result in regr_slope&regr_intercept Aggregate with Tuples has NULL
> 
>
> Key: SPARK-48719
> URL: https://issues.apache.org/jira/browse/SPARK-48719
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Jonathon Lee
>Priority: Major
>
> When calculating slope and intercept using the regr_slope & regr_intercept 
> aggregates:
> (using the Java API)
> {code:java}
> spark.sql("drop table if exists tab");
> spark.sql("CREATE TABLE tab(y int, x int) using parquet");
> spark.sql("INSERT INTO tab VALUES (1, 1)");
> spark.sql("INSERT INTO tab VALUES (2, 3)");
> spark.sql("INSERT INTO tab VALUES (3, 5)");
> spark.sql("INSERT INTO tab VALUES (NULL, 3)");
> spark.sql("INSERT INTO tab VALUES (3, NULL)");
> spark.sql("SELECT " +
> "regr_slope(x, y), " +
> "regr_intercept(x, y)" +
> "FROM tab").show(); {code}
> Spark result:
> {code:java}
> +------------------+--------------------+
> |  regr_slope(x, y)|regr_intercept(x, y)|
> +------------------+--------------------+
> |1.4545454545454546| 0.09090909090909083|
> +------------------+--------------------+ {code}
> The correct answer should be 2.0 and -1.0 obviously.
>  
> Reason:
> In sql/catalyst/expressions/aggregate/linearRegression.scala,
>  
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(right)
> .. {code}
> CovPopulation will filter out tuples whose right *OR* left is NULL,
> but VariancePop will only filter out a null right expression.
> This will cause a wrong result when some of the tuples' left is null (and right 
> is not null).
> {*}The same applies to RegrIntercept{*}.
>  
> A possible fix:
> {code:java}
> case class RegrSlope(left: Expression, right: Expression) extends 
> DeclarativeAggregate
>   with ImplicitCastInputTypes with BinaryLike[Expression] {
>   private val covarPop = new CovPopulation(right, left)
>   private val varPop = new VariancePop(If(And(IsNotNull(left), 
> IsNotNull(right)),
> right, Literal.create(null, right.dataType))) 
> .{code}
> *same fix to RegrIntercept*
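
As a hedged workaround sketch until a fix lands (not from the original report), 
filtering out rows where either column is NULL restores the expected pairwise behavior:
{code:sql}
SELECT regr_slope(x, y), regr_intercept(x, y)
FROM tab
WHERE x IS NOT NULL AND y IS NOT NULL;
-- expected: 2.0 and -1.0 for the rows inserted above
{code}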






[jira] [Updated] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48724:

Description: 
The code is as follows:
{code:java}
withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> sqlConf) {
  withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
    // ...
  }
}{code}
The inner withSQLConf (SQLConf.IGNORE_CORRUPT_FILES.key -> "false") will 
overwrite the outer configuration, making it impossible to test the situation 
where sqlConf is true.

  was:
The code as belows:

 


> Fix incorrect conf settings of ignoreCorruptFiles related tests case in 
> ParquetQuerySuite
> -
>
> Key: SPARK-48724
> URL: https://issues.apache.org/jira/browse/SPARK-48724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> The code is as follows:
> {code:java}
> withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> sqlConf) {
>   withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
>     // ...
>   }
> }{code}
> The inner withSQLConf (SQLConf.IGNORE_CORRUPT_FILES.key -> "false") will 
> overwrite the outer configuration, making it impossible to test the situation 
> where sqlConf is true.






[jira] [Updated] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48724:

Description: 
The code as belows:

 

> Fix incorrect conf settings of ignoreCorruptFiles related tests case in 
> ParquetQuerySuite
> -
>
> Key: SPARK-48724
> URL: https://issues.apache.org/jira/browse/SPARK-48724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> The code as belows:
>  






[jira] [Created] (SPARK-48724) Fix incorrect conf settings of ignoreCorruptFiles related tests case in ParquetQuerySuite

2024-06-26 Thread Wei Guo (Jira)
Wei Guo created SPARK-48724:
---

 Summary: Fix incorrect conf settings of ignoreCorruptFiles related 
tests case in ParquetQuerySuite
 Key: SPARK-48724
 URL: https://issues.apache.org/jira/browse/SPARK-48724
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

2024-06-25 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39901 ]


Wei Guo deleted comment on SPARK-39901:
-

was (Author: wayne guo):
The `ignoreCorruptFiles` feature needs to cover both the SQL 
(spark.sql.files.ignoreCorruptFiles) and RDD (spark.files.ignoreCorruptFiles) 
scenarios. 

> Reconsider design of ignoreCorruptFiles feature
> ---
>
> Key: SPARK-39901
> URL: https://issues.apache.org/jira/browse/SPARK-39901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I'm filing this ticket as a followup to the discussion at 
> [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] 
> regarding the `ignoreCorruptFiles` feature: the current implementation is 
> based towards considering a broad range of IOExceptions to be corruption, but 
> this is likely overly-broad and might mis-identify transient errors as 
> corruption (causing non-corrupt data to be erroneously discarded).
> SPARK-39389 fixes one instance of that problem, but we are still vulnerable 
> to similar issues because of the overall design of this feature.
> I think we should reconsider the design of this feature: maybe we should 
> switch the default behavior so that only an explicit allowlist of known 
> corruption exceptions can cause files to be skipped. This could be done 
> through involvement of other parts of the code, e.g. rewrapping exceptions 
> into a `CorruptFileException` so higher layers can positively identify 
> corruption.
> Any changes to behavior here could potentially impact users jobs, so we'd 
> need to think carefully about when we want to change (in a 3.x release? 4.x?) 
> and how we want to provide escape hatches (e.g. configs to revert back to old 
> behavior). 
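
A rough sketch of the allowlist/rewrapping idea described above; the class and method names are illustrative only and are not an existing Spark API:
{code:scala}
// Illustrative only: positively identify corruption instead of treating any
// IOException as corruption.
class CorruptFileException(path: String, cause: Throwable)
  extends RuntimeException(s"Corrupt file: $path", cause)

def rewrapIfCorrupt[T](path: String)(body: => T): T =
  try body
  catch {
    // Only exceptions on an explicit allowlist are rewrapped; transient
    // IOExceptions keep propagating and fail the task instead of skipping data.
    case e: java.util.zip.ZipException =>
      throw new CorruptFileException(path, e)
  }

// Higher layers would then skip a file only on CorruptFileException when
// ignoreCorruptFiles is enabled.
{code}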



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48691) Upgrade `scalatest` related dependencies to the 3.2.18 series

2024-06-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48691:

Summary: Upgrade `scalatest` related dependencies to the 3.2.18 series  
(was: Upgrade `mockito` to 5.12.0)

> Upgrade `scalatest` related dependencies to the 3.2.18 series
> -
>
> Key: SPARK-48691
> URL: https://issues.apache.org/jira/browse/SPARK-48691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 2:33 PM:
--

This is controlled by the option `maxStringLen`, whose default value is 
20,000,000. If you set the option to a big enough value when reading, you can 
get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!
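
For completeness, a minimal sketch of how such a read would look in a spark-shell session (the value 30,000,000 is only an example above the default and is not taken from the original test, and the path is made up):
{code:scala}
// in spark-shell, where `spark` is already defined
val path = "/tmp/long_string.json"   // example path only

// Raise the per-field string length limit so that a very long JSON value is
// parsed instead of being routed to the corrupt-record column.
val df = spark.read
  .option("maxStringLen", 30000000L)   // any value larger than the 20,000,000 default
  .json(path)
df.printSchema()
{code}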

 


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file including a very long string, 
> spark will incorrectly make it a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {{import json}}
> {{import tempfile}}
> {{from pyspark.sql import SparkSession}}
> {{# Create a Spark session}}
> {{spark = (SparkSession.builder}}
> {{    .appName("PySpark JSON Example")}}
> {{    .getOrCreate()}}
> {{)}}
> {{# Define the JSON content}}
> {{data = {}}
> {{    "text": "a" * 1}}
> {{}}}
> {{# Create a temporary file}}
> {{with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:}}
> {{    # Write the JSON content to the temporary file}}
> {{    tmp_file.write(json.dumps(data) + "\n")}}
> {{    tmp_file_path = tmp_file.name}}
> {{    # Load the JSON file into a PySpark DataFrame}}
> {{    df = spark.read.json(tmp_file_path)}}
> {{    # Print the schema}}
> {{    print(df)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 7:36 AM:
--

This is controlled by the option `maxStringLen`, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!

 


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file including a very long string, 
> spark will incorrectly make it a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {{import json}}
> {{import tempfile}}
> {{from pyspark.sql import SparkSession}}
> {{# Create a Spark session}}
> {{spark = (SparkSession.builder}}
> {{    .appName("PySpark JSON Example")}}
> {{    .getOrCreate()}}
> {{)}}
> {{# Define the JSON content}}
> {{data = {}}
> {{    "text": "a" * 1}}
> {{}}}
> {{# Create a temporary file}}
> {{with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:}}
> {{    # Write the JSON content to the temporary file}}
> {{    tmp_file.write(json.dumps(data) + "\n")}}
> {{    tmp_file_path = tmp_file.name}}
> {{    # Load the JSON file into a PySpark DataFrame}}
> {{    df = spark.read.json(tmp_file_path)}}
> {{    # Print the schema}}
> {{    print(df)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48689:

Attachment: image-2024-06-22-15-33-38-833.png

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file including a very long string, 
> spark will incorrectly make it a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {{import json}}
> {{import tempfile}}
> {{from pyspark.sql import SparkSession}}
> {{# Create a Spark session}}
> {{spark = (SparkSession.builder}}
> {{    .appName("PySpark JSON Example")}}
> {{    .getOrCreate()}}
> {{)}}
> {{# Define the JSON content}}
> {{data = {}}
> {{    "text": "a" * 1}}
> {{}}}
> {{# Create a temporary file}}
> {{with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:}}
> {{    # Write the JSON content to the temporary file}}
> {{    tmp_file.write(json.dumps(data) + "\n")}}
> {{    tmp_file_path = tmp_file.name}}
> {{    # Load the JSON file into a PySpark DataFrame}}
> {{    df = spark.read.json(tmp_file_path)}}
> {{    # Print the schema}}
> {{    print(df)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856910#comment-17856910
 ] 

Wei Guo commented on SPARK-48689:
-

This is controlled by the option `maxStringLen`, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file including a very long string, 
> spark will incorrectly make it a corrupted record even though the format is 
> correct. Here is a minimal example with PySpark:
> {{import json}}
> {{import tempfile}}
> {{from pyspark.sql import SparkSession}}
> {{# Create a Spark session}}
> {{spark = (SparkSession.builder}}
> {{    .appName("PySpark JSON Example")}}
> {{    .getOrCreate()}}
> {{)}}
> {{# Define the JSON content}}
> {{data = {}}
> {{    "text": "a" * 1}}
> {{}}}
> {{# Create a temporary file}}
> {{with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:}}
> {{    # Write the JSON content to the temporary file}}
> {{    tmp_file.write(json.dumps(data) + "\n")}}
> {{    tmp_file_path = tmp_file.name}}
> {{    # Load the JSON file into a PySpark DataFrame}}
> {{    df = spark.read.json(tmp_file_path)}}
> {{    # Print the schema}}
> {{    print(df)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48671) Add test cases for Hex.hex

2024-06-20 Thread Wei Guo (Jira)
Wei Guo created SPARK-48671:
---

 Summary: Add test cases for Hex.hex
 Key: SPARK-48671
 URL: https://issues.apache.org/jira/browse/SPARK-48671
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856122#comment-17856122
 ] 

Wei Guo commented on SPARK-48660:
-

I am working on this, and thank you for the recommendation [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856122#comment-17856122
 ] 

Wei Guo edited comment on SPARK-48660 at 6/19/24 4:18 AM:
--

I am working on this, and thank you for the recommendation [~LuciferYang].


was (Author: wayne guo):
I am working on this and thank your for recommendation [~yangjie01] .

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48661) Upgrade RoaringBitmap to 1.1.0

2024-06-18 Thread Wei Guo (Jira)
Wei Guo created SPARK-48661:
---

 Summary: Upgrade RoaringBitmap to 1.1.0
 Key: SPARK-48661
 URL: https://issues.apache.org/jira/browse/SPARK-48661
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error  (was:  
Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error
> -
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217   (was:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
LEGACY_ERROR_TEMP[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
job type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
job type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:



>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48635) Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)
Wei Guo created SPARK-48635:
---

 Summary:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 
 Key: SPARK-48635
 URL: https://issues.apache.org/jira/browse/SPARK-48635
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48614:

Description: (was: There are some deprecated classes and methods in 
commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream)

> Cleanup deprecated api usage related to kafka-clients
> -
>
> Key: SPARK-48614
> URL: https://issues.apache.org/jira/browse/SPARK-48614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)
Wei Guo created SPARK-48614:
---

 Summary: Cleanup deprecated api usage related to kafka-clients
 Key: SPARK-48614
 URL: https://issues.apache.org/jira/browse/SPARK-48614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
Assignee: Wei Guo
 Fix For: 4.0.0


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)
Wei Guo created SPARK-48604:
---

 Summary: Replace deprecated classes and methods of arrow-vector 
called in Spark
 Key: SPARK-48604
 URL: https://issues.apache.org/jira/browse/SPARK-48604
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48604:

Description: 
There are some deprecated classes and methods in arrow-vector called in Spark, 
we need to replace them:
 * ArrowType.Decimal(precision, scale)

  was:
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream


> Replace deprecated classes and methods of arrow-vector called in Spark
> --
>
> Key: SPARK-48604
> URL: https://issues.apache.org/jira/browse/SPARK-48604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in arrow-vector called in 
> Spark, we need to replace them:
>  * ArrowType.Decimal(precision, scale)
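
As a hedged illustration of the replacement, assuming the three-argument Decimal constructor that takes an explicit bit width (128 preserves the previous default):
{code:scala}
import org.apache.arrow.vector.types.pojo.ArrowType

// Deprecated form: new ArrowType.Decimal(precision, scale)
// Replacement passes the bit width explicitly; the values here are examples only.
val decimalType = new ArrowType.Decimal(38, 18, 128)
{code}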



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of commons-io called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of commons-io called in 
Spark  (was: Replace deprecated classes and methods of `commons-io` called in 
Spark)

> Replace deprecated classes and methods of commons-io called in Spark
> 
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  *   `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  * `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of `commons-io` called in 
Spark  (was: Replace deprecated `FileUtils#writeStringToFile` )

> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> Method `writeStringToFile(final File file, final String data)` in class 
> `FileUtils` is deprecated, use `writeStringToFile(final File file, final 
> String data, final Charset charset)` instead in UDFXPathUtilSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`

2024-06-11 Thread Wei Guo (Jira)
Wei Guo created SPARK-48583:
---

 Summary: Replace deprecated `FileUtils#writeStringToFile` 
 Key: SPARK-48583
 URL: https://issues.apache.org/jira/browse/SPARK-48583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.
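
As a hedged illustration of the replacement (the file path and content below are made-up examples, not taken from the suite):
{code:scala}
import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.commons.io.FileUtils

val file = new File("/tmp/example.txt")   // example path only

// Deprecated two-argument form:
//   FileUtils.writeStringToFile(file, "some data")
// Non-deprecated replacement with an explicit charset:
FileUtils.writeStringToFile(file, "some data", StandardCharsets.UTF_8)
{code}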



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48581) Upgrade dropwizard metrics to 4.2.26

2024-06-10 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48581:

Summary: Upgrade dropwizard metrics to 4.2.26  (was: Upgrade dropwizard 
metrics 4.2.26)

> Upgrade dropwizard metrics to 4.2.26
> 
>
> Key: SPARK-48581
> URL: https://issues.apache.org/jira/browse/SPARK-48581
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48581) Upgrade dropwizard metrics 4.2.26

2024-06-10 Thread Wei Guo (Jira)
Wei Guo created SPARK-48581:
---

 Summary: Upgrade dropwizard metrics 4.2.26
 Key: SPARK-48581
 URL: https://issues.apache.org/jira/browse/SPARK-48581
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48539) Upgrade docker-java to 3.3.6

2024-06-05 Thread Wei Guo (Jira)
Wei Guo created SPARK-48539:
---

 Summary: Upgrade docker-java to 3.3.6
 Key: SPARK-48539
 URL: https://issues.apache.org/jira/browse/SPARK-48539
 Project: Spark
  Issue Type: Improvement
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Wei Guo
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850238#comment-17850238
 ] 

Wei Guo commented on SPARK-47259:
-

Update `_LEGACY_ERROR_TEMP_32[08-14]` to `_LEGACY_ERROR_TEMP_32[09-14]`, because 
`_LEGACY_ERROR_TEMP_3208` is not related to interval errors.

> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-47259:

Description: 
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]

  was:
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[08-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]


> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13

2023-02-12 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687573#comment-17687573
 ] 

Wei Guo commented on SPARK-40678:
-

Fixed by PR 38154 https://github.com/apache/spark/pull/38154

> JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
> 
>
> Key: SPARK-40678
> URL: https://issues.apache.org/jira/browse/SPARK-40678
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.2.0
>Reporter: Cédric Chantepie
>Priority: Major
>
> In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly 
> supported with JSON; e.g.
> {noformat}
> import org.apache.spark.sql.SparkSession
> case class KeyValue(key: String, value: Array[Byte])
> val spark = 
> SparkSession.builder().master("local[1]").appName("test").getOrCreate()
> import spark.implicits._
> val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF()
> df.foreach(r => println(r.json))
> {noformat}
> Expected:
> {noformat}
> [{foo, bar}]
> {noformat}
> Encountered:
> {noformat}
> java.lang.IllegalArgumentException: Failed to convert value 
> ArraySeq([foo,[B@dcdb68f]) (class of class 
> scala.collection.mutable.ArraySeq$ofRef}) with the type of 
> ArrayType(Seq(StructField(key,StringType,false), 
> StructField(value,BinaryType,false)),true) to JSON.
>   at org.apache.spark.sql.Row.toJson$1(Row.scala:604)
>   at org.apache.spark.sql.Row.jsonValue(Row.scala:613)
>   at org.apache.spark.sql.Row.jsonValue$(Row.scala:552)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166)
>   at org.apache.spark.sql.Row.json(Row.scala:535)
>   at org.apache.spark.sql.Row.json$(Row.scala:535)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39348) Create table in overwrite mode fails when interrupted

2023-02-09 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686390#comment-17686390
 ] 

Wei Guo commented on SPARK-39348:
-

After PR [https://github.com/apache/spark/pull/26559], it has been removed.
 * Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set {{true}} to 
{{spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation}} restores 
the previous behavior. This option will be removed in Spark 3.0.

> Create table in overwrite mode fails when interrupted
> -
>
> Key: SPARK-39348
> URL: https://issues.apache.org/jira/browse/SPARK-39348
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1
>Reporter: Max
>Priority: Major
>
> When you attempt to rerun an Apache Spark write operation by cancelling the 
> currently running job, the following error occurs:
> {code:java}
> Error: org.apache.spark.sql.AnalysisException: Cannot create the managed 
> table('`testdb`.` testtable`').
> The associated location 
> ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already 
> exists.;{code}
> This problem can occur if:
>  * The cluster is terminated while a write operation is in progress.
>  * A temporary network issue occurs.
>  * The job is interrupted.
> You can reproduce the problem by following these steps:
> 1. Create a DataFrame:
> {code:java}
> val df = spark.range(1000){code}
> 2. Write the DataFrame to a location in overwrite mode:
> {code:java}
> df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable"){code}
> 3. Cancel the command while it is executing.
> 4. Re-run the {{write}} command.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0. The 
upgrade also brought in a new univocity-parsers feature that quotes values of the 
first column that start with the comment character. This was a breaking change for 
downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u' to keep the previous behavior, 
because of the newly added `isCommentSet` check logic, as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.
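
A minimal sketch of the proposed direction (illustrative only; `parameters` stands in for the user-supplied CSV options map, the null character is written out in full here, and the univocity classes are the ones the CSV data source already uses):
{code:scala}
import com.univocity.parsers.csv.CsvWriterSettings

// Stand-in for the user-supplied CSV options:
val parameters: Map[String, String] = Map("comment" -> "\u0000")

// Track whether the user set "comment" explicitly instead of only comparing
// the char against the '\u0000' default.
val userSetComment: Boolean = parameters.contains("comment")
val comment: Char = parameters.get("comment").map(_.charAt(0)).getOrElse('\u0000')

def asWriterSettings: CsvWriterSettings = {
  val settings = new CsvWriterSettings()
  // Pass the comment char through whenever it was set explicitly, even when it
  // is '\u0000' (which users set to restore the pre-3.0 behavior).
  if (userSetComment) {
    settings.getFormat.setComment(comment)
  }
  settings
}
{code}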

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|default behavior: the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|default behavior: the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior as flows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0. The 
upgrade also brought in a new univocity-parsers feature that quotes values of the 
first column that start with the comment character. This was a breaking change for 
downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u' to keep the previous behavior, 
because of the newly added `isCommentSet` check logic, as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

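A minimal, self-contained sketch of the reader-side rows (5 and 6) in the table above; the path is illustrative, and the middle value is assumed to be the NUL-prefixed string used in the examples:
{code:java}
// Sketch only: reproduce the reader-side cases from the table above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-comment-read-demo").getOrCreate()
import spark.implicits._

val path = "/tmp/csv_comment_read_demo"
Seq("#abc", "\u0000def", "xyz").toDF().write.mode("overwrite").text(path)

// Row 5: explicit comment character '#', so lines starting with '#' are skipped on read.
spark.read.option("comment", "#").csv(path).show()

// Row 6: no comment option, so "#abc" is read back as ordinary data.
spark.read.csv(path).show()
{code}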
 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 


> Pass the comment option throu

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it explicitly in the CSV dataSource.

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it explicitly in the CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.
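A minimal sketch of the idea behind this change, assuming a CSVOptions-like holder of the parsed options; the class and parameter names here are illustrative, not the actual Spark implementation:
{code:java}
// Sketch only: treat the comment option as "set" when the user passed it
// explicitly, instead of comparing against the '\u0000' default value.
class CsvOptionsSketch(parameters: Map[String, String]) {
  val comment: Char =
    parameters.get("comment").flatMap(_.headOption).getOrElse('\u0000')

  // True only when the user set the option, even if they set it to '\u0000'.
  val isCommentSet: Boolean = parameters.contains("comment")
}
{code}
With isCommentSet derived from the presence of the option rather than from its value, setComment is only forwarded to univocity when the user actually asked for a comment character.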


> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicitly in CSV dataSource  (was: Pass the comment option through to 
univocity if users set it explicity in CSV dataSource)

> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it explicitly in the CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to Univocity if users set it 
explicitly in CSV dataSource.
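A writer-side sketch of the quoting behavior described in this issue, assuming a spark-shell style session (spark and spark.implicits._ in scope); the output path is illustrative and the commented output lines restate what the attached screenshots show:
{code:java}
// Sketch only: write a value that starts with the default comment character '#'.
Seq(("#abc", 1)).toDF("s", "n").write.mode("overwrite").csv("/tmp/csv_comment_write_demo")

// Spark 2.4 and before, the written line is:  #abc,1
// Spark 3.0 and after, the written line is:   "#abc",1
spark.read.text("/tmp/csv_comment_write_demo").show(truncate = false)
{code}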


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to Univocity if users set it explicitly in the CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to Univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
 {code}
 until univocity-parsers releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  xx
  if (isCommentSet) {
format.setComment(comment)
  }
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the `isCommentSet` check logic shown below:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
}
 {code}
 until univocity-parsers releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   xx
>   if (isCommentSet) {
> format.setComment(comment)
>   }
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config to restore the legacy behavior until univocity-parsers releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicity in CSV dataSource  (was: Add a legacy config for restoring writer's 
comment option behavior in CSV dataSource)

> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Fix Version/s: 3.5.0
   (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Affects Version/s: 3.3.0
   3.2.0
   3.1.0
   3.4.0

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Target Version/s: 3.5.0  (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-10-083.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] in univocity-parsers, but it seems it will take a long time for it to be merged.
 
For Spark, it's better to add a legacy config to restore the legacy behavior until univocity-parsers releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is shown as:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-01-596.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)
Wei Guo created SPARK-42335:
---

 Summary: Add a legacy config for restoring writer's comment option 
behavior in CSV dataSource
 Key: SPARK-42335
 URL: https://issues.apache.org/jira/browse/SPARK-42335
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


In PR [https://github.com/apache/spark/pull/29516], univocity-parsers used by the CSV dataSource was upgraded from 2.8.3 to 2.9.0 in order to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle the whole row as input.
 
For the code:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is shown as:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] in univocity-parsers, but it seems it will take a long time for it to be merged.
 
For Spark, it's better to add a legacy config to restore the legacy behavior until univocity-parsers releases a new version.
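A minimal sketch of what such a legacy flag could look like, in SQLConf-style ConfigBuilder syntax; the config name, doc text and version are illustrative and not the actual change:
{code:java}
// Sketch only: a hypothetical legacy flag, written inside SQLConf where
// buildConf(...) is available.
val LEGACY_CSV_WRITER_COMMENT_HANDLING =
  buildConf("spark.sql.legacy.csv.writerCommentHandling.enabled")
    .internal()
    .doc("When true, restore the pre-3.0 behavior of not quoting values that " +
      "start with the comment character when writing CSV files.")
    .version("3.4.0")
    .booleanConf
    .createWithDefault(false)
{code}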



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-02-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Description: 
When a binary column is written into CSV files, the actual content of this column is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei/Desktop/binary_csv")
{code}
The CSV file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary column is saved as a table with the CSV file format, the table can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it's better to change binary to an unsupported dataType in the CSV format, both for datasource v1 (CSVFileFormat) and v2 (CSVTable).
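As a side note, a workaround sketch for anyone who needs to round-trip binary data through CSV today is to encode it explicitly, assuming a spark-shell style session (spark and spark.implicits._ in scope); the path and column names are illustrative:
{code:java}
// Sketch only: base64-encode binary columns before writing CSV, decode after reading.
import org.apache.spark.sql.functions.{base64, col, unbase64}

val data = Seq((1, Array[Byte](1, 2))).toDF("id", "payload")
data.withColumn("payload", base64(col("payload")))
  .write.mode("overwrite").csv("/tmp/binary_as_base64_csv")

val restored = spark.read
  .schema("id INT, payload STRING")
  .csv("/tmp/binary_as_base64_csv")
  .withColumn("payload", unbase64(col("payload")))
{code}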

  was:
When a binary colunm is written into csv files, actual content of this colunm 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}

The csv file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
can't be read back successfully.

{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
binaryDataTable").show()
{code}

!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it' better to change binary to unsupported dataType in csv format, 
both for datasource v1(CSVFileFormat) and v2(CSVTable).


> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = Seq(Array[Byte](1,2)).toDF
> df.write.csv("/Users/guowei/Desktop/binary_csv")
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, Array[Byte](1,2))).toDF
> df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
> binaryDataTable").show()
> {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-01-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-42252:
---

 Summary: Deprecate spark.shuffle.unsafe.file.output.buffer and add 
a new config
 Key: SPARK-42252
 URL: https://issues.apache.org/jira/browse/SPARK-42252
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


After SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007], a new 
shuffle writer API was introduced. All shuffle writers (BypassMergeSortShuffleWriter, 
SortShuffleWriter, UnsafeShuffleWriter) are now based on 
LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
spark.shuffle.unsafe.file.output.buffer, which is now used in 
LocalDiskShuffleMapOutputWriter, was previously only used in UnsafeShuffleWriter, so 
its name no longer matches its scope.
 
It's better to deprecate it and add a new config with a more suitable name.
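
A minimal sketch, assuming a hypothetical new key name (spark.shuffle.localDisk.file.output.buffer) and reusing the old default, of how the replacement could be declared while keeping the old key recognized as a deprecated alternative, following the ConfigBuilder pattern in org.apache.spark.internal.config:

{code:java}
// Hypothetical sketch, not the merged change: the new key name and wording are
// assumptions. withAlternative keeps the old key working while it is deprecated.
private[spark] val SHUFFLE_LOCAL_DISK_FILE_OUTPUT_BUFFER_SIZE =
  ConfigBuilder("spark.shuffle.localDisk.file.output.buffer")
    .doc("The file system buffer size (in KiB) used by " +
      "LocalDiskShuffleMapOutputWriter, which all local disk shuffle writers " +
      "now go through.")
    .version("3.4.0")
    .withAlternative("spark.shuffle.unsafe.file.output.buffer")
    .bytesConf(ByteUnit.KiB)
    .createWithDefaultString("32k")
{code}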



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681978#comment-17681978
 ] 

Wei Guo commented on SPARK-42237:
-

a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary column is written into csv files, the actual content of this column 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = Seq(Array[Byte](1,2)).toDF
> df.write.csv("/Users/guowei19/Desktop/binary_csv")
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary column is saved as a table with the csv file format, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, Array[Byte](1,2))).toDF
> df.write.format("csv").saveAsTable("binaryDataTable")
> spark.sql("select * from binaryDataTable").show()
> {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it's better to change binary to an unsupported dataType in the csv format, 
> both for datasource v1 (CSVFileFormat) and v2 (CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42237 ]


Wei Guo deleted comment on SPARK-42237:
-

was (Author: wayne guo):
a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary column is written into csv files, the actual content of this column 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = Seq(Array[Byte](1,2)).toDF
> df.write.csv("/Users/guowei19/Desktop/binary_csv")
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary column is saved as a table with the csv file format, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, Array[Byte](1,2))).toDF
> df.write.format("csv").saveAsTable("binaryDataTable")
> spark.sql("select * from binaryDataTable").show()
> {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it's better to change binary to an unsupported dataType in the csv format, 
> both for datasource v1 (CSVFileFormat) and v2 (CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


