[jira] [Resolved] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48175.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46280
[https://github.com/apache/spark/pull/46280]

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Change the serialization and deserialization of collated strings so that the 
> collation information is stored in the metadata of the enclosing struct field 
> and read back from there during parsing.
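The idea above can be sketched in plain Python. This is a conceptual illustration only, not Spark's actual serializer: the field type stays a plain "string", while the collation travels in the field's metadata map (the metadata key name used here is an assumption for illustration).

```python
import json

# Illustrative sketch: keep the type a plain "string" and record the
# collation in the enclosing field's metadata instead of in the type.
def serialize_field(name, collation=None):
    field = {"name": name, "type": "string", "nullable": True, "metadata": {}}
    if collation is not None:
        # Collation lives in metadata, not encoded into the type itself.
        field["metadata"]["collation"] = collation
    return json.dumps(field)

def deserialize_field(payload):
    field = json.loads(payload)
    # During parsing, the collation is read back from metadata.
    return field["name"], field["type"], field["metadata"].get("collation")

payload = serialize_field("title", collation="UNICODE_CI")
print(deserialize_field(payload))  # ('title', 'string', 'UNICODE_CI')
```

Keeping the type unchanged means readers that ignore collations still see an ordinary string field, which is the compatibility benefit of this approach.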



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48175) Store collation information in metadata and not in type for SER/DE

2024-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48175:
---

Assignee: Stefan Kandic

> Store collation information in metadata and not in type for SER/DE
> --
>
> Key: SPARK-48175
> URL: https://issues.apache.org/jira/browse/SPARK-48175
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Change the serialization and deserialization of collated strings so that the 
> collation information is stored in the metadata of the enclosing struct field 
> and read back from there during parsing.






[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-18 Thread chesterxu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847502#comment-17847502
 ] 

chesterxu commented on SPARK-48329:
---

Hey there, may I give this a try?

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases. We should take advantage of the 4.0 
> release and change the default value to true.
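Until the default changes, the flag can be enabled explicitly. A minimal example of passing it at submit time (the application name is a placeholder):

```shell
# Enable SPJ partial-value pushdown explicitly for a single job:
spark-submit \
  --conf spark.sql.sources.v2.bucketing.pushPartValues.enabled=true \
  app.py
```

The same key can instead be set in spark-defaults.conf or via spark.conf.set() in a session, whichever fits the deployment.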






[jira] [Updated] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48329:
---
Labels: pull-request-available  (was: )

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases. We should take advantage of the 4.0 
> release and change the default value to true.






[jira] [Comment Edited] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-18 Thread chesterxu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847502#comment-17847502
 ] 

chesterxu edited comment on SPARK-48329 at 5/18/24 11:26 AM:
-

Hey there~ 

If no one is assigned yet, please check this PR:
https://github.com/apache/spark/pull/46650


was (Author: JIRAUSER302535):
Hey there, may I give this a try?

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases. We should take advantage of the 4.0 
> release and change the default value to true.






[jira] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-18 Thread chesterxu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-48329 ]


chesterxu deleted comment on SPARK-48329:
---

was (Author: JIRAUSER302535):
Hey there~ 

If no one is assigned yet, please check this PR:
https://github.com/apache/spark/pull/46650

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases. We should take advantage of the 4.0 
> release and change the default value to true.






[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true

2024-05-18 Thread chesterxu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847569#comment-17847569
 ] 

chesterxu commented on SPARK-48329:
---

Hey there~ 

If no one is assigned yet, please check this PR:
[https://github.com/apache/spark/pull/46650]

> Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
> -
>
> Key: SPARK-48329
> URL: https://issues.apache.org/jira/browse/SPARK-48329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Szehon Ho
>Priority: Minor
>  Labels: pull-request-available
>
> The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' 
> has proven valuable for most use cases. We should take advantage of the 4.0 
> release and change the default value to true.






[jira] [Created] (SPARK-48330) Fix the python data source timeout issue for large trigger interval

2024-05-18 Thread Chaoqin Li (Jira)
Chaoqin Li created SPARK-48330:
--

 Summary: Fix the python data source timeout issue for large 
trigger interval
 Key: SPARK-48330
 URL: https://issues.apache.org/jira/browse/SPARK-48330
 Project: Spark
  Issue Type: Task
  Components: PySpark, SS
Affects Versions: 4.0.0
Reporter: Chaoqin Li


Currently we run a long-running Python worker process for the Python streaming 
source and sink to perform planning, commit, and abort on the driver side. 
Testing indicates that the current implementation causes a connection timeout 
error when the streaming query has a large trigger interval.

For the Python streaming source, keep the long-running worker architecture but 
set the socket timeout to infinity to avoid the timeout error.

For the Python streaming sink, since a StreamingWrite is also created per 
microbatch on the Scala side, a long-running worker cannot be attached to a 
StreamingWrite instance. Therefore we abandon the long-running worker 
architecture, simply call commit() or abort(), then exit the worker and allow 
Spark to reuse workers for us.
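The source-side fix amounts to disabling the socket timeout on the worker connection. A minimal sketch using the standard library (the function name is illustrative, not Spark's actual worker code): in Python's socket API, a timeout of None means "block forever", so an arbitrarily large trigger interval cannot trip a timeout between microbatches.

```python
import socket

def open_worker_socket() -> socket.socket:
    """Create a socket for a long-running worker connection.

    settimeout(None) puts the socket in blocking mode with no deadline,
    so it will wait indefinitely between microbatches.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(None)  # None disables the timeout entirely
    return sock

sock = open_worker_socket()
print(sock.gettimeout())  # None -> the socket never times out
sock.close()
```

By contrast, a sink-side worker that exits after each commit()/abort() never holds a socket across trigger intervals, which is why no timeout tweak is needed there.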






[jira] [Updated] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval

2024-05-18 Thread Chaoqin Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoqin Li updated SPARK-48330:
---
Summary: Fix the python streaming data source timeout issue for large 
trigger interval  (was: Fix the python data source timeout issue for large 
trigger interval)

> Fix the python streaming data source timeout issue for large trigger interval
> -
>
> Key: SPARK-48330
> URL: https://issues.apache.org/jira/browse/SPARK-48330
> Project: Spark
>  Issue Type: Task
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>
> Currently we run a long-running Python worker process for the Python streaming 
> source and sink to perform planning, commit, and abort on the driver side. 
> Testing indicates that the current implementation causes a connection timeout 
> error when the streaming query has a large trigger interval.
> For the Python streaming source, keep the long-running worker architecture but 
> set the socket timeout to infinity to avoid the timeout error.
> For the Python streaming sink, since a StreamingWrite is also created per 
> microbatch on the Scala side, a long-running worker cannot be attached to a 
> StreamingWrite instance. Therefore we abandon the long-running worker 
> architecture, simply call commit() or abort(), then exit the worker and allow 
> Spark to reuse workers for us.






[jira] [Updated] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval

2024-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48330:
---
Labels: pull-request-available  (was: )

> Fix the python streaming data source timeout issue for large trigger interval
> -
>
> Key: SPARK-48330
> URL: https://issues.apache.org/jira/browse/SPARK-48330
> Project: Spark
>  Issue Type: Task
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Currently we run a long-running Python worker process for the Python streaming 
> source and sink to perform planning, commit, and abort on the driver side. 
> Testing indicates that the current implementation causes a connection timeout 
> error when the streaming query has a large trigger interval.
> For the Python streaming source, keep the long-running worker architecture but 
> set the socket timeout to infinity to avoid the timeout error.
> For the Python streaming sink, since a StreamingWrite is also created per 
> microbatch on the Scala side, a long-running worker cannot be attached to a 
> StreamingWrite instance. Therefore we abandon the long-running worker 
> architecture, simply call commit() or abort(), then exit the worker and allow 
> Spark to reuse workers for us.






[jira] [Created] (SPARK-48332) Upgrade `jdbc` related test dependencies

2024-05-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48332:
---

 Summary: Upgrade `jdbc` related test dependencies
 Key: SPARK-48332
 URL: https://issues.apache.org/jira/browse/SPARK-48332
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-48332) Upgrade `jdbc` related test dependencies

2024-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48332:
---
Labels: pull-request-available  (was: )

> Upgrade `jdbc` related test dependencies
> 
>
> Key: SPARK-48332
> URL: https://issues.apache.org/jira/browse/SPARK-48332
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>



