[jira] [Resolved] (SPARK-48175) Store collation information in metadata and not in type for SER/DE
[ https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48175. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46280 [https://github.com/apache/spark/pull/46280] > Store collation information in metadata and not in type for SER/DE > -- > > Key: SPARK-48175 > URL: https://issues.apache.org/jira/browse/SPARK-48175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Changing serialization and deserialization of collated strings so that the > collation information is put in the metadata of the enclosing struct field - > and then read back from there during parsing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48175) Store collation information in metadata and not in type for SER/DE
[ https://issues.apache.org/jira/browse/SPARK-48175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48175: --- Assignee: Stefan Kandic > Store collation information in metadata and not in type for SER/DE > -- > > Key: SPARK-48175 > URL: https://issues.apache.org/jira/browse/SPARK-48175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Assignee: Stefan Kandic >Priority: Major > Labels: pull-request-available > > Changing serialization and deserialization of collated strings so that the > collation information is put in the metadata of the enclosing struct field - > and then read back from there during parsing.
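The change described in SPARK-48175 can be pictured with a small Python sketch. This is not Spark's actual implementation: the metadata key "__COLLATIONS" and both helper functions are hypothetical names chosen only to illustrate the idea of keeping the type string plain and carrying the collation in the enclosing struct field's metadata, where it can be read back during parsing.

```python
# Illustrative sketch only: a schema-JSON-like field represented as a plain
# dict. The "__COLLATIONS" metadata key and both helpers are hypothetical.

def serialize_field(name, collation=None):
    """Serialize a string field; the type stays plain 'string' and the
    collation (if any) goes into the enclosing field's metadata."""
    field = {"name": name, "type": "string", "nullable": True, "metadata": {}}
    if collation is not None:
        field["metadata"]["__COLLATIONS"] = {name: collation}
    return field

def deserialize_field(field):
    """Read the collation back out of the metadata during parsing."""
    collation = field["metadata"].get("__COLLATIONS", {}).get(field["name"])
    return field["name"], field["type"], collation

city = serialize_field("city", collation="UNICODE_CI")
assert city["type"] == "string"  # type string is unchanged for old readers
assert deserialize_field(city) == ("city", "string", "UNICODE_CI")
```

One benefit of this shape is that consumers unaware of collations still see an ordinary string type and simply ignore the extra metadata entry.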
[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
[ https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847502#comment-17847502 ] chesterxu commented on SPARK-48329: --- Hey there, may I ask if I can take a try at this? > Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true > - > > Key: SPARK-48329 > URL: https://issues.apache.org/jira/browse/SPARK-48329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Priority: Minor > > The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' > has proven valuable for most use cases. We should take advantage of the 4.0 > release and change the value to true.
[jira] [Updated] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
[ https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48329: --- Labels: pull-request-available (was: ) > Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true > - > > Key: SPARK-48329 > URL: https://issues.apache.org/jira/browse/SPARK-48329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Priority: Minor > Labels: pull-request-available > > The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' > has proven valuable for most use cases. We should take advantage of the 4.0 > release and change the value to true.
[jira] [Comment Edited] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
[ https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847502#comment-17847502 ] chesterxu edited comment on SPARK-48329 at 5/18/24 11:26 AM: - Hey there~ If no one has been assigned, please check this PR: https://github.com/apache/spark/pull/46650 was (Author: JIRAUSER302535): Hey there, may I ask can I have a try for this? > Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true > - > > Key: SPARK-48329 > URL: https://issues.apache.org/jira/browse/SPARK-48329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Priority: Minor > Labels: pull-request-available > > The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' > has proven valuable for most use cases. We should take advantage of the 4.0 > release and change the value to true.
[jira] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
[ https://issues.apache.org/jira/browse/SPARK-48329 ] chesterxu deleted comment on SPARK-48329: --- was (Author: JIRAUSER302535): Hey there~ If no one was assigned, please check this PR: https://github.com/apache/spark/pull/46650 > Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true > - > > Key: SPARK-48329 > URL: https://issues.apache.org/jira/browse/SPARK-48329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Priority: Minor > Labels: pull-request-available > > The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' > has proven valuable for most use cases. We should take advantage of the 4.0 > release and change the value to true.
[jira] [Commented] (SPARK-48329) Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true
[ https://issues.apache.org/jira/browse/SPARK-48329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847569#comment-17847569 ] chesterxu commented on SPARK-48329: --- Hey there~ If no one has been assigned, please check this PR: [https://github.com/apache/spark/pull/46650] > Default spark.sql.sources.v2.bucketing.pushPartValues.enabled to true > - > > Key: SPARK-48329 > URL: https://issues.apache.org/jira/browse/SPARK-48329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Szehon Ho >Priority: Minor > Labels: pull-request-available > > The SPJ feature flag 'spark.sql.sources.v2.bucketing.pushPartValues.enabled' > has proven valuable for most use cases. We should take advantage of the 4.0 > release and change the value to true.
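What the proposed default flip in SPARK-48329 means for users can be sketched in a few lines of Python. The conf key below is the real Spark configuration key; the dict-based conf handling and the `effective_value` helper are purely illustrative, not Spark's actual config machinery:

```python
# Sketch of the proposed default change: code that relied on the previous
# default (false) must now opt out explicitly. Conf lookup is simulated
# with a plain dict; effective_value() is a hypothetical helper.

SPJ_PUSH_PART_VALUES = "spark.sql.sources.v2.bucketing.pushPartValues.enabled"

def effective_value(user_conf, default=True):
    """An explicit user setting wins; otherwise the (new) default applies."""
    raw = user_conf.get(SPJ_PUSH_PART_VALUES)
    return raw.lower() == "true" if raw is not None else default

assert effective_value({}) is True                                # proposed 4.0 default
assert effective_value({SPJ_PUSH_PART_VALUES: "false"}) is False  # explicit opt-out
```

The flag itself controls whether partition values are pushed down for storage-partitioned joins; flipping the default only changes behavior for queries that never set it explicitly.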
[jira] [Created] (SPARK-48330) Fix the python data source timeout issue for large trigger interval
Chaoqin Li created SPARK-48330: -- Summary: Fix the python data source timeout issue for large trigger interval Key: SPARK-48330 URL: https://issues.apache.org/jira/browse/SPARK-48330 Project: Spark Issue Type: Task Components: PySpark, SS Affects Versions: 4.0.0 Reporter: Chaoqin Li Currently we run a long-running python worker process for the python streaming source and sink to perform planning, commit and abort on the driver side. Testing indicates that the current implementation causes a connection timeout error when the streaming query has a large trigger interval. For the python streaming source, keep the long-running worker architecture but set the socket timeout to infinity to avoid the timeout error. For the python streaming sink, since a StreamingWrite is also created per microbatch on the Scala side, a long-running worker cannot be attached to a StreamingWrite instance. Therefore we abandon the long-running worker architecture, simply call commit() or abort() and exit the worker, and allow Spark to reuse workers for us.
[jira] [Updated] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval
[ https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaoqin Li updated SPARK-48330: --- Summary: Fix the python streaming data source timeout issue for large trigger interval (was: Fix the python data source timeout issue for large trigger interval) > Fix the python streaming data source timeout issue for large trigger interval > - > > Key: SPARK-48330 > URL: https://issues.apache.org/jira/browse/SPARK-48330 > Project: Spark > Issue Type: Task > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > > Currently we run a long-running python worker process for the python > streaming source and sink to perform planning, commit and abort on the driver > side. Testing indicates that the current implementation causes a connection > timeout error when the streaming query has a large trigger interval. > For the python streaming source, keep the long-running worker architecture > but set the socket timeout to infinity to avoid the timeout error. > For the python streaming sink, since a StreamingWrite is also created per > microbatch on the Scala side, a long-running worker cannot be attached to a > StreamingWrite instance. Therefore we abandon the long-running worker > architecture, simply call commit() or abort() and exit the worker, and allow > Spark to reuse workers for us.
[jira] [Updated] (SPARK-48330) Fix the python streaming data source timeout issue for large trigger interval
[ https://issues.apache.org/jira/browse/SPARK-48330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48330: --- Labels: pull-request-available (was: ) > Fix the python streaming data source timeout issue for large trigger interval > - > > Key: SPARK-48330 > URL: https://issues.apache.org/jira/browse/SPARK-48330 > Project: Spark > Issue Type: Task > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Currently we run a long-running python worker process for the python > streaming source and sink to perform planning, commit and abort on the driver > side. Testing indicates that the current implementation causes a connection > timeout error when the streaming query has a large trigger interval. > For the python streaming source, keep the long-running worker architecture > but set the socket timeout to infinity to avoid the timeout error. > For the python streaming sink, since a StreamingWrite is also created per > microbatch on the Scala side, a long-running worker cannot be attached to a > StreamingWrite instance. Therefore we abandon the long-running worker > architecture, simply call commit() or abort() and exit the worker, and allow > Spark to reuse workers for us.
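The source-side part of the fix in SPARK-48330 boils down to one socket setting. A minimal sketch, assuming a long-running worker that blocks on a socket between microbatches (the helper name is hypothetical; `settimeout(None)` is the standard-library call that puts a socket in blocking mode with no timeout):

```python
# Sketch of "set the socket timeout to be infinity": with settimeout(None),
# a worker waiting between microbatches blocks indefinitely instead of
# raising socket.timeout when the trigger interval is large.

import socket

def configure_streaming_source_socket(sock: socket.socket) -> None:
    """Hypothetical helper: disable the timeout on a worker's socket so a
    large trigger interval cannot trip a connection timeout."""
    sock.settimeout(None)  # None == blocking mode, never time out

# Demonstrate on a local socket pair standing in for the driver<->worker link.
driver_end, worker_end = socket.socketpair()
configure_streaming_source_socket(worker_end)
assert worker_end.gettimeout() is None  # safe for arbitrarily long waits
driver_end.close()
worker_end.close()
```

By contrast, a finite timeout such as `sock.settimeout(30)` would raise `socket.timeout` on any blocking call that waits longer than 30 seconds, which is exactly the failure mode the ticket describes for large trigger intervals.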
[jira] [Created] (SPARK-48332) Upgrade `jdbc` related test dependencies
BingKun Pan created SPARK-48332: --- Summary: Upgrade `jdbc` related test dependencies Key: SPARK-48332 URL: https://issues.apache.org/jira/browse/SPARK-48332 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Updated] (SPARK-48332) Upgrade `jdbc` related test dependencies
[ https://issues.apache.org/jira/browse/SPARK-48332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48332: --- Labels: pull-request-available (was: ) > Upgrade `jdbc` related test dependencies > > > Key: SPARK-48332 > URL: https://issues.apache.org/jira/browse/SPARK-48332 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available >