[jira] [Commented] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925
[ https://issues.apache.org/jira/browse/SPARK-42105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678114#comment-17678114 ]

Jungtaek Lim commented on SPARK-42105:
--------------------------------------

Let me set this to blocker so that I won't miss it during the release phase.

> Document work (Release note & Guide doc) for SPARK-40925
> --------------------------------------------------------
>
> Key: SPARK-42105
> URL: https://issues.apache.org/jira/browse/SPARK-42105
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 3.4.0
> Reporter: Jungtaek Lim
> Priority: Blocker
>
> SPARK-40925 fixed the bug that introduced the major limitation described in the guide doc:
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark
>
> Many of the limitations described in the guide doc are resolved by SPARK-40925. We even unblocked the functionality via SPARK-40940, so the doc is out of sync with the codebase.
>
> We should update the guide doc to describe the new limitation, and also write a release note for SPARK-40925.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925
[ https://issues.apache.org/jira/browse/SPARK-42105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-42105:
---------------------------------
Priority: Blocker (was: Major)

> Document work (Release note & Guide doc) for SPARK-40925
> --------------------------------------------------------
>
> Key: SPARK-42105
> URL: https://issues.apache.org/jira/browse/SPARK-42105
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 3.4.0
> Reporter: Jungtaek Lim
> Priority: Blocker
>
> SPARK-40925 fixed the bug that introduced the major limitation described in the guide doc:
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark
>
> Many of the limitations described in the guide doc are resolved by SPARK-40925. We even unblocked the functionality via SPARK-40940, so the doc is out of sync with the codebase.
>
> We should update the guide doc to describe the new limitation, and also write a release note for SPARK-40925.
[jira] [Created] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925
Jungtaek Lim created SPARK-42105:
---------------------------------

Summary: Document work (Release note & Guide doc) for SPARK-40925
Key: SPARK-42105
URL: https://issues.apache.org/jira/browse/SPARK-42105
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim

SPARK-40925 fixed the bug that introduced the major limitation described in the guide doc:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark

Many of the limitations described in the guide doc are resolved by SPARK-40925. We even unblocked the functionality via SPARK-40940, so the doc is out of sync with the codebase.

We should update the guide doc to describe the new limitation, and also write a release note for SPARK-40925.
[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
[ https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reassigned SPARK-42100:
--------------------------------------
Assignee: Yang Jie

> Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
> -------------------------------------------------------------------------------
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
>
> {code:bash}
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core \
>   -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none \
>   -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> {code}
>
> No test fails, but there are error messages in the log:
>
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> {code}
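The NPE above comes from calling a generated protobuf builder setter with a null value, since protobuf builders reject null. As an illustration only (not Spark's actual fix, and with hypothetical names), a minimal Python sketch of the defensive pattern: guard optional fields before handing them to a builder that refuses `None`.

```python
class UIDataBuilder:
    """Toy stand-in for a generated protobuf builder: setters reject None."""

    def __init__(self):
        self.fields = {}

    def set_description(self, value):
        if value is None:
            # Mirrors the NullPointerException thrown by setDescription(null).
            raise TypeError("description must not be None")
        self.fields["description"] = value
        return self


def serialize(ui_data: dict) -> dict:
    """Serialize a dict, skipping optional fields that are absent or None."""
    builder = UIDataBuilder()
    desc = ui_data.get("description")
    if desc is not None:  # the guard: only set the field when it has a value
        builder.set_description(desc)
    return builder.fields


print(serialize({"description": None}))  # {} instead of an exception
```

The same guard applied around the real `setDescription` call would let the listener keep running when a query has no description.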
[jira] [Resolved] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
[ https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-42100.
------------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39623
https://github.com/apache/spark/pull/39623

> Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
> -------------------------------------------------------------------------------
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.4.0
>
> {code:bash}
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core \
>   -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none \
>   -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> {code}
>
> No test fails, but there are error messages in the log:
>
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> {code}
[jira] [Assigned] (SPARK-42080) Add guideline for PySpark errors.
[ https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42080:
------------------------------------
Assignee: (was: Apache Spark)

> Add guideline for PySpark errors.
> ---------------------------------
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Add a guideline for PySpark errors.
[jira] [Commented] (SPARK-42080) Add guideline for PySpark errors.
[ https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678098#comment-17678098 ]

Apache Spark commented on SPARK-42080:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39639

> Add guideline for PySpark errors.
> ---------------------------------
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Add a guideline for PySpark errors.
[jira] [Assigned] (SPARK-42080) Add guideline for PySpark errors.
[ https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42080:
------------------------------------
Assignee: Apache Spark

> Add guideline for PySpark errors.
> ---------------------------------
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Apache Spark
> Priority: Major
>
> Add a guideline for PySpark errors.
[jira] [Assigned] (SPARK-42078) Migrate errors thrown by JVM into PySpark Exception.
[ https://issues.apache.org/jira/browse/SPARK-42078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42078:
------------------------------------
Assignee: Haejoon Lee

> Migrate errors thrown by JVM into PySpark Exception.
> ----------------------------------------------------
>
> Key: SPARK-42078
> URL: https://issues.apache.org/jira/browse/SPARK-42078
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
>
> We should migrate all exceptions raised in PySpark into PySparkException.
[jira] [Resolved] (SPARK-42078) Migrate errors thrown by JVM into PySpark Exception.
[ https://issues.apache.org/jira/browse/SPARK-42078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42078.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 39591
https://github.com/apache/spark/pull/39591

> Migrate errors thrown by JVM into PySpark Exception.
> ----------------------------------------------------
>
> Key: SPARK-42078
> URL: https://issues.apache.org/jira/browse/SPARK-42078
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.4.0
>
> We should migrate all exceptions raised in PySpark into PySparkException.
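As a rough illustration of the direction of this migration (class and template names here are assumed for the sketch, not necessarily PySpark's actual API), an error-class-based exception might carry a stable error class plus message parameters instead of a free-form string:

```python
class PySparkException(Exception):
    """Sketch of an exception keyed by an error class with message parameters."""

    # Hypothetical templates; a real framework would load these from a JSON file.
    TEMPLATES = {
        "NOT_A_COLUMN": "Argument `{arg_name}` should be a Column, got {arg_type}.",
    }

    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        message = self.TEMPLATES[error_class].format(**message_parameters)
        super().__init__(f"[{error_class}] {message}")


# Raising a classified error instead of a bare TypeError/ValueError lets
# callers match on `error_class` rather than parsing message text.
try:
    raise PySparkException("NOT_A_COLUMN", {"arg_name": "col", "arg_type": "int"})
except PySparkException as e:
    print(e.error_class)  # NOT_A_COLUMN
```

The payoff is that tests and user code can assert on `e.error_class`, which stays stable even when the human-readable message wording changes.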
[jira] [Updated] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
[ https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrico Minack updated SPARK-40885:
----------------------------------
Affects Version/s: 3.4.0

> Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time
> ------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.3.0, 3.2.2, 3.4.0
> Reporter: ming95
> Priority: Major
> Attachments: 1666494504884.jpg
>
> When writing data with dynamic partitions and sorting by both partition and data fields, Spark drops the sort on the data fields.
>
> SQL to reproduce:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
>
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'test_table';
>
> -- generate test data
> insert into test_table partition(dt=20221011)
> select 10,"15" union all select 1,"10" union all select 5,"50" union all select 20,"2" union all select 30,"14";
>
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
>
> -- this SQL sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort on `name` does not take effect
> insert overwrite table sort_table partition(dt)
> select id,name,dt from test_table order by name,dt;
> {code}
>
> The Sort operator in the DAG has only one sort field, while the SQL specifies two (see the attached screenshot).
>
> Related issue: https://issues.apache.org/jira/browse/SPARK-40588
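To make the expected semantics of the reproduction concrete, a small Python sketch (reusing the sample rows from the `test_table` insert above) of the order that `ORDER BY name, dt` should produce. Note `name` is a string column, so the comparison is lexicographic:

```python
# (id, name, dt) rows matching the test data inserted above; all share one dt.
rows = [(10, "15", "20221011"), (1, "10", "20221011"), (5, "50", "20221011"),
        (20, "2", "20221011"), (30, "14", "20221011")]

# ORDER BY name, dt must sort on BOTH keys; the reported bug is that the plan
# keeps only the dynamic-partition key (dt) and silently drops name.
ordered = sorted(rows, key=lambda r: (r[1], r[2]))
print([r[1] for r in ordered])  # ['10', '14', '15', '2', '50']
```

Since `name` is a string, "2" sorts after "15"; any output not in this order within a partition would confirm the dropped sort.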
[jira] [Commented] (SPARK-41485) Unify the environment variable of *_PROTOC_EXEC_PATH
[ https://issues.apache.org/jira/browse/SPARK-41485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678088#comment-17678088 ]

Apache Spark commented on SPARK-41485:
--------------------------------------

User 'WolverineJiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39036

> Unify the environment variable of *_PROTOC_EXEC_PATH
> -----------------------------------------------------
>
> Key: SPARK-41485
> URL: https://issues.apache.org/jira/browse/SPARK-41485
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Protobuf, Spark Core
> Affects Versions: 3.4.0
> Reporter: Haonan Jiang
> Priority: Minor
> Fix For: 3.4.0
>
> At present, there are 3 similar *_PROTOC_EXEC_PATH environment variables, but they all use the same protobuf version. Since they are consistent at compile time, we can unify the environment variable names to simplify things.
[jira] [Assigned] (SPARK-41777) Add Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41777:
------------------------------------
Assignee: Apache Spark

> Add Integration Tests
> ---------------------
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 3.4.0
> Reporter: Rithwik Ediga Lakhamsani
> Assignee: Apache Spark
> Priority: Major
>
> This requires us to add PyTorch as a testing dependency.
[jira] [Commented] (SPARK-41598) Migrate the errors from `pyspark/sql/functions.py` into error class.
[ https://issues.apache.org/jira/browse/SPARK-41598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678082#comment-17678082 ]

Apache Spark commented on SPARK-41598:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Migrate the errors from `pyspark/sql/functions.py` into error class.
> --------------------------------------------------------------------
>
> Key: SPARK-41598
> URL: https://issues.apache.org/jira/browse/SPARK-41598
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Migrate the existing errors into the new PySpark error framework.
[jira] [Commented] (SPARK-41777) Add Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678083#comment-17678083 ]

Apache Spark commented on SPARK-41777:
--------------------------------------

User 'rithwik-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39637

> Add Integration Tests
> ---------------------
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 3.4.0
> Reporter: Rithwik Ediga Lakhamsani
> Priority: Major
>
> This requires us to add PyTorch as a testing dependency.
[jira] [Assigned] (SPARK-41777) Add Integration Tests
[ https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41777:
------------------------------------
Assignee: (was: Apache Spark)

> Add Integration Tests
> ---------------------
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 3.4.0
> Reporter: Rithwik Ediga Lakhamsani
> Priority: Major
>
> This requires us to add PyTorch as a testing dependency.
[jira] [Commented] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`
[ https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678080#comment-17678080 ]

Apache Spark commented on SPARK-42082:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Introduce `PySparkValueError` and `PySparkTypeError`
> ----------------------------------------------------
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> We should migrate all Python built-in Exceptions into PySparkException.
[jira] [Assigned] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`
[ https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42082:
------------------------------------
Assignee: (was: Apache Spark)

> Introduce `PySparkValueError` and `PySparkTypeError`
> ----------------------------------------------------
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> We should migrate all Python built-in Exceptions into PySparkException.
[jira] [Assigned] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`
[ https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42082:
------------------------------------
Assignee: Apache Spark

> Introduce `PySparkValueError` and `PySparkTypeError`
> ----------------------------------------------------
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Apache Spark
> Priority: Major
>
> We should migrate all Python built-in Exceptions into PySparkException.
[jira] [Updated] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`
[ https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-42082:
--------------------------------
Summary: Introduce `PySparkValueError` and `PySparkTypeError` (was: Add PySparkValueError and PySparkTypeError)

> Introduce `PySparkValueError` and `PySparkTypeError`
> ----------------------------------------------------
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> We should migrate all Python built-in Exceptions into PySparkException.
[jira] [Updated] (SPARK-42082) Add PySparkValueError and PySparkTypeError
[ https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haejoon Lee updated SPARK-42082:
--------------------------------
Summary: Add PySparkValueError and PySparkTypeError (was: Migrate ValueError into PySparkValueError and manage the functions.py)

> Add PySparkValueError and PySparkTypeError
> ------------------------------------------
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> We should migrate all Python built-in Exceptions into PySparkException.
[jira] [Resolved] (SPARK-41485) Unify the environment variable of *_PROTOC_EXEC_PATH
[ https://issues.apache.org/jira/browse/SPARK-41485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie resolved SPARK-41485.
------------------------------
Fix Version/s: 3.4.0
Target Version/s: 3.4.0
Resolution: Fixed

https://github.com/apache/spark/pull/39036 solved this issue but used a wrong JIRA ID, so I am changing this to Fixed.

> Unify the environment variable of *_PROTOC_EXEC_PATH
> -----------------------------------------------------
>
> Key: SPARK-41485
> URL: https://issues.apache.org/jira/browse/SPARK-41485
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Protobuf, Spark Core
> Affects Versions: 3.4.0
> Reporter: Haonan Jiang
> Priority: Minor
> Fix For: 3.4.0
>
> At present, there are 3 similar *_PROTOC_EXEC_PATH environment variables, but they all use the same protobuf version. Since they are consistent at compile time, we can unify the environment variable names to simplify things.
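The unification described above can be sketched as a single lookup with fallbacks. A minimal Python sketch follows; the variable names and fallback order here are illustrative assumptions, not necessarily what the PR actually uses:

```python
import os

# Hypothetical legacy component-specific names; a unified build would consult
# one canonical variable first, then fall back to these for compatibility.
LEGACY_NAMES = ("CONNECT_PROTOC_EXEC_PATH", "PROTOBUF_PROTOC_EXEC_PATH")


def resolve_protoc(env=None):
    """Return the protoc executable path from the first variable that is set."""
    env = os.environ if env is None else env
    for name in ("SPARK_PROTOC_EXEC_PATH",) + LEGACY_NAMES:
        path = env.get(name)
        if path:
            return path
    return "protoc"  # fall back to whatever is on PATH


print(resolve_protoc({"PROTOBUF_PROTOC_EXEC_PATH": "/opt/protoc/bin/protoc"}))
```

Since all three variables point at the same protobuf version anyway, collapsing them to one canonical name removes duplicate configuration without changing behavior.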
[jira] [Updated] (SPARK-42029) Distribution build for Spark Connect does not work with Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-42029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42029:
---------------------------------
Parent: (was: SPARK-39375)
Issue Type: Bug (was: Sub-task)

> Distribution build for Spark Connect does not work with Spark Shell
> -------------------------------------------------------------------
>
> Key: SPARK-42029
> URL: https://issues.apache.org/jira/browse/SPARK-42029
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Martin Grund
> Assignee: Martin Grund
> Priority: Major
> Fix For: 3.4.0
[jira] [Updated] (SPARK-42029) Distribution build for Spark Connect does not work with Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-42029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42029: - Parent: SPARK-41286 Issue Type: Sub-task (was: Bug) > Distribution build for Spark Connect does not work with Spark Shell > --- > > Key: SPARK-42029 > URL: https://issues.apache.org/jira/browse/SPARK-42029 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41727) ClassCastException when config spark.sql.hive.metastore* properties under jdk17
[ https://issues.apache.org/jira/browse/SPARK-41727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41727: - Priority: Major (was: Critical) > ClassCastException when config spark.sql.hive.metastore* properties under > jdk17 > --- > > Key: SPARK-41727 > URL: https://issues.apache.org/jira/browse/SPARK-41727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 > Environment: Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0 >Reporter: kevinshin >Priority: Major > Attachments: hms-init-error.txt > > > Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0 > when configuring the spark.sql.hive.metastore* properties to use > hive.metastore.version 3.1.2: > *spark.sql.hive.metastore.jars /data/soft/spark3/standalone-metastore/** > *spark.sql.hive.metastore.version 3.1.2* > then starting spark-shell with master = local[*] under JDK 17 and trying > to select a Hive table fails with the following error: > 13:44:52.428 [main] ERROR > org.apache.hadoop.hive.metastore.utils.MetaStoreUtils - Got exception: > java.lang.ClassCastException class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to > class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module > java.base of loader 'bootstrap') > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.resolveUris(HiveMetaStoreClient.java:262) > ~[hive-standalone-metastore-3.1.2.jar:3.1.2] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-41589) PyTorch Distributor
[ https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-41589: -- > PyTorch Distributor > --- > > Key: SPARK-41589 > URL: https://issues.apache.org/jira/browse/SPARK-41589 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > Fix For: 3.4.0 > > > This is a project to make it easier for PySpark users to distribute PyTorch > code using PySpark. The corresponding [Design > Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing] > can give more context. This was a project determined by the Databricks ML > Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] > for more context. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41589) PyTorch Distributor
[ https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41589: - Fix Version/s: (was: 3.4.0) > PyTorch Distributor > --- > > Key: SPARK-41589 > URL: https://issues.apache.org/jira/browse/SPARK-41589 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > This is a project to make it easier for PySpark users to distribute PyTorch > code using PySpark. The corresponding [Design > Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing] > can give more context. This was a project determined by the Databricks ML > Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] > for more context. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41589) PyTorch Distributor
[ https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-41589: - Priority: Critical (was: Major) > PyTorch Distributor > --- > > Key: SPARK-41589 > URL: https://issues.apache.org/jira/browse/SPARK-41589 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Critical > > This is a project to make it easier for PySpark users to distribute PyTorch > code using PySpark. The corresponding [Design > Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing] > can give more context. This was a project determined by the Databricks ML > Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] > for more context. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42061. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39630 [https://github.com/apache/spark/pull/39630] > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42061: --- Assignee: (was: Wenchen Fan) > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42061: --- Assignee: Wenchen Fan > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678044#comment-17678044 ] Erik Ordentlich edited comment on SPARK-24941 at 1/18/23 1:33 AM: -- Anyone still planning to work on this? cc [~mengxr] [~leewyang] was (Author: JIRAUSER287642): Anyone still planning to work on this? cc [~mengxr] > Add RDDBarrier.coalesce() function > -- > > Key: SPARK-24941 > URL: https://issues.apache.org/jira/browse/SPARK-24941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r204917245 > The number of partitions from the input data can be unexpectedly large, e.g. > if you do > {code} > sc.textFile(...).barrier().mapPartitions() > {code} > The number of input partitions is based on the HDFS input splits. We shall > provide a way in RDDBarrier to enable users to specify the number of tasks in > a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678044#comment-17678044 ] Erik Ordentlich commented on SPARK-24941: - Anyone still planning to work on this? cc [~mengxr] > Add RDDBarrier.coalesce() function > -- > > Key: SPARK-24941 > URL: https://issues.apache.org/jira/browse/SPARK-24941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r204917245 > The number of partitions from the input data can be unexpectedly large, e.g. > if you do > {code} > sc.textFile(...).barrier().mapPartitions() > {code} > The number of input partitions is based on the HDFS input splits. We shall > provide a way in RDDBarrier to enable users to specify the number of tasks in > a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
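The proposed API's effect — packing many HDFS input splits into a smaller, fixed number of barrier-stage tasks — can be modeled with a toy Python sketch. This is purely illustrative: `RDDBarrier.coalesce` does not exist yet, the function name below is invented, and round-robin packing is only one possible strategy.

```python
def coalesce_barrier_partitions(input_splits, num_partitions):
    """Toy model of the proposed RDDBarrier.coalesce(numPartitions):
    pack an arbitrary number of input splits into exactly
    `num_partitions` barrier tasks, round-robin."""
    buckets = [[] for _ in range(num_partitions)]
    for i, split in enumerate(input_splits):
        buckets[i % num_partitions].append(split)
    return buckets
```

With 10 HDFS splits and `num_partitions=4`, the barrier stage would run 4 tasks covering all 10 splits instead of 10 tasks.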
[jira] [Resolved] (SPARK-41596) Document the new feature "Async Progress Tracking" to Structured Streaming guide doc
[ https://issues.apache.org/jira/browse/SPARK-41596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-41596. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39538 [https://github.com/apache/spark/pull/39538] > Document the new feature "Async Progress Tracking" to Structured Streaming > guide doc > > > Key: SPARK-41596 > URL: https://issues.apache.org/jira/browse/SPARK-41596 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Boyang Jerry Peng >Priority: Blocker > Fix For: 3.4.0 > > > Given that we merged the new SPIP feature SPARK-39591, we have to document > the new feature in the Structured Streaming guide doc so that end users can > refer to the doc and start experimenting with the feature. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41596) Document the new feature "Async Progress Tracking" to Structured Streaming guide doc
[ https://issues.apache.org/jira/browse/SPARK-41596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-41596: Assignee: Boyang Jerry Peng > Document the new feature "Async Progress Tracking" to Structured Streaming > guide doc > > > Key: SPARK-41596 > URL: https://issues.apache.org/jira/browse/SPARK-41596 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Boyang Jerry Peng >Priority: Blocker > > Given that we merged the new SPIP feature SPARK-39591, we have to document > the new feature in the Structured Streaming guide doc so that end users can > refer to the doc and start experimenting with the feature. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41822) Setup Scala/JVM Client Connection
[ https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678038#comment-17678038 ] Apache Spark commented on SPARK-41822: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/39635 > Setup Scala/JVM Client Connection > - > > Key: SPARK-41822 > URL: https://issues.apache.org/jira/browse/SPARK-41822 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.4.0 > > > Set up the gRPC connection for the Scala/JVM client to enable communication > with the Spark Connect server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor
[ https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678032#comment-17678032 ] Apache Spark commented on SPARK-42090: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39634 > Introduce sasl retry count in RetryingBlockTransferor > - > > Key: SPARK-42090 > URL: https://issues.apache.org/jira/browse/SPARK-42090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.4.0 > > > Previously a boolean variable, saslTimeoutSeen, was used in > RetryingBlockTransferor. However, the boolean variable wouldn't cover the > following scenario: > 1. SaslTimeoutException > 2. IOException > 3. SaslTimeoutException > 4. IOException > Even though IOException at #2 is retried (resulting in increment of > retryCount), the retryCount would be cleared at step #4. > Since the intention of saslTimeoutSeen is to undo the increment due to > retrying SaslTimeoutException, we should keep a counter for > SaslTimeoutException retries and subtract the value of this counter from > retryCount. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor
[ https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678030#comment-17678030 ] Apache Spark commented on SPARK-42090: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39632 > Introduce sasl retry count in RetryingBlockTransferor > - > > Key: SPARK-42090 > URL: https://issues.apache.org/jira/browse/SPARK-42090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.4.0 > > > Previously a boolean variable, saslTimeoutSeen, was used in > RetryingBlockTransferor. However, the boolean variable wouldn't cover the > following scenario: > 1. SaslTimeoutException > 2. IOException > 3. SaslTimeoutException > 4. IOException > Even though IOException at #2 is retried (resulting in increment of > retryCount), the retryCount would be cleared at step #4. > Since the intention of saslTimeoutSeen is to undo the increment due to > retrying SaslTimeoutException, we should keep a counter for > SaslTimeoutException retries and subtract the value of this counter from > retryCount. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678028#comment-17678028 ] Apache Spark commented on SPARK-41415: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39634 > SASL Request Retries > > > Key: SPARK-41415 > URL: https://issues.apache.org/jira/browse/SPARK-41415 > Project: Spark > Issue Type: Task > Components: Shuffle >Affects Versions: 3.2.4 >Reporter: Aravind Patnam >Assignee: Aravind Patnam >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor
[ https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678029#comment-17678029 ] Apache Spark commented on SPARK-42090: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39634 > Introduce sasl retry count in RetryingBlockTransferor > - > > Key: SPARK-42090 > URL: https://issues.apache.org/jira/browse/SPARK-42090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.4.0 > > > Previously a boolean variable, saslTimeoutSeen, was used in > RetryingBlockTransferor. However, the boolean variable wouldn't cover the > following scenario: > 1. SaslTimeoutException > 2. IOException > 3. SaslTimeoutException > 4. IOException > Even though IOException at #2 is retried (resulting in increment of > retryCount), the retryCount would be cleared at step #4. > Since the intention of saslTimeoutSeen is to undo the increment due to > retrying SaslTimeoutException, we should keep a counter for > SaslTimeoutException retries and subtract the value of this counter from > retryCount. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
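The counter-based bookkeeping described in SPARK-42090 can be sketched as a small toy model. The actual change lives in Spark's Java `RetryingBlockTransferor`; the Python class and method names below are illustrative only, not Spark's API. The point it shows: with a separate SASL-retry counter (instead of a boolean `saslTimeoutSeen`), interleaved `IOException` retries stay charged against the retry limit rather than being cleared at step #4.

```python
class RetryTracker:
    """Toy model of SPARK-42090's retry bookkeeping: keep a counter of
    SASL-timeout retries and subtract it from the total retry count,
    instead of resetting the count via a boolean flag."""

    def __init__(self, max_retries):
        self.max_retries = max_retries
        self.retry_count = 0       # total retries performed
        self.sasl_retry_count = 0  # retries caused by SaslTimeoutException

    def record_failure(self, is_sasl_timeout):
        self.retry_count += 1
        if is_sasl_timeout:
            self.sasl_retry_count += 1

    def effective_retry_count(self):
        # SASL timeouts are "forgiven": only non-SASL retries count
        # toward the limit.
        return self.retry_count - self.sasl_retry_count

    def should_retry(self):
        return self.effective_retry_count() < self.max_retries
```

Replaying the scenario from the description (SaslTimeoutException, IOException, SaslTimeoutException, IOException) leaves an effective count of 2, so the two IOException retries are still remembered.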
[jira] [Commented] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678027#comment-17678027 ] Apache Spark commented on SPARK-41415: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39634 > SASL Request Retries > > > Key: SPARK-41415 > URL: https://issues.apache.org/jira/browse/SPARK-41415 > Project: Spark > Issue Type: Task > Components: Shuffle >Affects Versions: 3.2.4 >Reporter: Aravind Patnam >Assignee: Aravind Patnam >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41415) SASL Request Retries
[ https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678026#comment-17678026 ] Apache Spark commented on SPARK-41415: -- User 'akpatnam25' has created a pull request for this issue: https://github.com/apache/spark/pull/39632 > SASL Request Retries > > > Key: SPARK-41415 > URL: https://issues.apache.org/jira/browse/SPARK-41415 > Project: Spark > Issue Type: Task > Components: Shuffle >Affects Versions: 3.2.4 >Reporter: Aravind Patnam >Assignee: Aravind Patnam >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42038: Assignee: (was: Apache Spark) > SPJ: Support partially clustered distribution > - > > Key: SPARK-42038 > URL: https://issues.apache.org/jira/browse/SPARK-42038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > > Currently the storage-partitioned join requires both sides to be fully > clustered on the partition values, that is, all input partitions reported by > a V2 data source shall be grouped by partition values before the join > happens. This could lead to data skew issues if a particular partition value > is associated with a large number of rows. > > To combat this, we can introduce the idea of partially clustered > distribution, which means that only one side of the join is required to be > fully clustered, while the other side is not. This allows Spark to increase > the parallelism of the join and avoid data skew. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42038: Assignee: Apache Spark > SPJ: Support partially clustered distribution > - > > Key: SPARK-42038 > URL: https://issues.apache.org/jira/browse/SPARK-42038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Currently the storage-partitioned join requires both sides to be fully > clustered on the partition values, that is, all input partitions reported by > a V2 data source shall be grouped by partition values before the join > happens. This could lead to data skew issues if a particular partition value > is associated with a large number of rows. > > To combat this, we can introduce the idea of partially clustered > distribution, which means that only one side of the join is required to be > fully clustered, while the other side is not. This allows Spark to increase > the parallelism of the join and avoid data skew. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42038) SPJ: Support partially clustered distribution
[ https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678025#comment-17678025 ] Apache Spark commented on SPARK-42038: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/39633 > SPJ: Support partially clustered distribution > - > > Key: SPARK-42038 > URL: https://issues.apache.org/jira/browse/SPARK-42038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Priority: Major > > Currently the storage-partitioned join requires both sides to be fully > clustered on the partition values, that is, all input partitions reported by > a V2 data source shall be grouped by partition values before the join > happens. This could lead to data skew issues if a particular partition value > is associated with a large number of rows. > > To combat this, we can introduce the idea of partially clustered > distribution, which means that only one side of the join is required to be > fully clustered, while the other side is not. This allows Spark to increase > the parallelism of the join and avoid data skew. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
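The contrast between fully and partially clustered distribution in the SPARK-42038 description can be illustrated with a small Python sketch. This is a toy model, not Spark's actual storage-partitioned join planner; the function names are invented for illustration.

```python
from collections import defaultdict

def fully_clustered(partitions):
    """Group input partitions by partition value: one join task per
    distinct value, so a skewed value concentrates rows in one task."""
    grouped = defaultdict(list)
    for value, part in partitions:
        grouped[value].append(part)
    return list(grouped.items())

def partially_clustered(left, right):
    """Only the right side is grouped; each left-side partition becomes
    its own task joined against the matching right-side group, which
    increases parallelism for skewed partition values."""
    right_grouped = dict(fully_clustered(right))
    return [(value, part, right_grouped.get(value, []))
            for value, part in left]
```

With a skewed value "a" split across two left partitions, full clustering yields 2 tasks while partial clustering yields 3, spreading the skewed value over more tasks.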
[jira] [Assigned] (SPARK-42103) Add Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42103: Assignee: (was: Apache Spark) > Add Instrumentation > --- > > Key: SPARK-42103 > URL: https://issues.apache.org/jira/browse/SPARK-42103 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > Adding instrumentation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42103) Add Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42103: Assignee: Apache Spark > Add Instrumentation > --- > > Key: SPARK-42103 > URL: https://issues.apache.org/jira/browse/SPARK-42103 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Assignee: Apache Spark >Priority: Major > > Adding instrumentation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42103) Add Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677995#comment-17677995 ] Apache Spark commented on SPARK-42103: -- User 'rithwik-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39629 > Add Instrumentation > --- > > Key: SPARK-42103 > URL: https://issues.apache.org/jira/browse/SPARK-42103 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > Adding instrumentation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677994#comment-17677994 ] Apache Spark commented on SPARK-42061: -- User 'lzlfred' has created a pull request for this issue: https://github.com/apache/spark/pull/39630 > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42061: Assignee: (was: Apache Spark) > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42061) Mark Expressions that have state as stateful
[ https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42061: Assignee: Apache Spark > Mark Expressions that have state as stateful > - > > Key: SPARK-42061 > URL: https://issues.apache.org/jira/browse/SPARK-42061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning
[ https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rithwik Ediga Lakhamsani updated SPARK-41776: - Description: This requires us to just call train() on each spark task separately without much preprocessing or postprocessing because PyTorch Lightning handles that by itself. Update: This was resolved by using `torch.distributed.run` was:This requires us to just call train() on each spark task separately without much preprocessing or postprocessing because PyTorch Lightning handles that by itself. > Implement support for PyTorch Lightning > --- > > Key: SPARK-41776 > URL: https://issues.apache.org/jira/browse/SPARK-41776 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > This requires us to just call train() on each spark task separately without > much preprocessing or postprocessing because PyTorch Lightning handles that > by itself. > > Update: This was resolved by using `torch.distributed.run` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning
[ https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rithwik Ediga Lakhamsani updated SPARK-41776: - Description: This requires us to just call train() on each spark task separately without much preprocessing or postprocessing because PyTorch Lightning handles that by itself. (was: This requires us to just call train() on each spark task separately without much preprocessing or postprocessing because PyTorch Lightning handles that by itself. Update: This was resolved by using `torch.distributed.run`) > Implement support for PyTorch Lightning > --- > > Key: SPARK-41776 > URL: https://issues.apache.org/jira/browse/SPARK-41776 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > This requires us to just call train() on each spark task separately without > much preprocessing or postprocessing because PyTorch Lightning handles that > by itself. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning
[ https://issues.apache.org/jira/browse/SPARK-41915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rithwik Ediga Lakhamsani resolved SPARK-41915. -- Resolution: Fixed This is already resolved within https://issues.apache.org/jira/browse/SPARK-41590. > Change API so that the user doesn't have to explicitly set pytorch-lightning > > > Key: SPARK-41915 > URL: https://issues.apache.org/jira/browse/SPARK-41915 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Priority: Major > > Removing the `framework` parameter from the API and having cloudpickle > automatically determine whether the user code has a dependency on PyTorch > Lightning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40687) Support data masking built-in Function 'mask'
[ https://issues.apache.org/jira/browse/SPARK-40687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677988#comment-17677988 ] Vinod KC edited comment on SPARK-40687 at 1/17/23 9:29 PM: --- Note: In the udf 'mask', using -1 as the ignore parameter for a String type argument is not standard. Please refer to SPARK-42070; it changes the default value of the argument of the *mask* udf from -1 to NULL was (Author: vinodkc): Note: Please refer [SPARK-42070|https://issues.apache.org/jira/browse/SPARK-42070] , it changes the default value of argument of Mask udf from -1 to NULL > Support data masking built-in Function 'mask' > -- > > Key: SPARK-40687 > URL: https://issues.apache.org/jira/browse/SPARK-40687 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Minor > Fix For: 3.4.0 > > > Support data masking built-in Function *mask* > Return a masked version of str. By default, upper case letters should be > converted to "X", lower case letters should be converted to "x" and numbers > should be converted to "n". For example, mask("abcd-EFGH-8765-4321") results > in xxxx-XXXX-nnnn-nnnn. Should be able to override the characters used in the > mask by supplying additional arguments: the second argument controls the mask > character for upper case letters, the third argument for lower case letters > and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", > "U", "l", "#") should result in llll-UUUU-####-#### > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40687) Support data masking built-in Function 'mask'
[ https://issues.apache.org/jira/browse/SPARK-40687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677988#comment-17677988 ] Vinod KC commented on SPARK-40687: -- Note: Please refer to [SPARK-42070|https://issues.apache.org/jira/browse/SPARK-42070]; it changes the default value of the argument of the *mask* udf from -1 to NULL > Support data masking built-in Function 'mask' > -- > > Key: SPARK-40687 > URL: https://issues.apache.org/jira/browse/SPARK-40687 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Minor > Fix For: 3.4.0 > > > Support data masking built-in Function *mask* > Return a masked version of str. By default, upper case letters should be > converted to "X", lower case letters should be converted to "x" and numbers > should be converted to "n". For example, mask("abcd-EFGH-8765-4321") results > in xxxx-XXXX-nnnn-nnnn. Should be able to override the characters used in the > mask by supplying additional arguments: the second argument controls the mask > character for upper case letters, the third argument for lower case letters > and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", > "U", "l", "#") should result in llll-UUUU-####-#### > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42070) Change the default value of argument of Mask udf from -1 to NULL
[ https://issues.apache.org/jira/browse/SPARK-42070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-42070: - Description: In the udf 'mask', using -1 as the ignore parameter for a String type argument is not standard; hence it is better to change the default value of the ignore argument from -1 to NULL. Note: SPARK-40687 has recently implemented the udf *mask*, which uses -1 as the default argument to ignore a masking option. As no Spark release has occurred since then, this change will not cause backward compatibility issues was: In the udf 'mask', using -1 as ignore parameter in String type argument is not a standard way, hence it is better to change the value of ignore argument from -1 to NULL. > Change the default value of argument of Mask udf from -1 to NULL > > > Key: SPARK-42070 > URL: https://issues.apache.org/jira/browse/SPARK-42070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > In the udf 'mask', using -1 as the ignore parameter for a String type argument is > not standard; hence it is better to change the default value of the ignore > argument from -1 to NULL. > Note: SPARK-40687 has recently implemented the udf *mask*, which uses -1 as > the default argument to ignore a masking option. As no Spark > release has occurred since then, this change will not cause backward > compatibility issues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
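The masking semantics described in the two tickets above can be sketched in plain Python (an illustration only, not Spark's implementation; here `None` plays the role of the proposed SQL NULL sentinel meaning "leave this character class unchanged"):

```python
def mask(s, upper="X", lower="x", digit="n"):
    """Mask the upper case, lower case and digit characters of s.

    Passing None for a class (the proposed NULL default instead of -1)
    leaves that class of characters unchanged; all other characters
    are always kept as-is.
    """
    out = []
    for ch in s:
        if ch.isupper():
            out.append(upper if upper is not None else ch)
        elif ch.islower():
            out.append(lower if lower is not None else ch)
        elif ch.isdigit():
            out.append(digit if digit is not None else ch)
        else:
            out.append(ch)  # punctuation, whitespace, etc. pass through
    return "".join(out)

print(mask("abcd-EFGH-8765-4321"))                 # xxxx-XXXX-nnnn-nnnn
print(mask("abcd-EFGH-8765-4321", "U", "l", "#"))  # llll-UUUU-####-####
print(mask("abcd-EFGH-8765-4321", upper=None))     # xxxx-EFGH-nnnn-nnnn
```

The last call shows why NULL reads more naturally than -1 as an ignore marker for a string-typed parameter: "no mask character" is simply the absent value.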
[jira] [Resolved] (SPARK-42039) SPJ: Remove Option in KeyGroupedPartitioning#partitionValues
[ https://issues.apache.org/jira/browse/SPARK-42039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42039. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39540 [https://github.com/apache/spark/pull/39540] > SPJ: Remove Option in KeyGroupedPartitioning#partitionValues > > > Key: SPARK-42039 > URL: https://issues.apache.org/jira/browse/SPARK-42039 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > Fix For: 3.4.0 > > > Currently {{KeyGroupedPartitioning#partitionValuesOpt}} is an > {{Option[Seq[InternalRow]]}}. This is unnecessary since it is always set. > This proposes to replace it with {{Seq[InternalRow]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42039) SPJ: Remove Option in KeyGroupedPartitioning#partitionValues
[ https://issues.apache.org/jira/browse/SPARK-42039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42039: - Assignee: Chao Sun > SPJ: Remove Option in KeyGroupedPartitioning#partitionValues > > > Key: SPARK-42039 > URL: https://issues.apache.org/jira/browse/SPARK-42039 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Minor > > Currently {{KeyGroupedPartitioning#partitionValuesOpt}} is an > {{Option[Seq[InternalRow]]}}. This is unnecessary since it is always set. > This proposes to replace it with {{Seq[InternalRow]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
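The refactor proposed above amounts to tightening a type whose optional wrapper is never actually absent. In Python terms it is the same shape as removing an `Optional` annotation that is in fact always set (an illustrative sketch, not Spark's Scala code; `Row` stands in for `InternalRow`):

```python
from dataclasses import dataclass
from typing import List, Optional

Row = tuple  # stand-in for InternalRow

@dataclass
class Before:
    # Option[Seq[InternalRow]] analogue: every caller must unwrap a
    # value that, per the ticket, is always present.
    partition_values_opt: Optional[List[Row]]

@dataclass
class After:
    # Seq[InternalRow] analogue: the type now states the invariant directly.
    partition_values: List[Row]

b = Before(partition_values_opt=[(1,), (2,)])
# Defensive unwrapping that the refactor makes unnecessary:
values = b.partition_values_opt if b.partition_values_opt is not None else []
a = After(partition_values=values)
print(len(a.partition_values))  # 2
```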
[jira] [Created] (SPARK-42104) Throw ExecutorDeadException in fetchBlocks when executor dead
Zhongwei Zhu created SPARK-42104: Summary: Throw ExecutorDeadException in fetchBlocks when executor dead Key: SPARK-42104 URL: https://issues.apache.org/jira/browse/SPARK-42104 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1 Reporter: Zhongwei Zhu

When fetchBlocks fails due to an IOException, an ExecutorDeadException is thrown if the executor is dead. There are other cases where a dead executor causes a TimeoutException or other exceptions.

{code:java}
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Waited 3 milliseconds (plus 143334 nanoseconds delay) for SettableFuture@624de392[status=PENDING]
	at org.sparkproject.guava.base.Throwables.propagate(Throwables.java:243)
	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:293)
	at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:113)
	at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:80)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:300)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:126)
	at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
	at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
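The behavior the ticket asks for can be sketched in Python pseudocode (names and structure are hypothetical; Spark's real logic lives in the Scala/Java shuffle client): any fetch failure should be surfaced as ExecutorDeadException once the executor is known to be dead, not only the IOException path.

```python
class ExecutorDeadException(Exception):
    """Raised when a block fetch fails because the serving executor is dead."""

def fetch_blocks(do_fetch, executor_is_alive):
    """do_fetch: callable performing the network fetch; it may raise anything
    (an IO error, a timeout, ...). executor_is_alive: liveness probe."""
    try:
        return do_fetch()
    except Exception as err:
        if not executor_is_alive():
            # Translate *any* failure into ExecutorDeadException when the
            # executor is gone, instead of only the IOException case.
            raise ExecutorDeadException("executor is dead") from err
        raise  # executor still alive: propagate the original error
```

With this shape, the TimeoutException in the stack trace above would also map to ExecutorDeadException whenever the liveness check fails.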
[jira] [Commented] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions
[ https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677905#comment-17677905 ] Apache Spark commented on SPARK-40264: -- User 'leewyang' has created a pull request for this issue: https://github.com/apache/spark/pull/39628 > Add helper function for DL model inference in pyspark.ml.functions > -- > > Key: SPARK-40264 > URL: https://issues.apache.org/jira/browse/SPARK-40264 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.2 >Reporter: Lee Yang >Priority: Minor > > Add a helper function to create a pandas_udf for inference on a given DL > model, where the user provides a predict function that is responsible for > loading the model and inferring on a batch of numpy inputs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
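The idea behind the helper can be sketched without Spark. This is a hedged illustration, not the actual `pyspark.ml.functions` API: `make_batched_predictor` and its signature are invented for the sketch. The real helper would wrap this shape in a pandas_udf operating on numpy batches; the sketch uses plain lists to show the two essential moves, loading the model lazily once per worker and running the user-supplied predict function over fixed-size batches.

```python
def make_batched_predictor(make_predict_fn, batch_size):
    predict = None

    def infer(inputs):
        nonlocal predict
        if predict is None:
            predict = make_predict_fn()  # user code loads the model here, once
        out = []
        for i in range(0, len(inputs), batch_size):
            out.extend(predict(inputs[i:i + batch_size]))  # batched inference
        return out

    return infer

# A trivial stand-in "model": doubles every value in a batch.
infer = make_batched_predictor(lambda: (lambda batch: [2 * x for x in batch]),
                               batch_size=4)
print(infer(list(range(10))))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```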
[jira] [Created] (SPARK-42103) Add Instrumentation
Rithwik Ediga Lakhamsani created SPARK-42103: Summary: Add Instrumentation Key: SPARK-42103 URL: https://issues.apache.org/jira/browse/SPARK-42103 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.4.0 Reporter: Rithwik Ediga Lakhamsani Adding instrumentation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42092) Upgrade RoaringBitmap to 0.9.38
[ https://issues.apache.org/jira/browse/SPARK-42092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42092: Assignee: Yang Jie > Upgrade RoaringBitmap to 0.9.38 > --- > > Key: SPARK-42092 > URL: https://issues.apache.org/jira/browse/SPARK-42092 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.36...0.9.38 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42092) Upgrade RoaringBitmap to 0.9.38
[ https://issues.apache.org/jira/browse/SPARK-42092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42092. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39613 [https://github.com/apache/spark/pull/39613] > Upgrade RoaringBitmap to 0.9.38 > --- > > Key: SPARK-42092 > URL: https://issues.apache.org/jira/browse/SPARK-42092 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.36...0.9.38 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42098) ResolveInlineTables should handle RuntimeReplaceable
[ https://issues.apache.org/jira/browse/SPARK-42098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677856#comment-17677856 ] Daniel commented on SPARK-42098: [~cloud_fan] [~srielau] I can help with this if you guys need help > ResolveInlineTables should handle RuntimeReplaceable > > > Key: SPARK-42098 > URL: https://issues.apache.org/jira/browse/SPARK-42098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Wenchen Fan >Priority: Major > > spark-sql> VALUES (try_divide(5, 0)); > cannot evaluate expression try_divide(5, 0) in inline table definition; line > 1 pos 8 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41993) Move RowEncoder to AgnosticEncoders
[ https://issues.apache.org/jira/browse/SPARK-41993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677823#comment-17677823 ] Apache Spark commented on SPARK-41993: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/39627 > Move RowEncoder to AgnosticEncoders > --- > > Key: SPARK-41993 > URL: https://issues.apache.org/jira/browse/SPARK-41993 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Move RowEncoder to the AgnosticEncoder framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42102) Using checkpoints in Spark Structured Streaming with the foreachBatch sink
Kai-Michael Roesner created SPARK-42102: --- Summary: Using checkpoints in Spark Structured Streaming with the foreachBatch sink Key: SPARK-42102 URL: https://issues.apache.org/jira/browse/SPARK-42102 Project: Spark Issue Type: Question Components: PySpark, Structured Streaming Affects Versions: 3.3.1 Reporter: Kai-Michael Roesner

I want to build a fault-tolerant, recoverable Spark job (using Structured Streaming in PySpark) that reads a data stream from Kafka and uses the [{{foreachBatch}}|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] sink to implement a stateful transformation before writing the resulting data to the actual sink. The basic structure of my Spark job is like this:

{code}
counter = 0

def batch_handler(df, batch_id):
    global counter
    counter += 1
    df.withColumn('counter', lit(counter)).show(truncate=30)

spark = (SparkSession.builder
    .appName('test.stateful.checkpoint')
    .config('spark.jars.packages', f'{KAFKA_SQL},{KAFKA_CLNT}')
    .getOrCreate())

source = (spark.readStream
    .format('kafka')
    .options(**KAFKA_OPTIONS)
    .option('subscribe', 'topic-spark-stateful')
    .option('startingOffsets', 'earliest')
    .option('includeHeaders', 'true')
    .load())

(source
    .selectExpr('CAST(value AS STRING) AS data', 'CAST(timestamp AS STRING) AS time')
    .writeStream
    .option('checkpointLocation', './checkpoints/stateful')
    .foreachBatch(batch_handler)
    .start()
    .awaitTermination())
{code}

where the simplified {{batch_handler}} function is a stand-in for the stateful transformation + writer to the actual data sink. Also for simplicity I am using a local folder as checkpoint location. This works fine as far as checkpointing of Kafka offsets is concerned. But how can I include the state of my custom batch handler ({{counter}} in my simplified example) in the checkpoints such that the job can pick up where it left off after a crash?
The [Spark Structured Streaming Guide|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing] doesn't say anything on the topic. With the [{{foreach}}|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] sink I can pass a custom row handler object but this seems to support only {{open}}, {{process}}, and {{close}} methods. Would it make sense to create a "Request" or even "Feature" ticket to enhance this with methods for restoring state from a checkpoint and exporting state to support checkpointing? PS: I have posted this on [SOF|https://stackoverflow.com/questions/74864425], too. If anyone cares to answer or comment I'd be happy to upvote their post. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
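Until such a hook exists, a common workaround is for the batch handler to persist its own state keyed by {{batch_id}} and to skip replayed batches after a restart, so the handler stays idempotent alongside Spark's offset checkpoints. A minimal sketch (class name, file layout and the JSON format are illustrative, not a Spark API):

```python
import json
import os

class StatefulBatchHandler:
    """Persist handler state next to the stream's checkpoint directory,
    keyed by batch_id so that a replayed batch does not double-count."""

    def __init__(self, state_path):
        self.state_path = state_path
        if os.path.exists(state_path):
            with open(state_path) as f:
                saved = json.load(f)
            self.counter = saved["counter"]
            self.last_batch_id = saved["last_batch_id"]
        else:
            self.counter = 0
            self.last_batch_id = -1

    def __call__(self, df, batch_id):
        if batch_id <= self.last_batch_id:
            return  # batch replayed after restart; skip to stay idempotent
        self.counter += 1
        # ... transform df using self.counter and write to the real sink ...
        self.last_batch_id = batch_id
        with open(self.state_path, "w") as f:
            json.dump({"counter": self.counter,
                       "last_batch_id": batch_id}, f)
```

An instance of this class would be passed to {{foreachBatch}} in place of the plain {{batch_handler}} function; on restart, the constructor restores the counter and the replayed micro-batch is ignored.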
[jira] [Resolved] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives
[ https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40599. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38034 [https://github.com/apache/spark/pull/38034] > Add multiTransform methods to TreeNode to generate alternatives > --- > > Key: SPARK-40599 > URL: https://issues.apache.org/jira/browse/SPARK-40599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives
[ https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40599: --- Assignee: Peter Toth > Add multiTransform methods to TreeNode to generate alternatives > --- > > Key: SPARK-40599 > URL: https://issues.apache.org/jira/browse/SPARK-40599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses
[ https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42066: Assignee: Apache Spark > The DATATYPE_MISMATCH error class contains inappropriate and duplicating > subclasses > --- > > Key: SPARK-42066 > URL: https://issues.apache.org/jira/browse/SPARK-42066 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong in > DATATYPE_MISMATCH and there is an error class with that same name. > We should rearrange the subclasses for this error class, which seems to have become > a bit of a dumping ground... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses
[ https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677717#comment-17677717 ] Apache Spark commented on SPARK-42066: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39625 > The DATATYPE_MISMATCH error class contains inappropriate and duplicating > subclasses > --- > > Key: SPARK-42066 > URL: https://issues.apache.org/jira/browse/SPARK-42066 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong in > DATATYPE_MISMATCH and there is an error class with that same name. > We should rearrange the subclasses for this error class, which seems to have become > a bit of a dumping ground... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses
[ https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42066: Assignee: (was: Apache Spark) > The DATATYPE_MISMATCH error class contains inappropriate and duplicating > subclasses > --- > > Key: SPARK-42066 > URL: https://issues.apache.org/jira/browse/SPARK-42066 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong in > DATATYPE_MISMATCH and there is an error class with that same name. > We should rearrange the subclasses for this error class, which seems to have become > a bit of a dumping ground... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42101: Assignee: Apache Spark > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > The first access to the cached plan with AQE enabled is tricky. Currently, > we can not preserve its output partitioning and ordering. > The whole query plan also misses lots of optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42101: Assignee: (was: Apache Spark) > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > The first access to the cached plan with AQE enabled is tricky. Currently, > we can not preserve its output partitioning and ordering. > The whole query plan also misses lots of optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677716#comment-17677716 ] Apache Spark commented on SPARK-42101: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/39624 > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > The first access to the cached plan with AQE enabled is tricky. Currently, > we can not preserve its output partitioning and ordering. > The whole query plan also misses lots of optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42101: -- Summary: Wrap InMemoryTableScanExec with QueryStage (was: Wrap InMemoryTableScanExec + AQE with QueryStage) > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > The first access to the cached plan with AQE enabled is tricky. Currently, > we can not preserve its output partitioning and ordering. > The whole query plan also misses lots of optimizations in the AQE framework. Wrapping > InMemoryTableScanExec + AQE in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage
[ https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42101: -- Description: The first access to the cached plan with AQE enabled is tricky. Currently, we can not preserve its output partitioning and ordering. The whole query plan also misses lots of optimizations in the AQE framework. Wrapping InMemoryTableScanExec in a query stage can resolve all these issues. was: The first access to the cached plan which is enable AQE is tricky. Currently, we can not preverse it's output partitioning and ordering. The whole query plan also missed lots of optimization in AQE framework. Wrap InMemoryTableScanExec + AQE to query stage can resolve all these issues. > Wrap InMemoryTableScanExec with QueryStage > -- > > Key: SPARK-42101 > URL: https://issues.apache.org/jira/browse/SPARK-42101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > The first access to the cached plan with AQE enabled is tricky. Currently, > we can not preserve its output partitioning and ordering. > The whole query plan also misses lots of optimizations in the AQE framework. Wrapping > InMemoryTableScanExec in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42101) Wrap InMemoryTableScanExec + AQE with QueryStage
XiDuo You created SPARK-42101: - Summary: Wrap InMemoryTableScanExec + AQE with QueryStage Key: SPARK-42101 URL: https://issues.apache.org/jira/browse/SPARK-42101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You The first access to the cached plan with AQE enabled is tricky. Currently, we can not preserve its output partitioning and ordering. The whole query plan also misses lots of optimizations in the AQE framework. Wrapping InMemoryTableScanExec + AQE in a query stage can resolve all these issues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases
[ https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677683#comment-17677683 ] Gabor Roczei commented on SPARK-38230: -- Hi [~ximz], > [~roczei] Can you please review the PR and let me know if I missed anything? >Thank you. I will try to allocate some time for this next week. > InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions > in most cases > --- > > Key: SPARK-38230 > URL: https://issues.apache.org/jira/browse/SPARK-38230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Coal Chan >Priority: Major > > In > `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`, > `sparkSession.sessionState.catalog.listPartitions` will call method > `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of hive metastore > client, this method will produce multiple queries per partition on hive > metastore db. So when you insert into a table which has too many > partitions(ie: 10k), it will produce too many queries on hive metastore > db(ie: n * 10k = 10nk), it puts a lot of strain on the database. > In fact, it calls method `listPartitions` in order to get locations of > partitions and get `customPartitionLocations`. But in most cases, we do not > have custom partitions, we can just get partition names, so we can call > method listPartitionNames. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
[ https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42100:

    Assignee: (was: Apache Spark)

> Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-42100
>                 URL: https://issues.apache.org/jira/browse/SPARK-42100
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Reproduce with:
> {code:bash}
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core \
>   -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none \
>   -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> {code}
> No test fails, but the following error is logged:
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener SQLAppStatusListener threw an exception
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
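The NPE arises because generated protobuf builders reject null in their setters, while `SQLExecutionUIData#description` can legitimately be null. A minimal, self-contained sketch of the kind of null guard the issue title calls for (the `Builder` stand-in and `setIfNotNull` helper are hypothetical, not Spark's actual API):

```java
import java.util.Objects;
import java.util.function.Consumer;

public class NullGuardSketch {
    // Stand-in for a generated protobuf builder: setters throw NPE on null,
    // mirroring StoreTypes.SQLExecutionUIData.Builder#setDescription.
    static final class Builder {
        private String description = "";
        Builder setDescription(String value) {
            this.description = Objects.requireNonNull(value);
            return this;
        }
        String getDescription() { return description; }
    }

    // The guard: invoke the setter only when the source value is non-null,
    // leaving the builder's default (empty string) in place otherwise.
    static void setIfNotNull(String value, Consumer<String> setter) {
        if (value != null) {
            setter.accept(value);
        }
    }

    public static void main(String[] args) {
        Builder builder = new Builder();
        setIfNotNull(null, builder::setDescription);       // skipped: no NPE
        setIfNotNull("my query", builder::setDescription); // applied normally
        System.out.println(builder.getDescription());
    }
}
```

With the guard, serializing a UI entry whose description is null simply leaves the protobuf field at its default instead of crashing the listener thread.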
[jira] [Commented] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
[ https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677657#comment-17677657 ]

Apache Spark commented on SPARK-42100:

User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/39623
[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`
[ https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42100:

    Assignee: Apache Spark