[jira] [Commented] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925

2023-01-17 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678114#comment-17678114
 ] 

Jungtaek Lim commented on SPARK-42105:
--

Let me set this to Blocker so that I won't miss it during the release phase.

> Document work (Release note & Guide doc) for SPARK-40925
> 
>
> Key: SPARK-42105
> URL: https://issues.apache.org/jira/browse/SPARK-42105
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> SPARK-40925 fixed the bug that introduced the major limitation described 
> in the guide doc:
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark]
> Many of the limitations described in the guide doc are resolved by SPARK-40925. We 
> even unblocked the functionality via SPARK-40940, so the doc is out of sync 
> with the codebase.
> We should update the guide doc to describe the new limitation, and we also 
> need to write a release note for SPARK-40925.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925

2023-01-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-42105:
-
Priority: Blocker  (was: Major)

> Document work (Release note & Guide doc) for SPARK-40925
> 
>
> Key: SPARK-42105
> URL: https://issues.apache.org/jira/browse/SPARK-42105
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> SPARK-40925 fixed the bug that introduced the major limitation described 
> in the guide doc:
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark]
> Many of the limitations described in the guide doc are resolved by SPARK-40925. We 
> even unblocked the functionality via SPARK-40940, so the doc is out of sync 
> with the codebase.
> We should update the guide doc to describe the new limitation, and we also 
> need to write a release note for SPARK-40925.






[jira] [Created] (SPARK-42105) Document work (Release note & Guide doc) for SPARK-40925

2023-01-17 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-42105:


 Summary: Document work (Release note & Guide doc) for SPARK-40925
 Key: SPARK-42105
 URL: https://issues.apache.org/jira/browse/SPARK-42105
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


SPARK-40925 fixed the bug that introduced the major limitation described in 
the guide doc:

[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#limitation-of-global-watermark]

Many of the limitations described in the guide doc are resolved by SPARK-40925. We 
even unblocked the functionality via SPARK-40940, so the doc is out of sync with 
the codebase.

We should update the guide doc to describe the new limitation, and we also 
need to write a release note for SPARK-40925.






[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`

2023-01-17 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-42100:
--

Assignee: Yang Jie

> Protect null `SQLExecutionUIData#description` in 
> `SQLExecutionUIDataSerializer`
> ---
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
>  
> No test failed, but the log contained this error:
>  
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at 
> org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at 
> org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception {code}
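The NullPointerException above originates in a generated protobuf builder, whose setters reject null input. Below is a minimal sketch of the null protection the ticket title describes, using a hypothetical builder class standing in for the generated `StoreTypes` code; this is illustrative, not the actual Spark patch:

```java
// Hypothetical stand-in for a generated protobuf builder: like the real
// generated code, its setter throws NullPointerException on null input.
public class NullSafeSerializer {
    static class Builder {
        private String description = "";
        Builder setDescription(String d) {
            if (d == null) throw new NullPointerException("description");
            this.description = d;
            return this;
        }
        String build() { return description; }
    }

    // Guarded serialization: only call the setter when the value is
    // present, so a null description no longer kills the listener.
    static String serialize(String description) {
        Builder builder = new Builder();
        if (description != null) {
            builder.setDescription(description);
        }
        return builder.build();
    }
}
```

With the guard in place, `serialize(null)` leaves the field at its protobuf default instead of throwing from inside the event listener.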






[jira] [Resolved] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`

2023-01-17 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-42100.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39623
[https://github.com/apache/spark/pull/39623]

> Protect null `SQLExecutionUIData#description` in 
> `SQLExecutionUIDataSerializer`
> ---
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
>  
> No test failed, but the log contained this error:
>  
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at 
> org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at 
> org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception {code}






[jira] [Assigned] (SPARK-42080) Add guideline for PySpark errors.

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42080:


Assignee: (was: Apache Spark)

> Add guideline for PySpark errors.
> -
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Add a guideline for PySpark errors.






[jira] [Commented] (SPARK-42080) Add guideline for PySpark errors.

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678098#comment-17678098
 ] 

Apache Spark commented on SPARK-42080:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39639

> Add guideline for PySpark errors.
> -
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Add a guideline for PySpark errors.






[jira] [Assigned] (SPARK-42080) Add guideline for PySpark errors.

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42080:


Assignee: Apache Spark

> Add guideline for PySpark errors.
> -
>
> Key: SPARK-42080
> URL: https://issues.apache.org/jira/browse/SPARK-42080
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Add a guideline for PySpark errors.






[jira] [Assigned] (SPARK-42078) Migrate errors thrown by JVM into PySpark Exception.

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42078:


Assignee: Haejoon Lee

> Migrate errors thrown by JVM into PySpark Exception.
> 
>
> Key: SPARK-42078
> URL: https://issues.apache.org/jira/browse/SPARK-42078
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should migrate all exceptions generated in PySpark into PySparkException.






[jira] [Resolved] (SPARK-42078) Migrate errors thrown by JVM into PySpark Exception.

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42078.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39591
[https://github.com/apache/spark/pull/39591]

> Migrate errors thrown by JVM into PySpark Exception.
> 
>
> Key: SPARK-42078
> URL: https://issues.apache.org/jira/browse/SPARK-42078
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> We should migrate all exceptions generated in PySpark into PySparkException.






[jira] [Updated] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2023-01-17 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-40885:
--
Affects Version/s: 3.4.0

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2, 3.4.0
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When writing data with dynamic partitions and sorting by both the partition 
> field and a data field, Spark drops the sort on the data field.
>  
> SQL to reproduce:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> STORED AS textfile
> LOCATION 'sort_table';
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> STORED AS textfile
> LOCATION 'test_table';
> -- generate test data
> insert into test_table partition(dt=20221011) select 10,"15" union all select 
> 1,"10" union all select 5,"50" union all select 20,"2" union all select 
> 30,"14";
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> -- this sql sorts by the partition field (`dt`) and the data field (`name`), 
> -- but the sort on `name` does not work
> insert overwrite table sort_table partition(dt) select id,name,dt from 
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually 
> two in the SQL. (See the attached image.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588






[jira] [Commented] (SPARK-41485) Unify the environment variable of *_PROTOC_EXEC_PATH

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678088#comment-17678088
 ] 

Apache Spark commented on SPARK-41485:
--

User 'WolverineJiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39036

> Unify the environment variable of *_PROTOC_EXEC_PATH
> 
>
> Key: SPARK-41485
> URL: https://issues.apache.org/jira/browse/SPARK-41485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Protobuf, Spark Core
>Affects Versions: 3.4.0
>Reporter: Haonan Jiang
>Priority: Minor
> Fix For: 3.4.0
>
>
> At present, there are three similar *_PROTOC_EXEC_PATH environment variables, 
> and they all use the same protobuf version. Since they are consistent at build 
> time, we can unify the variable names to simplify things.






[jira] [Commented] (SPARK-41485) Unify the environment variable of *_PROTOC_EXEC_PATH

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678089#comment-17678089
 ] 

Apache Spark commented on SPARK-41485:
--

User 'WolverineJiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39036

> Unify the environment variable of *_PROTOC_EXEC_PATH
> 
>
> Key: SPARK-41485
> URL: https://issues.apache.org/jira/browse/SPARK-41485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Protobuf, Spark Core
>Affects Versions: 3.4.0
>Reporter: Haonan Jiang
>Priority: Minor
> Fix For: 3.4.0
>
>
> At present, there are three similar *_PROTOC_EXEC_PATH environment variables, 
> and they all use the same protobuf version. Since they are consistent at build 
> time, we can unify the variable names to simplify things.






[jira] [Assigned] (SPARK-41777) Add Integration Tests

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41777:


Assignee: Apache Spark

> Add Integration Tests
> -
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Assignee: Apache Spark
>Priority: Major
>
> This requires us to add PyTorch as a testing dependency.






[jira] [Commented] (SPARK-41598) Migrate the errors from `pyspark/sql/functions.py` into error class.

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678082#comment-17678082
 ] 

Apache Spark commented on SPARK-41598:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Migrate the errors from `pyspark/sql/functions.py` into error class.
> 
>
> Key: SPARK-41598
> URL: https://issues.apache.org/jira/browse/SPARK-41598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Migrate the existing errors into new PySpark error framework.






[jira] [Commented] (SPARK-41598) Migrate the errors from `pyspark/sql/functions.py` into error class.

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678084#comment-17678084
 ] 

Apache Spark commented on SPARK-41598:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Migrate the errors from `pyspark/sql/functions.py` into error class.
> 
>
> Key: SPARK-41598
> URL: https://issues.apache.org/jira/browse/SPARK-41598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Migrate the existing errors into new PySpark error framework.






[jira] [Commented] (SPARK-41777) Add Integration Tests

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678083#comment-17678083
 ] 

Apache Spark commented on SPARK-41777:
--

User 'rithwik-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39637

> Add Integration Tests
> -
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to add PyTorch as a testing dependency.






[jira] [Assigned] (SPARK-41777) Add Integration Tests

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41777:


Assignee: (was: Apache Spark)

> Add Integration Tests
> -
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to add PyTorch as a testing dependency.






[jira] [Commented] (SPARK-41777) Add Integration Tests

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678085#comment-17678085
 ] 

Apache Spark commented on SPARK-41777:
--

User 'rithwik-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39637

> Add Integration Tests
> -
>
> Key: SPARK-41777
> URL: https://issues.apache.org/jira/browse/SPARK-41777
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to add PyTorch as a testing dependency.






[jira] [Commented] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678080#comment-17678080
 ] 

Apache Spark commented on SPARK-42082:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Introduce `PySparkValueError` and `PySparkTypeError`
> 
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Commented] (SPARK-41598) Migrate the errors from `pyspark/sql/functions.py` into error class.

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678081#comment-17678081
 ] 

Apache Spark commented on SPARK-41598:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Migrate the errors from `pyspark/sql/functions.py` into error class.
> 
>
> Key: SPARK-41598
> URL: https://issues.apache.org/jira/browse/SPARK-41598
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Migrate the existing errors into new PySpark error framework.






[jira] [Assigned] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42082:


Assignee: (was: Apache Spark)

> Introduce `PySparkValueError` and `PySparkTypeError`
> 
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Assigned] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42082:


Assignee: Apache Spark

> Introduce `PySparkValueError` and `PySparkTypeError`
> 
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Commented] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678079#comment-17678079
 ] 

Apache Spark commented on SPARK-42082:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39638

> Introduce `PySparkValueError` and `PySparkTypeError`
> 
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Updated] (SPARK-42082) Introduce `PySparkValueError` and `PySparkTypeError`

2023-01-17 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-42082:

Summary: Introduce `PySparkValueError` and `PySparkTypeError`  (was: Add 
PySparkValueError and PySparkTypeError)

> Introduce `PySparkValueError` and `PySparkTypeError`
> 
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Updated] (SPARK-42082) Add PySparkValueError and PySparkTypeError

2023-01-17 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-42082:

Summary: Add PySparkValueError and PySparkTypeError  (was: Migrate 
ValueError into PySparkValueError and manage the functions.py)

> Add PySparkValueError and PySparkTypeError
> --
>
> Key: SPARK-42082
> URL: https://issues.apache.org/jira/browse/SPARK-42082
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should migrate all Python built-in exceptions into PySparkException.






[jira] [Resolved] (SPARK-41485) Unify the environment variable of *_PROTOC_EXEC_PATH

2023-01-17 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-41485.
--
   Fix Version/s: 3.4.0
Target Version/s: 3.4.0
  Resolution: Fixed

[https://github.com/apache/spark/pull/39036] has solved this issue, but it used 
the wrong JIRA ID, so I am changing this to Fixed.

 

> Unify the environment variable of *_PROTOC_EXEC_PATH
> 
>
> Key: SPARK-41485
> URL: https://issues.apache.org/jira/browse/SPARK-41485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Protobuf, Spark Core
>Affects Versions: 3.4.0
>Reporter: Haonan Jiang
>Priority: Minor
> Fix For: 3.4.0
>
>
> At present, there are 3 similar *_PROTOC_EXEC_PATH environment variables, 
> but they all use the same protobuf version. Because they are kept consistent 
> during compilation, we can unify the environment variable names to simplify things.






[jira] [Updated] (SPARK-42029) Distribution build for Spark Connect does not work with Spark Shell

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42029:
-
Parent: (was: SPARK-39375)
Issue Type: Bug  (was: Sub-task)

> Distribution build for Spark Connect does not work with Spark Shell
> ---
>
> Key: SPARK-42029
> URL: https://issues.apache.org/jira/browse/SPARK-42029
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-42029) Distribution build for Spark Connect does not work with Spark Shell

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42029:
-
Parent: SPARK-41286
Issue Type: Sub-task  (was: Bug)

> Distribution build for Spark Connect does not work with Spark Shell
> ---
>
> Key: SPARK-42029
> URL: https://issues.apache.org/jira/browse/SPARK-42029
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-41727) ClassCastException when config spark.sql.hive.metastore* properties under jdk17

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41727:
-
Priority: Major  (was: Critical)

> ClassCastException when config spark.sql.hive.metastore* properties under 
> jdk17
> ---
>
> Key: SPARK-41727
> URL: https://issues.apache.org/jira/browse/SPARK-41727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
> Environment: Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0
>Reporter: kevinshin
>Priority: Major
> Attachments: hms-init-error.txt
>
>
> Apache spark3.3.1 \ HDP3.1.5 with hive 3.1.0
> When the spark.sql.hive.metastore* properties are configured to use 
> hive.metastore.version 3.1.2: 
> *spark.sql.hive.metastore.jars /data/soft/spark3/standalone-metastore/**
> *spark.sql.hive.metastore.version 3.1.2*
> and spark-shell is started with master = local[*] under JDK 17, 
> trying to select a Hive table fails with the following error:
> 13:44:52.428 [main] ERROR 
> org.apache.hadoop.hive.metastore.utils.MetaStoreUtils - Got exception: 
> java.lang.ClassCastException class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>         at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.resolveUris(HiveMetaStoreClient.java:262)
>  ~[hive-standalone-metastore-3.1.2.jar:3.1.2]
>  






[jira] [Reopened] (SPARK-41589) PyTorch Distributor

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41589:
--

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] 
> for more context.
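To make the goal concrete, here is a toy, single-machine sketch of the distributor control flow. The class and method names (`ToyDistributor`, `run`) are hypothetical stand-ins, not the project's API: a user-supplied training function is invoked once per worker with its rank and the world size, which is the part the real project would wire onto Spark tasks and torch.distributed.

```python
# Toy sketch (assumption: NOT the real API) of what a distributor does
# conceptually: run a user training function once per worker, with
# rank/world-size wiring handled for the user. The real project targets
# Spark tasks; this just illustrates the control flow.

from typing import Callable, List


class ToyDistributor:
    def __init__(self, num_processes: int):
        self.num_processes = num_processes

    def run(self, train_fn: Callable[[int, int], object]) -> List[object]:
        # The real distributor would launch each invocation as a Spark
        # barrier task; here we simply call the function per "worker".
        return [
            train_fn(rank, self.num_processes)
            for rank in range(self.num_processes)
        ]


def train(rank: int, world_size: int) -> str:
    # Stand-in for a PyTorch training loop.
    return f"worker {rank}/{world_size} done"
```

For example, `ToyDistributor(2).run(train)` invokes `train` twice, once per simulated worker.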






[jira] [Updated] (SPARK-41589) PyTorch Distributor

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41589:
-
Fix Version/s: (was: 3.4.0)

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] 
> for more context.






[jira] [Updated] (SPARK-41589) PyTorch Distributor

2023-01-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41589:
-
Priority: Critical  (was: Major)

> PyTorch Distributor
> ---
>
> Key: SPARK-41589
> URL: https://issues.apache.org/jira/browse/SPARK-41589
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Critical
>
> This is a project to make it easier for PySpark users to distribute PyTorch 
> code using PySpark. The corresponding [Design 
> Document|https://docs.google.com/document/d/1QPO1Ly8WteL6aIPvVcR7Xne9qVtJiB3fdrRn7NwBcpA/edit?usp=sharing]
>  can give more context. This was a project determined by the Databricks ML 
> Training Team; please reach out to [~gurwls223] (Spark-side) or [~erithwik] 
> for more context.






[jira] [Resolved] (SPARK-42061) Mark Expressions that have state has stateful

2023-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42061.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39630
[https://github.com/apache/spark/pull/39630]

> Mark Expressions that have state has stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-42061) Mark Expressions that have state has stateful

2023-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42061:
---

Assignee: (was: Wenchen Fan)

> Mark Expressions that have state has stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-42061) Mark Expressions that have state has stateful

2023-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42061:
---

Assignee: Wenchen Fan

> Mark Expressions that have state has stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Comment Edited] (SPARK-24941) Add RDDBarrier.coalesce() function

2023-01-17 Thread Erik Ordentlich (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678044#comment-17678044
 ] 

Erik Ordentlich edited comment on SPARK-24941 at 1/18/23 1:33 AM:
--

Anyone still planning to work on this?   cc [~mengxr] [~leewyang] 


was (Author: JIRAUSER287642):
Anyone still planning to work on this?   cc [~mengxr] 

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
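The proposed coalesce boils down to folding many input splits into a fixed number of partitions, so the barrier stage launches a predictable number of tasks. A toy sketch of that grouping (illustrative only — this is not Spark's implementation, just the partition-counting idea):

```python
# Sketch of coalescing input splits into a target number of partitions.
# Names are illustrative, not Spark internals.

from typing import List, TypeVar

T = TypeVar("T")


def coalesce(splits: List[T], num_partitions: int) -> List[List[T]]:
    # Round-robin assignment keeps partition sizes roughly balanced,
    # regardless of how many HDFS input splits arrive.
    out: List[List[T]] = [[] for _ in range(num_partitions)]
    for i, split in enumerate(splits):
        out[i % num_partitions].append(split)
    return out
```

Regardless of the input split count, the barrier stage would then always run `num_partitions` tasks.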






[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function

2023-01-17 Thread Erik Ordentlich (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678044#comment-17678044
 ] 

Erik Ordentlich commented on SPARK-24941:
-

Anyone still planning to work on this?   cc [~mengxr] 

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).






[jira] [Resolved] (SPARK-41596) Document the new feature "Async Progress Tracking" to Structured Streaming guide doc

2023-01-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-41596.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39538
[https://github.com/apache/spark/pull/39538]

> Document the new feature "Async Progress Tracking" to Structured Streaming 
> guide doc
> 
>
> Key: SPARK-41596
> URL: https://issues.apache.org/jira/browse/SPARK-41596
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Boyang Jerry Peng
>Priority: Blocker
> Fix For: 3.4.0
>
>
> Given that we merged the new SPIP feature SPARK-39591, we have to document 
> the new feature in the Structured Streaming guide doc so that end users can 
> refer to the doc and start experimenting with the feature.






[jira] [Assigned] (SPARK-41596) Document the new feature "Async Progress Tracking" to Structured Streaming guide doc

2023-01-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-41596:


Assignee: Boyang Jerry Peng

> Document the new feature "Async Progress Tracking" to Structured Streaming 
> guide doc
> 
>
> Key: SPARK-41596
> URL: https://issues.apache.org/jira/browse/SPARK-41596
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Boyang Jerry Peng
>Priority: Blocker
>
> Given that we merged the new SPIP feature SPARK-39591, we have to document 
> the new feature in the Structured Streaming guide doc so that end users can 
> refer to the doc and start experimenting with the feature.






[jira] [Commented] (SPARK-41822) Setup Scala/JVM Client Connection

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678038#comment-17678038
 ] 

Apache Spark commented on SPARK-41822:
--

User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/39635

> Setup Scala/JVM Client Connection
> -
>
> Key: SPARK-41822
> URL: https://issues.apache.org/jira/browse/SPARK-41822
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.0
>
>
> Set up the gRPC connection for the Scala/JVM client to enable communication 
> with the Spark Connect server. 






[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678032#comment-17678032
 ] 

Apache Spark commented on SPARK-42090:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39634

> Introduce sasl retry count in RetryingBlockTransferor
> -
>
> Key: SPARK-42090
> URL: https://issues.apache.org/jira/browse/SPARK-42090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> Previously a boolean variable, saslTimeoutSeen, was used in 
> RetryingBlockTransferor. However, the boolean variable wouldn't cover the 
> following scenario:
> 1. SaslTimeoutException
> 2. IOException
> 3. SaslTimeoutException
> 4. IOException
> Even though IOException at #2 is retried (resulting in increment of 
> retryCount), the retryCount would be cleared at step #4.
> Since the intention of saslTimeoutSeen is to undo the increment due to 
> retrying SaslTimeoutException, we should keep a counter for 
> SaslTimeoutException retries and subtract the value of this counter from 
> retryCount.
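The counting scheme described above can be sketched as follows. The actual fix lives in the Java class RetryingBlockTransferor; this Python model is only illustrative, and the field and method names are stand-ins:

```python
# Sketch of the retry-accounting logic: count SASL-timeout retries
# separately so their increments can be subtracted from the effective
# retry count, instead of a single boolean that a later IOException
# would reset incorrectly.

class RetryCounter:
    def __init__(self, max_retries: int):
        self.max_retries = max_retries
        self.retry_count = 0
        self.sasl_retries = 0  # replaces the old saslTimeoutSeen boolean

    def record_failure(self, is_sasl_timeout: bool) -> bool:
        """Record a failure; return True if another retry is allowed."""
        self.retry_count += 1
        if is_sasl_timeout:
            self.sasl_retries += 1
        # SASL-timeout retries should not count against the retry budget.
        effective = self.retry_count - self.sasl_retries
        return effective <= self.max_retries
```

In the four-step scenario above (SASL timeout, IOException, SASL timeout, IOException), the two SASL retries no longer erase the IOException retries from the budget.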






[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678030#comment-17678030
 ] 

Apache Spark commented on SPARK-42090:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39632

> Introduce sasl retry count in RetryingBlockTransferor
> -
>
> Key: SPARK-42090
> URL: https://issues.apache.org/jira/browse/SPARK-42090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> Previously a boolean variable, saslTimeoutSeen, was used in 
> RetryingBlockTransferor. However, the boolean variable wouldn't cover the 
> following scenario:
> 1. SaslTimeoutException
> 2. IOException
> 3. SaslTimeoutException
> 4. IOException
> Even though IOException at #2 is retried (resulting in increment of 
> retryCount), the retryCount would be cleared at step #4.
> Since the intention of saslTimeoutSeen is to undo the increment due to 
> retrying SaslTimeoutException, we should keep a counter for 
> SaslTimeoutException retries and subtract the value of this counter from 
> retryCount.






[jira] [Commented] (SPARK-41415) SASL Request Retries

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678028#comment-17678028
 ] 

Apache Spark commented on SPARK-41415:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39634

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Assignee: Aravind Patnam
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-42090) Introduce sasl retry count in RetryingBlockTransferor

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678029#comment-17678029
 ] 

Apache Spark commented on SPARK-42090:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39634

> Introduce sasl retry count in RetryingBlockTransferor
> -
>
> Key: SPARK-42090
> URL: https://issues.apache.org/jira/browse/SPARK-42090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.4.0
>
>
> Previously a boolean variable, saslTimeoutSeen, was used in 
> RetryingBlockTransferor. However, the boolean variable wouldn't cover the 
> following scenario:
> 1. SaslTimeoutException
> 2. IOException
> 3. SaslTimeoutException
> 4. IOException
> Even though IOException at #2 is retried (resulting in increment of 
> retryCount), the retryCount would be cleared at step #4.
> Since the intention of saslTimeoutSeen is to undo the increment due to 
> retrying SaslTimeoutException, we should keep a counter for 
> SaslTimeoutException retries and subtract the value of this counter from 
> retryCount.






[jira] [Commented] (SPARK-41415) SASL Request Retries

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678027#comment-17678027
 ] 

Apache Spark commented on SPARK-41415:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39634

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Assignee: Aravind Patnam
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41415) SASL Request Retries

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678026#comment-17678026
 ] 

Apache Spark commented on SPARK-41415:
--

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/39632

> SASL Request Retries
> 
>
> Key: SPARK-41415
> URL: https://issues.apache.org/jira/browse/SPARK-41415
> Project: Spark
>  Issue Type: Task
>  Components: Shuffle
>Affects Versions: 3.2.4
>Reporter: Aravind Patnam
>Assignee: Aravind Patnam
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42038:


Assignee: (was: Apache Spark)

> SPJ: Support partially clustered distribution
> -
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently the storage-partitioned join requires both sides to be fully 
> clustered on the partition values, that is, all input partitions reported by 
> a V2 data source shall be grouped by partition values before the join 
> happens. This could lead to data skew issues if a particular partition value 
> is associated with a large number of rows.
>  
> To combat this, we can introduce the idea of a partially clustered 
> distribution, which means that only one side of the join is required to be 
> fully clustered, while the other side is not. This allows Spark to increase 
> the parallelism of the join and avoid data skew.
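The difference between the two distributions can be modeled with a small sketch (illustrative only, not Spark internals): fully clustering merges all input partitions that share a partition value, while partial clustering leaves one side's partitions split, so a skewed value fans out across more join tasks.

```python
# Each input partition is a (partition_value, rows) pair. Fully
# clustering merges partitions sharing a value; leaving one side
# unclustered yields more (left, right) task pairs for skewed values.

from collections import defaultdict
from typing import Any, Dict, List, Tuple

Partition = Tuple[Any, List[Any]]


def fully_cluster(parts: List[Partition]) -> List[Partition]:
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for value, rows in parts:
        grouped[value].extend(rows)
    return sorted(grouped.items())


def join_tasks(left: List[Partition], right: List[Partition]) -> int:
    # One join task per pair of partitions with a matching value.
    return sum(1 for lv, _ in left for rv, _ in right if lv == rv)
```

With a skewed left side such as `[("a", [1]), ("a", [2]), ("b", [3])]`, clustering both sides yields one task per distinct value, while keeping the left side unclustered spreads value "a" across two tasks.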






[jira] [Assigned] (SPARK-42038) SPJ: Support partially clustered distribution

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42038:


Assignee: Apache Spark

> SPJ: Support partially clustered distribution
> -
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently the storage-partitioned join requires both sides to be fully 
> clustered on the partition values, that is, all input partitions reported by 
> a V2 data source shall be grouped by partition values before the join 
> happens. This could lead to data skew issues if a particular partition value 
> is associated with a large number of rows.
>  
> To combat this, we can introduce the idea of a partially clustered 
> distribution, which means that only one side of the join is required to be 
> fully clustered, while the other side is not. This allows Spark to increase 
> the parallelism of the join and avoid data skew.






[jira] [Commented] (SPARK-42038) SPJ: Support partially clustered distribution

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678025#comment-17678025
 ] 

Apache Spark commented on SPARK-42038:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/39633

> SPJ: Support partially clustered distribution
> -
>
> Key: SPARK-42038
> URL: https://issues.apache.org/jira/browse/SPARK-42038
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Priority: Major
>
> Currently the storage-partitioned join requires both sides to be fully 
> clustered on the partition values, that is, all input partitions reported by 
> a V2 data source shall be grouped by partition values before the join 
> happens. This could lead to data skew issues if a particular partition value 
> is associated with a large number of rows.
>  
> To combat this, we can introduce the idea of a partially clustered 
> distribution, which means that only one side of the join is required to be 
> fully clustered, while the other side is not. This allows Spark to increase 
> the parallelism of the join and avoid data skew.






[jira] [Assigned] (SPARK-42103) Add Instrumentation

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42103:


Assignee: (was: Apache Spark)

> Add Instrumentation
> ---
>
> Key: SPARK-42103
> URL: https://issues.apache.org/jira/browse/SPARK-42103
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Adding instrumentation






[jira] [Assigned] (SPARK-42103) Add Instrumentation

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42103:


Assignee: Apache Spark

> Add Instrumentation
> ---
>
> Key: SPARK-42103
> URL: https://issues.apache.org/jira/browse/SPARK-42103
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Assignee: Apache Spark
>Priority: Major
>
> Adding instrumentation






[jira] [Commented] (SPARK-42103) Add Instrumentation

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677995#comment-17677995
 ] 

Apache Spark commented on SPARK-42103:
--

User 'rithwik-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39629

> Add Instrumentation
> ---
>
> Key: SPARK-42103
> URL: https://issues.apache.org/jira/browse/SPARK-42103
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Adding instrumentation






[jira] [Commented] (SPARK-42061) Mark Expressions that have state has stateful

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677994#comment-17677994
 ] 

Apache Spark commented on SPARK-42061:
--

User 'lzlfred' has created a pull request for this issue:
https://github.com/apache/spark/pull/39630

> Mark Expressions that have state has stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-42061) Mark Expressions that have state has stateful

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42061:


Assignee: (was: Apache Spark)

> Mark Expressions that have state has stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-42061) Mark Expressions that have state as stateful

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42061:


Assignee: Apache Spark

> Mark Expressions that have state as stateful
> -
>
> Key: SPARK-42061
> URL: https://issues.apache.org/jira/browse/SPARK-42061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41776:
-
Description: 
This only requires us to call train() on each Spark task separately, without 
much preprocessing or postprocessing, because PyTorch Lightning handles that by 
itself.

 

Update: This was resolved by using `torch.distributed.run`

  was:This requires us to just call train() on each spark task separately 
without much preprocessing or postprocessing because PyTorch Lightning handles 
that by itself.


> Implement support for PyTorch Lightning
> ---
>
> Key: SPARK-41776
> URL: https://issues.apache.org/jira/browse/SPARK-41776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to just call train() on each spark task separately without 
> much preprocessing or postprocessing because PyTorch Lightning handles that 
> by itself.
>  
> Update: This was resolved by using `torch.distributed.run`






[jira] [Updated] (SPARK-41776) Implement support for PyTorch Lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-41776:
-
Description: This requires us to just call train() on each spark task 
separately without much preprocessing or postprocessing because PyTorch 
Lightning handles that by itself.  (was: This requires us to just call train() 
on each spark task separately without much preprocessing or postprocessing 
because PyTorch Lightning handles that by itself.

 

Update: This was resolved by using `torch.distributed.run`)

> Implement support for PyTorch Lightning
> ---
>
> Key: SPARK-41776
> URL: https://issues.apache.org/jira/browse/SPARK-41776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> This requires us to just call train() on each spark task separately without 
> much preprocessing or postprocessing because PyTorch Lightning handles that 
> by itself.






[jira] [Resolved] (SPARK-41915) Change API so that the user doesn't have to explicitly set pytorch-lightning

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani resolved SPARK-41915.
--
Resolution: Fixed

This is already resolved within 
https://issues.apache.org/jira/browse/SPARK-41590.

> Change API so that the user doesn't have to explicitly set pytorch-lightning
> 
>
> Key: SPARK-41915
> URL: https://issues.apache.org/jira/browse/SPARK-41915
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.4.0
>Reporter: Rithwik Ediga Lakhamsani
>Priority: Major
>
> Remove the `framework` parameter from the API and have cloudpickle 
> automatically detect whether the user code has a dependency on PyTorch 
> Lightning.






[jira] [Comment Edited] (SPARK-40687) Support data masking built-in Function 'mask'

2023-01-17 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677988#comment-17677988
 ] 

Vinod KC edited comment on SPARK-40687 at 1/17/23 9:29 PM:
---

Note: In the udf 'mask', using -1 as the ignore parameter for a String-type 
argument is not standard. 
Please refer to SPARK-42070, which changes the default value of the argument of 
the *mask* udf from -1 to NULL.

 

 


was (Author: vinodkc):
Note: Please refer to 
[SPARK-42070|https://issues.apache.org/jira/browse/SPARK-42070], which changes 
the default value of the argument of the *mask* udf from -1 to NULL.

 

> Support data masking built-in Function  'mask'
> --
>
> Key: SPARK-40687
> URL: https://issues.apache.org/jira/browse/SPARK-40687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 3.4.0
>
>
> Support data masking built-in Function  *mask*
> Return a masked version of str. By default, upper case letters should be 
> converted to "X", lower case letters should be converted to "x" and numbers 
> should be converted to "n". For example, mask("abcd-EFGH-8765-4321") results 
> in xxxx-XXXX-nnnn-nnnn. It should be possible to override the characters used in the 
> mask by supplying additional arguments: the second argument controls the mask 
> character for upper case letters, the third argument for lower case letters 
> and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", 
> "U", "l", "#") should result in llll-UUUU-####-####.
>  






[jira] [Commented] (SPARK-40687) Support data masking built-in Function 'mask'

2023-01-17 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677988#comment-17677988
 ] 

Vinod KC commented on SPARK-40687:
--

Note: Please refer to 
[SPARK-42070|https://issues.apache.org/jira/browse/SPARK-42070], which changes 
the default value of the argument of the *mask* udf from -1 to NULL.

 

> Support data masking built-in Function  'mask'
> --
>
> Key: SPARK-40687
> URL: https://issues.apache.org/jira/browse/SPARK-40687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 3.4.0
>
>
> Support data masking built-in Function  *mask*
> Return a masked version of str. By default, upper case letters should be 
> converted to "X", lower case letters should be converted to "x" and numbers 
> should be converted to "n". For example, mask("abcd-EFGH-8765-4321") results 
> in xxxx-XXXX-nnnn-nnnn. It should be possible to override the characters used in the 
> mask by supplying additional arguments: the second argument controls the mask 
> character for upper case letters, the third argument for lower case letters 
> and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", 
> "U", "l", "#") should result in llll-UUUU-####-####.
>  
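The default and overridden behaviours described above can be checked against a 
small pure-Python reference implementation (a sketch of the proposed semantics, 
not Spark's actual code; the `mask` helper below is hypothetical):

```python
def mask(s, upper="X", lower="x", digit="n"):
    """Reference sketch of the proposed mask semantics: upper-case letters,
    lower-case letters and digits are replaced by the given mask characters;
    every other character is kept as-is."""
    out = []
    for ch in s:
        if ch.isupper():
            out.append(upper)
        elif ch.islower():
            out.append(lower)
        elif ch.isdigit():
            out.append(digit)
        else:
            out.append(ch)
    return "".join(out)

print(mask("abcd-EFGH-8765-4321"))                 # xxxx-XXXX-nnnn-nnnn
print(mask("abcd-EFGH-8765-4321", "U", "l", "#"))  # llll-UUUU-####-####
```

The two calls reproduce the examples from the description: the defaults, and the 
overridden mask characters for upper case, lower case and digits.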






[jira] [Updated] (SPARK-42070) Change the default value of argument of Mask udf from -1 to NULL

2023-01-17 Thread Vinod KC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-42070:
-
Description: 
In the udf 'mask', using -1 as the ignore parameter for a String-type argument 
is not standard; hence, it is better to change the value of the ignore argument 
from -1 to NULL.

Note: SPARK-40687 recently implemented the udf *mask*, which uses -1 as the 
default argument to ignore the masking option. As no Spark release has 
occurred since then, this new change will not cause backward-compatibility 
issues.

  was:
In the udf 'mask', using -1 as the ignore parameter for a String-type argument 
is not standard; hence it is better to change the value of the ignore argument 
from -1 to NULL.

 


> Change the default value of argument of Mask udf from -1 to NULL
> 
>
> Key: SPARK-42070
> URL: https://issues.apache.org/jira/browse/SPARK-42070
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> In the udf 'mask', using -1 as the ignore parameter for a String-type 
> argument is not standard; hence, it is better to change the value of the 
> ignore argument from -1 to NULL.
> Note: SPARK-40687 recently implemented the udf *mask*, which uses -1 as the 
> default argument to ignore the masking option. As no Spark release has 
> occurred since then, this new change will not cause backward-compatibility 
> issues.






[jira] [Resolved] (SPARK-42039) SPJ: Remove Option in KeyGroupedPartitioning#partitionValues

2023-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42039.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39540
[https://github.com/apache/spark/pull/39540]

> SPJ: Remove Option in KeyGroupedPartitioning#partitionValues
> 
>
> Key: SPARK-42039
> URL: https://issues.apache.org/jira/browse/SPARK-42039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently {{KeyGroupedPartitioning#partitionValuesOpt}} is an 
> {{Option[Seq[InternalRow]]}}. This is unnecessary since it is always set. 
> This proposes to replace it with {{Seq[InternalRow]}}.






[jira] [Assigned] (SPARK-42039) SPJ: Remove Option in KeyGroupedPartitioning#partitionValues

2023-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42039:
-

Assignee: Chao Sun

> SPJ: Remove Option in KeyGroupedPartitioning#partitionValues
> 
>
> Key: SPARK-42039
> URL: https://issues.apache.org/jira/browse/SPARK-42039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> Currently {{KeyGroupedPartitioning#partitionValuesOpt}} is an 
> {{Option[Seq[InternalRow]]}}. This is unnecessary since it is always set. 
> This proposes to replace it with {{Seq[InternalRow]}}.






[jira] [Created] (SPARK-42104) Throw ExecutorDeadException in fetchBlocks when executor dead

2023-01-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-42104:


 Summary: Throw ExecutorDeadException in fetchBlocks when executor 
dead
 Key: SPARK-42104
 URL: https://issues.apache.org/jira/browse/SPARK-42104
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Zhongwei Zhu


When fetchBlocks fails due to an IOException, an ExecutorDeadException will be 
thrown if the executor is dead.

There are other cases where a dead executor causes a TimeoutException or other 
exceptions.
{code:java}
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: 
Waited 3 milliseconds (plus 143334 nanoseconds delay) for 
SettableFuture@624de392[status=PENDING]
    at org.sparkproject.guava.base.Throwables.propagate(Throwables.java:243)
    at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:293)
    at 
org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:113)
    at 
org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:80)
    at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:300)
    at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:126)
    at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
    at 
org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
    at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 {code}






[jira] [Commented] (SPARK-40264) Add helper function for DL model inference in pyspark.ml.functions

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677905#comment-17677905
 ] 

Apache Spark commented on SPARK-40264:
--

User 'leewyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39628

> Add helper function for DL model inference in pyspark.ml.functions
> --
>
> Key: SPARK-40264
> URL: https://issues.apache.org/jira/browse/SPARK-40264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.2
>Reporter: Lee Yang
>Priority: Minor
>
> Add a helper function to create a pandas_udf for inference on a given DL 
> model, where the user provides a predict function that is responsible for 
> loading the model and inferring on a batch of numpy inputs.






[jira] [Created] (SPARK-42103) Add Instrumentation

2023-01-17 Thread Rithwik Ediga Lakhamsani (Jira)
Rithwik Ediga Lakhamsani created SPARK-42103:


 Summary: Add Instrumentation
 Key: SPARK-42103
 URL: https://issues.apache.org/jira/browse/SPARK-42103
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.4.0
Reporter: Rithwik Ediga Lakhamsani


Adding instrumentation






[jira] [Assigned] (SPARK-42092) Upgrade RoaringBitmap to 0.9.38

2023-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42092:


Assignee: Yang Jie

> Upgrade RoaringBitmap to 0.9.38
> ---
>
> Key: SPARK-42092
> URL: https://issues.apache.org/jira/browse/SPARK-42092
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.36...0.9.38






[jira] [Resolved] (SPARK-42092) Upgrade RoaringBitmap to 0.9.38

2023-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42092.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39613
[https://github.com/apache/spark/pull/39613]

> Upgrade RoaringBitmap to 0.9.38
> ---
>
> Key: SPARK-42092
> URL: https://issues.apache.org/jira/browse/SPARK-42092
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.36...0.9.38






[jira] [Commented] (SPARK-42098) ResolveInlineTables should handle RuntimeReplaceable

2023-01-17 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677856#comment-17677856
 ] 

Daniel commented on SPARK-42098:


[~cloud_fan]  [~srielau]  I can help with this if you guys need help

> ResolveInlineTables should handle RuntimeReplaceable
> 
>
> Key: SPARK-42098
> URL: https://issues.apache.org/jira/browse/SPARK-42098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Priority: Major
>
> spark-sql> VALUES (try_divide(5, 0));
> cannot evaluate expression try_divide(5, 0) in inline table definition; line 
> 1 pos 8






[jira] [Commented] (SPARK-41993) Move RowEncoder to AgnosticEncoders

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677823#comment-17677823
 ] 

Apache Spark commented on SPARK-41993:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/39627

> Move RowEncoder to AgnosticEncoders
> ---
>
> Key: SPARK-41993
> URL: https://issues.apache.org/jira/browse/SPARK-41993
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.4.0
>
>
> Move RowEncoder to the AgnosticEncoder framework.






[jira] [Created] (SPARK-42102) Using checkpoints in Spark Structured Streaming with the foreachBatch sink

2023-01-17 Thread Kai-Michael Roesner (Jira)
Kai-Michael Roesner created SPARK-42102:
---

 Summary: Using checkpoints in Spark Structured Streaming with the 
foreachBatch sink
 Key: SPARK-42102
 URL: https://issues.apache.org/jira/browse/SPARK-42102
 Project: Spark
  Issue Type: Question
  Components: PySpark, Structured Streaming
Affects Versions: 3.3.1
Reporter: Kai-Michael Roesner


I want to build a fault-tolerant, recoverable Spark job (using Structured 
Streaming in PySpark) that reads a data stream from Kafka and uses the 
[{{foreachBatch}}|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 sink to implement a stateful transformation before writing the resulting data 
to the actual sink.

The basic structure of my Spark job is like this:
{code}
counter = 0

def batch_handler(df, batch_id):
  global counter
  counter += 1
  df.withColumn('counter', lit(counter)).show(truncate=30)

spark = (SparkSession.builder
  .appName('test.stateful.checkpoint')
  .config('spark.jars.packages', f'{KAFKA_SQL},{KAFKA_CLNT}')
  .getOrCreate())

source = (spark.readStream
  .format('kafka')
  .options(**KAFKA_OPTIONS)
  .option('subscribe', 'topic-spark-stateful')
  .option('startingOffsets', 'earliest')
  .option('includeHeaders', 'true')
  .load())

(source
  .selectExpr('CAST(value AS STRING) AS data', 'CAST(timestamp AS STRING) AS time')
  .writeStream
  .option('checkpointLocation', './checkpoints/stateful')
  .foreachBatch(batch_handler)
  .start()
  .awaitTermination())
{code}

where the simplified {{batch_handler}} function is a stand-in for the stateful 
transformation + writer to the actual data sink. Also for simplicity I am using 
a local folder as checkpoint location. 

This works fine as far as checkpointing of Kafka offsets is concerned. But how 
can I include the state of my custom batch handler ({{counter}} in my 
simplified example) in the checkpoints such that the job can pick up where it 
left after a crash?

The [Spark Structured Streaming 
Guide|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing]
 doesn't say anything on the topic. With the 
[{{foreach}}|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 sink I can pass a custom row handler object, but this seems to support only 
{{open}}, {{process}}, and {{close}} methods.

Would it make sense to create a "Request" or even "Feature" ticket to enhance 
this with methods for restoring state from a checkpoint and exporting state to 
support checkpointing?

PS: I have posted this on [SOF|https://stackoverflow.com/questions/74864425], 
too. If anyone cares to answer or comment I'd be happy to upvote their post.
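Until there is first-class support, one common approach is to persist the 
handler's state externally, keyed by batch_id, and make the handler idempotent 
on replay. A minimal file-based sketch (assumptions: a single-writer job and a 
driver-local or shared filesystem; `handler_state.json` is a hypothetical path):

```python
import json
import os

# Hypothetical state file kept next to the streaming checkpoint directory.
STATE_FILE = "./checkpoints/stateful/handler_state.json"
os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)

def load_state():
    # Restore custom handler state after a restart; default to a fresh counter.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"counter": 0, "last_batch_id": -1}

def save_state(state):
    # Write via a temp file plus atomic rename so a crash mid-write
    # cannot corrupt the persisted state.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

def batch_handler(df, batch_id):
    state = load_state()
    if batch_id <= state["last_batch_id"]:
        return  # batch replayed after recovery; skip to stay idempotent
    state["counter"] += 1
    state["last_batch_id"] = batch_id
    # ... transform df using state["counter"] and write to the real sink ...
    save_state(state)
```

Because foreachBatch may re-run the last batch after recovery, guarding on 
last_batch_id keeps the counter consistent with the Kafka offsets Spark 
restores from its own checkpoint.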






[jira] [Resolved] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives

2023-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40599.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38034
[https://github.com/apache/spark/pull/38034]

> Add multiTransform methods to TreeNode to generate alternatives
> ---
>
> Key: SPARK-40599
> URL: https://issues.apache.org/jira/browse/SPARK-40599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40599) Add multiTransform methods to TreeNode to generate alternatives

2023-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40599:
---

Assignee: Peter Toth

> Add multiTransform methods to TreeNode to generate alternatives
> ---
>
> Key: SPARK-40599
> URL: https://issues.apache.org/jira/browse/SPARK-40599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>







[jira] [Assigned] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42066:


Assignee: Apache Spark

> The DATATYPE_MISMATCH error class contains inappropriate and duplicating 
> subclasses
> ---
>
> Key: SPARK-42066
> URL: https://issues.apache.org/jira/browse/SPARK-42066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> The subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong 
> in DATATYPE_MISMATCH, and there is a top-level error class with that same name.
> We should review the subclasses of this error class, which seems to have 
> become a bit of a dumping ground...






[jira] [Commented] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677717#comment-17677717
 ] 

Apache Spark commented on SPARK-42066:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39625

> The DATATYPE_MISMATCH error class contains inappropriate and duplicating 
> subclasses
> ---
>
> Key: SPARK-42066
> URL: https://issues.apache.org/jira/browse/SPARK-42066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> The subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong 
> in DATATYPE_MISMATCH, and there is a top-level error class with that same name.
> We should review the subclasses of this error class, which seems to have 
> become a bit of a dumping ground...






[jira] [Assigned] (SPARK-42066) The DATATYPE_MISMATCH error class contains inappropriate and duplicating subclasses

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42066:


Assignee: (was: Apache Spark)

> The DATATYPE_MISMATCH error class contains inappropriate and duplicating 
> subclasses
> ---
>
> Key: SPARK-42066
> URL: https://issues.apache.org/jira/browse/SPARK-42066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> The subclass WRONG_NUM_ARGS (with suggestions) semantically does not belong 
> in DATATYPE_MISMATCH, and there is a top-level error class with that same name.
> We should review the subclasses of this error class, which seems to have 
> become a bit of a dumping ground...






[jira] [Assigned] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42101:


Assignee: Apache Spark

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> The first access to a cached plan that has AQE enabled is tricky. Currently, 
> we cannot preserve its output partitioning and ordering.
> The whole query plan also misses lots of optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.






[jira] [Assigned] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42101:


Assignee: (was: Apache Spark)

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> The first access to a cached plan that has AQE enabled is tricky. Currently, 
> we cannot preserve its output partitioning and ordering.
> The whole query plan also misses lots of optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.






[jira] [Commented] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677716#comment-17677716
 ] 

Apache Spark commented on SPARK-42101:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/39624

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> The first access to a cached plan that has AQE enabled is tricky. Currently, 
> we cannot preserve its output partitioning and ordering.
> The whole query plan also misses lots of optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.






[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-01-17 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-42101:
--
Summary: Wrap InMemoryTableScanExec with QueryStage  (was: Wrap 
InMemoryTableScanExec + AQE with QueryStage)

> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> The first access to the cached plan with AQE enabled is tricky. Currently, 
> we cannot preserve its output partitioning and ordering.
> The whole query plan also misses lots of optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec + AQE in a query stage can resolve all these issues.






[jira] [Updated] (SPARK-42101) Wrap InMemoryTableScanExec with QueryStage

2023-01-17 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-42101:
--
Description: 
The first access to the cached plan with AQE enabled is tricky. Currently, 
we cannot preserve its output partitioning and ordering.

The whole query plan also misses lots of optimizations in the AQE framework. 
Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.

  was:
The first access to the cached plan with AQE enabled is tricky. Currently, 
we cannot preserve its output partitioning and ordering.

The whole query plan also misses lots of optimizations in the AQE framework. 
Wrapping InMemoryTableScanExec + AQE in a query stage can resolve all these issues.


> Wrap InMemoryTableScanExec with QueryStage
> --
>
> Key: SPARK-42101
> URL: https://issues.apache.org/jira/browse/SPARK-42101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> The first access to the cached plan with AQE enabled is tricky. Currently, 
> we cannot preserve its output partitioning and ordering.
> The whole query plan also misses lots of optimizations in the AQE framework. 
> Wrapping InMemoryTableScanExec in a query stage can resolve all these issues.






[jira] [Created] (SPARK-42101) Wrap InMemoryTableScanExec + AQE with QueryStage

2023-01-17 Thread XiDuo You (Jira)
XiDuo You created SPARK-42101:
-

 Summary: Wrap InMemoryTableScanExec + AQE with QueryStage
 Key: SPARK-42101
 URL: https://issues.apache.org/jira/browse/SPARK-42101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


The first access to the cached plan with AQE enabled is tricky. Currently, 
we cannot preserve its output partitioning and ordering.

The whole query plan also misses lots of optimizations in the AQE framework. 
Wrapping InMemoryTableScanExec + AQE in a query stage can resolve all these issues.






[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-01-17 Thread Gabor Roczei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677683#comment-17677683
 ] 

Gabor Roczei commented on SPARK-38230:
--

Hi [~ximz],

> [~roczei] Can you please review the PR and let me know if I missed anything? 
> Thank you.

I will try to allocate some time for this next week. 

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` calls the 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` method of the Hive 
> metastore client. This method produces multiple queries per partition on the 
> Hive metastore DB, so when you insert into a table that has many 
> partitions (e.g. 10k), it produces a very large number of queries on the 
> metastore DB (e.g. n * 10k = 10nk), which puts a lot of strain on the database.
> In fact, it calls `listPartitions` only to get the locations of partitions and 
> compute `customPartitionLocations`. But in most cases there are no custom 
> partition locations, so partition names are enough and we can call 
> `listPartitionNames` instead.
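The cost difference can be illustrated with a toy metastore that counts simulated DB round trips (illustrative Scala only; `ToyMetastore` and its per-partition query count of 3 are hypothetical, not the real Hive client):

```scala
// Toy metastore illustrating why listing only partition *names* is far
// cheaper than fetching full partition metadata. The real client is the
// Hive metastore API; this just models the query counts.
class ToyMetastore(partitions: Map[String, String]) {
  var queries = 0 // number of simulated DB round trips

  // Full details: several queries per partition (storage descriptor,
  // parameters, location, ...). We assume 3 per partition here.
  def listPartitions(): Seq[(String, String)] =
    partitions.toSeq.map { case (name, loc) =>
      queries += 3
      (name, loc)
    }

  // Names only: a single query regardless of partition count.
  def listPartitionNames(): Seq[String] = {
    queries += 1
    partitions.keys.toSeq
  }
}
```

With 10k partitions the full listing costs tens of thousands of queries in this model, while the name-only listing stays constant, which is the strain-reduction argument the ticket makes.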






[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42100:


Assignee: (was: Apache Spark)

> Protect null `SQLExecutionUIData#description` in 
> `SQLExecutionUIDataSerializer`
> ---
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> No test failed, but there were some error messages:
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at 
> org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at 
> org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception {code}
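The guard the title suggests can be sketched as follows (illustrative Scala; `UIDataBuilder` is a hypothetical stand-in for the generated protobuf builder, which rejects null values the same way `StoreTypes.SQLExecutionUIData.Builder#setDescription` does):

```scala
// Hypothetical builder mimicking a generated protobuf builder:
// generated setters throw NullPointerException on null input.
class UIDataBuilder {
  private var description: String = ""
  def setDescription(d: String): UIDataBuilder = {
    if (d == null) throw new NullPointerException("description is null")
    description = d
    this
  }
  def build(): String = s"UIData(description=$description)"
}

object NullSafeSerialize {
  // The protection: only invoke the setter when the value is non-null,
  // mirroring the fix proposed for SQLExecutionUIDataSerializer.
  def serialize(description: String): String = {
    val builder = new UIDataBuilder
    Option(description).foreach(builder.setDescription)
    builder.build()
  }
}
```

Wrapping the nullable field in `Option(...)` before calling the setter avoids the NPE while leaving the protobuf field at its default when the description is absent.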






[jira] [Commented] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`

2023-01-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677657#comment-17677657
 ] 

Apache Spark commented on SPARK-42100:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39623

> Protect null `SQLExecutionUIData#description` in 
> `SQLExecutionUIDataSerializer`
> ---
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> No test failed, but there were some error messages:
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at 
> org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at 
> org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception {code}






[jira] [Assigned] (SPARK-42100) Protect null `SQLExecutionUIData#description` in `SQLExecutionUIDataSerializer`

2023-01-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42100:


Assignee: Apache Spark

> Protect null `SQLExecutionUIData#description` in 
> `SQLExecutionUIDataSerializer`
> ---
>
> Key: SPARK-42100
> URL: https://issues.apache.org/jira/browse/SPARK-42100
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> export LIVE_UI_LOCAL_STORE_DIR=/tmp/spark-ui
> mvn clean install -pl sql/core 
> -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -Dtest=none 
> -DwildcardSuites=org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff -am
> No test failed, but there were some error messages:
> {code:java}
> 14:46:44.514 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception
> java.lang.NullPointerException
>     at 
> org.apache.spark.status.protobuf.StoreTypes$SQLExecutionUIData$Builder.setDescription(StoreTypes.java:46500)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:34)
>     at 
> org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer.serialize(SQLExecutionUIDataSerializer.scala:28)
>     at 
> org.apache.spark.status.protobuf.KVStoreProtobufSerializer.serialize(KVStoreProtobufSerializer.scala:30)
>     at org.apache.spark.util.kvstore.RocksDB.write(RocksDB.java:188)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:123)
>     at 
> org.apache.spark.status.ElementTrackingStore.write(ElementTrackingStore.scala:127)
>     at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.update(SQLAppStatusListener.scala:456)
>     at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.onJobStart(SQLAppStatusListener.scala:124)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>     at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>     at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
>     at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>     at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1444)
>     at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> 14:46:44.936 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
> SQLAppStatusListener threw an exception {code}


