[jira] [Assigned] (SPARK-44361) Use PartitionEvaluator API in MapInBatchExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44361:
---

Assignee: Vinod KC

> Use PartitionEvaluator API in MapInBatchExec
> -
>
> Key: SPARK-44361
> URL: https://issues.apache.org/jira/browse/SPARK-44361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
>
> Use PartitionEvaluator API in
> `MapInBatchExec`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44361) Use PartitionEvaluator API in MapInBatchExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44361.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42024
[https://github.com/apache/spark/pull/42024]

> Use PartitionEvaluator API in MapInBatchExec
> -
>
> Key: SPARK-44361
> URL: https://issues.apache.org/jira/browse/SPARK-44361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
> Fix For: 3.5.0
>
>
> Use PartitionEvaluator API in
> `MapInBatchExec`






[jira] [Resolved] (SPARK-44411) Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44411.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41998
[https://github.com/apache/spark/pull/41998]

> Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec
> --
>
> Key: SPARK-44411
> URL: https://issues.apache.org/jira/browse/SPARK-44411
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
> Fix For: 4.0.0
>
>
> Use PartitionEvaluator API in
> `ArrowEvalPythonExec`
> `BatchEvalPythonExec`






[jira] [Assigned] (SPARK-44411) Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44411:
---

Assignee: Vinod KC

> Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec
> --
>
> Key: SPARK-44411
> URL: https://issues.apache.org/jira/browse/SPARK-44411
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
>
> Use PartitionEvaluator API in
> `ArrowEvalPythonExec`
> `BatchEvalPythonExec`






[jira] [Assigned] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44375:
---

Assignee: Jia Fan

> Use PartitionEvaluator API in DebugExec
> ---
>
> Key: SPARK-44375
> URL: https://issues.apache.org/jira/browse/SPARK-44375
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in DebugExec






[jira] [Resolved] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44375.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41949
[https://github.com/apache/spark/pull/41949]

> Use PartitionEvaluator API in DebugExec
> ---
>
> Key: SPARK-44375
> URL: https://issues.apache.org/jira/browse/SPARK-44375
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
> Fix For: 4.0.0
>
>
> Use PartitionEvaluator API in DebugExec






[jira] [Resolved] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44474.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42063
[https://github.com/apache/spark/pull/42063]

> Reenable "Test observe response" at SparkConnectServiceSuite
> 
>
> Key: SPARK-44474
> URL: https://issues.apache.org/jira/browse/SPARK-44474
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.5.0, 4.0.0
>
>
> [https://github.com/apache/spark/pull/41443] apparently made the test flaky 
> (or caused it to fail). We should reenable it.






[jira] [Assigned] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44474:


Assignee: Hyukjin Kwon

> Reenable "Test observe response" at SparkConnectServiceSuite
> 
>
> Key: SPARK-44474
> URL: https://issues.apache.org/jira/browse/SPARK-44474
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> [https://github.com/apache/spark/pull/41443] apparently made the test flaky 
> (or caused it to fail). We should reenable it.






[jira] [Commented] (SPARK-44264) DeepSpeed Distributor

2023-07-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744410#comment-17744410
 ] 

Hudson commented on SPARK-44264:


User 'mathewjacob1002' has created a pull request for this issue:
https://github.com/apache/spark/pull/42067

> DeepSpeed Distributor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
> Attachments: Trying to Run Deepspeed Funcs.html
>
>
> Make it easier for PySpark users to run distributed training and inference 
> with DeepSpeed on Spark clusters. This project was scoped by the Databricks 
> ML Training Team.






[jira] [Updated] (SPARK-44264) DeepSpeed Distributor

2023-07-18 Thread Rithwik Ediga Lakhamsani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rithwik Ediga Lakhamsani updated SPARK-44264:
-
Attachment: Trying to Run Deepspeed Funcs.html

> DeepSpeed Distributor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
> Attachments: Trying to Run Deepspeed Funcs.html
>
>
> Make it easier for PySpark users to run distributed training and inference 
> with DeepSpeed on Spark clusters. This project was scoped by the Databricks 
> ML Training Team.






[jira] [Assigned] (SPARK-44401) Arrow Python UDF Use Guide

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44401:


Assignee: Xinrong Meng

> Arrow Python UDF Use Guide
> --
>
> Key: SPARK-44401
> URL: https://issues.apache.org/jira/browse/SPARK-44401
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-44401) Arrow Python UDF Use Guide

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44401.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 41974
[https://github.com/apache/spark/pull/41974]

> Arrow Python UDF Use Guide
> --
>
> Key: SPARK-44401
> URL: https://issues.apache.org/jira/browse/SPARK-44401
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>







[jira] [Resolved] (SPARK-44464) Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value

2023-07-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44464.
--
Fix Version/s: 3.5.0
 Assignee: Siying Dong
   Resolution: Fixed

Issue resolved via [https://github.com/apache/spark/pull/42046]

 

> Fix applyInPandasWithStatePythonRunner to output rows that have Null as first 
> column value
> --
>
> Key: SPARK-44464
> URL: https://issues.apache.org/jira/browse/SPARK-44464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.3
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
> Fix For: 3.5.0
>
>
> The current implementation of {{ApplyInPandasWithStatePythonRunner}} cannot 
> handle output rows whose first column is {{{}null{}}}, because it cannot 
> distinguish a genuinely null column from a field left unset when there are 
> fewer data records than state records. This causes incorrect results in the 
> former case.






[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-44448:

Fix Version/s: (was: 4.0.0)

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0
>
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; it also needs to reset 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the 
> first row in the new partition will try to increment rank, but increment it 
> by the value of count which is 0, so it happens to work by accident).
> Unfortunately, tests for the optimization only had a single row per rank, so 
> did not catch the bug as the bug requires multiple rows per rank.
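The missing per-partition reset can be seen in a tiny standalone sketch (plain Python, not Spark's actual Scala implementation; all names are simplified stand-ins): a dense-rank top-k filter over rows pre-sorted by (partition, order), with the reset that the report says was absent.

```python
# Illustrative sketch of a dense-rank top-k filter; NOT Spark's Scala code.
# rows must be sorted by partition key, then by ordering key.
def dense_rank_limit(rows, k):
    current_partition = None
    current_order = None  # stands in for currentRankRow
    rank = 0
    for p, o in rows:
        if p != current_partition:
            current_partition = p
            rank = 0
            current_order = None  # the reset the bug report says was missing
        if o != current_order:
            rank += 1
            current_order = o
        if rank <= k:
            yield (p, o, rank)

# Same data as the repro above; with the reset in place, both rank-1 rows of
# the second partition survive the filter.
rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 2)]
print(list(dense_rank_limit(rows, 1)))
# [(1, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 1)]
```

The sketch only illustrates which state must be cleared at each partition boundary; the exact wrong-result mechanics in Spark depend on how `DenseRankLimitIterator` and `RankLimitIterator` increment rank internally.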






[jira] [Resolved] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44448.
-
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42026
[https://github.com/apache/spark/pull/42026]

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; it also needs to reset 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the 
> first row in the new partition will try to increment rank, but increment it 
> by the value of count which is 0, so it happens to work by accident).
> Unfortunately, tests for the optimization only had a single row per rank, so 
> did not catch the bug as the bug requires multiple rows per rank.






[jira] [Assigned] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44448:
---

Assignee: Jack Chen

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Assignee: Jack Chen
>Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}: it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}; it also needs to reset 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the 
> first row in the new partition will try to increment rank, but increment it 
> by the value of count which is 0, so it happens to work by accident).
> Unfortunately, tests for the optimization only had a single row per rank, so 
> did not catch the bug as the bug requires multiple rows per rank.






[jira] [Resolved] (SPARK-44324) Move CaseInsensitiveMap to sql/api

2023-07-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44324.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41882
[https://github.com/apache/spark/pull/41882]

> Move CaseInsensitiveMap to sql/api
> --
>
> Key: SPARK-44324
> URL: https://issues.apache.org/jira/browse/SPARK-44324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-44480) Add option for thread pool to perform maintenance for RocksDB/HDFS State Store Providers

2023-07-18 Thread Eric Marnadi (Jira)
Eric Marnadi created SPARK-44480:


 Summary: Add option for thread pool to perform maintenance for 
RocksDB/HDFS State Store Providers
 Key: SPARK-44480
 URL: https://issues.apache.org/jira/browse/SPARK-44480
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Eric Marnadi


Maintenance tasks on StateStore were previously performed by a single background 
thread, which is prone to straggling. With this change, the background thread 
instead schedules maintenance tasks onto a thread pool.
Introduce the 
{{spark.sql.streaming.stateStore.enableStateStoreMaintenanceThreadPool}} config 
so that users can manually enable a thread pool for maintenance, and the 
{{spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads}} config so 
that the thread pool size is configurable.
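Assuming the two config keys named above land as described, enabling the pool could look like the following (a hypothetical configuration sketch; the pool size is an arbitrary example, not a recommendation):

```python
# Hypothetical configuration sketch: the config keys are copied from the
# ticket description; a pool size of 4 is only an example value.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.streaming.stateStore.enableStateStoreMaintenanceThreadPool", "true")
    .config("spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads", "4")
    .getOrCreate()
)
```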






[jira] [Resolved] (SPARK-43755) Spark Connect - decouple query execution from RPC handler

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43755.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42060
[https://github.com/apache/spark/pull/42060]

> Spark Connect - decouple query execution from RPC handler
> -
>
> Key: SPARK-43755
> URL: https://issues.apache.org/jira/browse/SPARK-43755
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Move actual query execution out of the RPC handler callback. This allows:
>  * (immediately) better control over query cancellation, by interrupting the 
> execution thread.
>  * design changes to the RPC interface to allow different execution models 
> than stream-push from server.






[jira] [Assigned] (SPARK-43755) Spark Connect - decouple query execution from RPC handler

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43755:


Assignee: Juliusz Sompolski

> Spark Connect - decouple query execution from RPC handler
> -
>
> Key: SPARK-43755
> URL: https://issues.apache.org/jira/browse/SPARK-43755
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>
> Move actual query execution out of the RPC handler callback. This allows:
>  * (immediately) better control over query cancellation, by interrupting the 
> execution thread.
>  * design changes to the RPC interface to allow different execution models 
> than stream-push from server.






[jira] [Assigned] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44476:


Assignee: Venkata Sai Akhil Gudesa

> JobArtifactSet is populated with all artifacts if it is not associated with 
> an artifact
> ---
>
> Key: SPARK-44476
> URL: https://issues.apache.org/jira/browse/SPARK-44476
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Consider each artifact type - files/jars/archives. For each artifact type, 
> the following bug exists:
>  # Initialise a `JobArtifactState` with no artifacts added to it.
>  # Create a `JobArtifactSet` from the `JobArtifactState`.
>  # Add an artifact with the same active `JobArtifactState`.
>  # Create another `JobArtifactSet`.
> In the current behaviour, the set created in step 2 contains all the 
> artifacts (through `sc.allAddedFiles`, for example), while the set created in 
> step 4 contains only the single artifact added in step 3.
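The inconsistency can be mimicked with a generic sketch (plain Python; every name below is invented for illustration and does not reflect Spark's actual Scala internals): a snapshot that falls back to a global registry whenever its per-state set is empty captures everything, while a later snapshot sees only its own artifact.

```python
# Generic illustration of the reported snapshot inconsistency; all names are
# invented for this sketch and are not Spark's real APIs.
all_added_files = {"global.jar"}  # stands in for globally-added artifacts


class JobArtifactState:
    def __init__(self):
        self.files = set()  # artifacts associated with this state


def snapshot(state):
    # Buggy fallback: an empty per-state set captures ALL global artifacts.
    return set(state.files) if state.files else set(all_added_files)


state = JobArtifactState()
first = snapshot(state)       # snapshot before any artifact is added
state.files.add("mine.py")    # add one artifact under the same state
second = snapshot(state)      # snapshot after the addition

print(first, second)
# {'global.jar'} {'mine.py'}  -- the two snapshots disagree about scope
```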






[jira] [Resolved] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44476.
--
Resolution: Fixed

Issue resolved by pull request 42062
[https://github.com/apache/spark/pull/42062]

> JobArtifactSet is populated with all artifacts if it is not associated with 
> an artifact
> ---
>
> Key: SPARK-44476
> URL: https://issues.apache.org/jira/browse/SPARK-44476
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Consider each artifact type - files/jars/archives. For each artifact type, 
> the following bug exists:
>  # Initialise a `JobArtifactState` with no artifacts added to it.
>  # Create a `JobArtifactSet` from the `JobArtifactState`.
>  # Add an artifact with the same active `JobArtifactState`.
>  # Create another `JobArtifactSet`.
> In the current behaviour, the set created in step 2 contains all the 
> artifacts (through `sc.allAddedFiles`, for example), while the set created in 
> step 4 contains only the single artifact added in step 3.






[jira] [Resolved] (SPARK-42944) Support Python foreachBatch() in streaming spark connect

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42944.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42035
[https://github.com/apache/spark/pull/42035]

> Support Python foreachBatch() in streaming spark connect
> 
>
> Key: SPARK-42944
> URL: https://issues.apache.org/jira/browse/SPARK-42944
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Add support for foreachBatch() in streaming Spark Connect. This may require a 
> deep dive into the complexities of running arbitrary Spark code inside a 
> foreachBatch block.






[jira] [Assigned] (SPARK-42944) Support Python foreachBatch() in streaming spark connect

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42944:


Assignee: Raghu Angadi

> Support Python foreachBatch() in streaming spark connect
> 
>
> Key: SPARK-42944
> URL: https://issues.apache.org/jira/browse/SPARK-42944
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
>
> Add support for foreachBatch() in streaming Spark Connect. This may require a 
> deep dive into the complexities of running arbitrary Spark code inside a 
> foreachBatch block.






[jira] [Commented] (SPARK-36392) pandas fixed width file support

2023-07-18 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744375#comment-17744375
 ] 

Haejoon Lee commented on SPARK-36392:
-

No update here yet. [~gsdionis], are you still interested in working on this 
ticket? If there is no response by this weekend, I will pick it up.

> pandas fixed width file support
> ---
>
> Key: SPARK-36392
> URL: https://issues.apache.org/jira/browse/SPARK-36392
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: John Ayoub
>Priority: Minor
>
> please add support for the fixed width api in pandas to koalas. 
> [reference|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html]






[jira] [Commented] (SPARK-44464) Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value

2023-07-18 Thread Siying Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744369#comment-17744369
 ] 

Siying Dong commented on SPARK-44464:
-

PR created: [https://github.com/apache/spark/pull/42046] CC [~kabhwan] 

> Fix applyInPandasWithStatePythonRunner to output rows that have Null as first 
> column value
> --
>
> Key: SPARK-44464
> URL: https://issues.apache.org/jira/browse/SPARK-44464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.3
>Reporter: Siying Dong
>Priority: Major
>
> The current implementation of {{ApplyInPandasWithStatePythonRunner}} cannot 
> handle output rows whose first column is {{{}null{}}}, because it cannot 
> distinguish a genuinely null column from a field left unset when there are 
> fewer data records than state records. This causes incorrect results in the 
> former case.






[jira] [Updated] (SPARK-44479) Support Python UDTFs with empty schema

2023-07-18 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-44479:
--
Description: 
Support UDTFs with empty schema, for example:

{code:python}
>>> class TestUDTF:
...   def eval(self):
... yield tuple()
{code}

Currently it fails with `useArrow=True`:

{code:python}
>>> udtf(TestUDTF, returnType=StructType())().collect()
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
{code}

whereas without Arrow:

{code:python}
>>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
[Row()]
{code}

Otherwise, we should raise an error without Arrow, too.


  was:
Support UDTFs with empty schema, for example:

{code:python}
>>> class TestUDTF:
...   def eval(self):
... yield tuple()
{code}

Currently it fails with `useArrow=True`:

{code:python}
>>> udtf(TestUDTF, returnType=StructType())().collect()
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
{code}

whereas without Arrow:

{code:python}
>>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
[Row()]
{code}



> Support Python UDTFs with empty schema
> --
>
> Key: SPARK-44479
> URL: https://issues.apache.org/jira/browse/SPARK-44479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Support UDTFs with empty schema, for example:
> {code:python}
> >>> class TestUDTF:
> ...   def eval(self):
> ... yield tuple()
> {code}
> Currently it fails with `useArrow=True`:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType())().collect()
> Traceback (most recent call last):
> ...
> ValueError: not enough values to unpack (expected 2, got 0)
> {code}
> whereas without Arrow:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
> [Row()]
> {code}
> Otherwise, we should raise an error without Arrow, too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44479) Support Python UDTFs with empty schema

2023-07-18 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-44479:
--
Description: 
Support UDTFs with empty schema, for example:

{code:python}
>>> class TestUDTF:
...   def eval(self):
... yield tuple()
{code}

Currently it fails with `useArrow=True`:

{code:python}
>>> udtf(TestUDTF, returnType=StructType())().collect()
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
{code}

whereas without Arrow:

{code:python}
>>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
[Row()]
{code}

Otherwise, we should raise an error without Arrow, too, to be consistent.


  was:
Support UDTFs with empty schema, for example:

{code:python}
>>> class TestUDTF:
...   def eval(self):
... yield tuple()
{code}

Currently it fails with `useArrow=True`:

{code:python}
>>> udtf(TestUDTF, returnType=StructType())().collect()
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
{code}

whereas without Arrow:

{code:python}
>>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
[Row()]
{code}

Otherwise, we should raise an error without Arrow, too.



> Support Python UDTFs with empty schema
> --
>
> Key: SPARK-44479
> URL: https://issues.apache.org/jira/browse/SPARK-44479
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Support UDTFs with empty schema, for example:
> {code:python}
> >>> class TestUDTF:
> ...   def eval(self):
> ... yield tuple()
> {code}
> Currently it fails with `useArrow=True`:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType())().collect()
> Traceback (most recent call last):
> ...
> ValueError: not enough values to unpack (expected 2, got 0)
> {code}
> whereas without Arrow:
> {code:python}
> >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
> [Row()]
> {code}
> Otherwise, we should raise an error without Arrow, too, to be consistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44479) Support Python UDTFs with empty schema

2023-07-18 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-44479:
-

 Summary: Support Python UDTFs with empty schema
 Key: SPARK-44479
 URL: https://issues.apache.org/jira/browse/SPARK-44479
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin


Support UDTFs with empty schema, for example:

{code:python}
>>> class TestUDTF:
...   def eval(self):
... yield tuple()
{code}

Currently it fails with `useArrow=True`:

{code:python}
>>> udtf(TestUDTF, returnType=StructType())().collect()
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
{code}

whereas without Arrow:

{code:python}
>>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect()
[Row()]
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40296) Error Class for DISTINCT function not found

2023-07-18 Thread Ritika Maheshwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744346#comment-17744346
 ] 

Ritika Maheshwari commented on SPARK-40296:
---

Isn't dropDuplicates taking care of applying distinct to multiple columns?

> Error Class for DISTINCT function not found
> ---
>
> Key: SPARK-40296
> URL: https://issues.apache.org/jira/browse/SPARK-40296
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44478) Executor decommission causes stage failure

2023-07-18 Thread Dale Huettenmoser (Jira)
Dale Huettenmoser created SPARK-44478:
-

 Summary: Executor decommission causes stage failure
 Key: SPARK-44478
 URL: https://issues.apache.org/jira/browse/SPARK-44478
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.4.1, 3.4.0
Reporter: Dale Huettenmoser


During Spark execution, a save fails due to executor decommissioning. This 
issue is not present in 3.3.0.

Sample error:

 
{code:java}
An error occurred while calling o8948.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Authorized 
committer (attemptNumber=0, stage=170, partition=233) failed; but task commit 
success, data duplication may happen. 
reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is 
decommissioned.))
    at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
    at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199)
    at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199)
    at scala.Option.foreach(Option.scala:407)
    at 
org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
    at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
    at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
    at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
    at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at 

[jira] [Commented] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744314#comment-17744314
 ] 

Bruce Robbins commented on SPARK-44477:
---

PR here: https://github.com/apache/spark/pull/42064

> CheckAnalysis uses error subclass as an error class
> ---
>
> Key: SPARK-44477
> URL: https://issues.apache.org/jira/browse/SPARK-44477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
> but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
> {noformat}
> spark-sql (default)> select bitmap_count(12);
> [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error 
> class 'TYPE_CHECK_FAILURE_WITH_HINT'
> at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
> at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
> at 
> org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
> at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
> at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
> {noformat}
> This issue only occurs when an expression uses 
> {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
> {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
> {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
> added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
> {{{}BitmapOrAgg{}}}.
> {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
> {{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
> {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
> corrected (or removed).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44477) CheckAnalysis uses error subclass as an error class

2023-07-18 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-44477:
-

 Summary: CheckAnalysis uses error subclass as an error class
 Key: SPARK-44477
 URL: https://issues.apache.org/jira/browse/SPARK-44477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


{{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, 
but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}.
{noformat}
spark-sql (default)> select bitmap_count(12);
[INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT'
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error class 
'TYPE_CHECK_FAILURE_WITH_HINT'
at org.apache.spark.SparkException$.internalError(SparkException.scala:83)
at org.apache.spark.SparkException$.internalError(SparkException.scala:87)
at 
org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68)
at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594)
at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589)
at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73)
{noformat}
This issue only occurs when an expression uses 
{{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. 
{{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of 
{{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were 
added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and 
{{{}BitmapOrAgg{}}}.

{{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use 
{{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in 
{{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be 
corrected (or removed).
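
A minimal sketch of the two-level lookup that makes this fail (the registry 
contents here are invented for illustration; Spark's real registry is loaded 
from an error-classes JSON file): the subclass name is only reachable through 
its parent, so passing it as a main class raises.

```python
# Hypothetical registry shaped like Spark's error-classes JSON.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH": {
        "message": "Cannot resolve <sqlExpr> due to data type mismatch.",
        "subClass": {
            "TYPE_CHECK_FAILURE_WITH_HINT": {"message": "<msg><hint>"},
        },
    },
}

def get_message_template(error_class):
    """Resolve an error class, optionally qualified as MAIN.SUBCLASS."""
    main, _, sub = error_class.partition(".")
    entry = ERROR_CLASSES.get(main)
    if entry is None:
        raise KeyError(f"Cannot find main error class '{main}'")
    return entry["subClass"][sub]["message"] if sub else entry["message"]

# The qualified name resolves; the bare subclass name does not.
ok = get_message_template("DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT")
```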



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36392) pandas fixed width file support

2023-07-18 Thread John Ayoub (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744297#comment-17744297
 ] 

John Ayoub commented on SPARK-36392:


[~itholic] Hello, any update on this ticket?

> pandas fixed width file support
> ---
>
> Key: SPARK-36392
> URL: https://issues.apache.org/jira/browse/SPARK-36392
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: John Ayoub
>Priority: Minor
>
> please add support for the fixed width api in pandas to koalas. 
> [reference|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44465) Upgrade zstd-jni to 1.5.5-5

2023-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44465.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42047
[https://github.com/apache/spark/pull/42047]

> Upgrade zstd-jni to 1.5.5-5
> ---
>
> Key: SPARK-44465
> URL: https://issues.apache.org/jira/browse/SPARK-44465
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44465) Upgrade zstd-jni to 1.5.5-5

2023-07-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44465:
-

Assignee: BingKun Pan

> Upgrade zstd-jni to 1.5.5-5
> ---
>
> Key: SPARK-44465
> URL: https://issues.apache.org/jira/browse/SPARK-44465
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact

2023-07-18 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-44476:


 Summary: JobArtifactSet is populated with all artifacts if it is 
not associated with an artifact
 Key: SPARK-44476
 URL: https://issues.apache.org/jira/browse/SPARK-44476
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0, 4.0.0
Reporter: Venkata Sai Akhil Gudesa
 Fix For: 3.5.0, 4.0.0


Consider each artifact type - files/jars/archives. For each artifact type, the 
following bug exists:
 # Initialise a `JobArtifactState` with no artifacts added to it.
 # Create a  `JobArtifactSet` from the `JobArtifactState`.
 # Add an artifact with the same active `JobArtifactState`.
 # Create another `JobArtifactSet`

In the current behaviour, the set created in step 2 contains all the artifacts 
(through `sc.allAddedFiles`, for example), while the set created in step 4 
contains only the single artifact added in step 3.
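
The intended isolation can be modeled in plain Python (hypothetical names; the 
real types are Scala's `JobArtifactState`/`JobArtifactSet`): a snapshot of an 
empty state should be empty rather than falling back to every globally-added 
artifact, as reported above.

```python
class JobArtifactState:
    """Hypothetical model of a per-job artifact state."""
    def __init__(self):
        self.artifacts = set()

def snapshot(state, global_artifacts):
    """Create a JobArtifactSet-like snapshot for a job.

    Intended behaviour: the snapshot only ever contains the state's own
    artifacts, even when that set is empty; it never falls back to the
    globally-added artifacts.
    """
    return set(state.artifacts)

state = JobArtifactState()
empty_snap = snapshot(state, global_artifacts={"global.jar"})  # step 2
state.artifacts.add("session.jar")                             # step 3
later_snap = snapshot(state, global_artifacts={"global.jar"})  # step 4
```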



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator

2023-07-18 Thread Jack Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-44448:
--
Affects Version/s: 3.5.0
   (was: 3.4.0)

> Wrong results for dense_rank() <= k from InferWindowGroupLimit and 
> DenseRankLimitIterator
> -
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jack Chen
>Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to 
> a bug in optimization InferWindowGroupLimit, specifically in the code for 
> DenseRankLimitIterator, introduced in 
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, 
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk 
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>  
> The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state 
> properly when transitioning from one window partition to the next. {{reset}} 
> only resets {{{}rank = 0{}}}, what it is missing is to reset 
> {{{}currentRankRow = null{}}}. This means that when processing the second and 
> later window partitions, the rank incorrectly gets incremented based on 
> comparing the ordering of the last row of the previous partition to the first 
> row of the new partition.
> This means that a dense_rank window func that has more than one window 
> partition and more than one row with dense_rank = 1 in the second or later 
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the 
> first row in the new partition will try to increment rank, but increment it 
> by the value of count which is 0, so it happens to work by accident).
> Unfortunately, tests for the optimization only had a single row per rank, so 
> did not catch the bug as the bug requires multiple rows per rank.
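
The fix described above can be sketched in plain Python (hypothetical function; 
the real code is Spark's {{DenseRankLimitIterator}} operating on InternalRows): 
both the rank counter and the last-seen ordering value must be reset per 
partition.

```python
def dense_rank_limit(partitions, k):
    """Yield (partition_key, order_value, rank) for rows with dense_rank <= k.

    `partitions` is a list of (key, ordered_row_values) pairs. State is
    reset correctly between partitions: both `rank` and the last-seen
    ordering value are cleared, which is the reset the bug omits.
    """
    out = []
    for key, rows in partitions:
        rank = 0
        last_order = None  # resetting this is the step missing in the bug
        for order in rows:
            if order != last_order:
                rank += 1
                last_order = order
            if rank <= k:
                out.append((key, order, rank))
    return out

result = dense_rank_limit([(1, [1, 1, 2]), (2, [1, 1, 2])], k=1)
# All four rank-1 rows survive, matching the correct result above:
# [(1, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 1)]
```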



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44475) Relocate DataType and Parser to sql/api

2023-07-18 Thread Rui Wang (Jira)
Rui Wang created SPARK-44475:


 Summary: Relocate DataType and Parser to sql/api
 Key: SPARK-44475
 URL: https://issues.apache.org/jira/browse/SPARK-44475
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, SQL
Affects Versions: 3.5.0
Reporter: Rui Wang
Assignee: Rui Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT

2023-07-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-44467:
-
Fix Version/s: 4.0.0
   (was: 3.5.0)

> Setting master version to 4.0.0-SNAPSHOT
> 
>
> Key: SPARK-44467
> URL: https://issues.apache.org/jira/browse/SPARK-44467
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT

2023-07-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44467.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 42048
[https://github.com/apache/spark/pull/42048]

> Setting master version to 4.0.0-SNAPSHOT
> 
>
> Key: SPARK-44467
> URL: https://issues.apache.org/jira/browse/SPARK-44467
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT

2023-07-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44467:


Assignee: Yang Jie

> Setting master version to 4.0.0-SNAPSHOT
> 
>
> Key: SPARK-44467
> URL: https://issues.apache.org/jira/browse/SPARK-44467
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.

2023-07-18 Thread lvkaihua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744133#comment-17744133
 ] 

lvkaihua commented on SPARK-42972:
--

I also encountered this issue and verified that the modification is correct.

> ExecutorAllocationManager cannot allocate new instances when all executors 
> down.
> 
>
> Key: SPARK-42972
> URL: https://issues.apache.org/jira/browse/SPARK-42972
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Jiandan Yang 
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44396) Add direct Arrow deserialization

2023-07-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744127#comment-17744127
 ] 

ASF GitHub Bot commented on SPARK-44396:


User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/42011

> Add direct Arrow deserialization
> 
>
> Key: SPARK-44396
> URL: https://issues.apache.org/jira/browse/SPARK-44396
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44472) change the external catalog thread safety way

2023-07-18 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-44472:

Attachment: add_hive_concurrent_connections.diff

> change the external catalog thread safety way
> -
>
> Key: SPARK-44472
> URL: https://issues.apache.org/jira/browse/SPARK-44472
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: add_hive_concurrent_connections.diff
>
>
> We tested changing the synchronization of the external catalog to use 
> thread-local state instead of synchronized methods.
> In our tests, this improved the runtime of parallel actions by about 45% for 
> certain workloads (time reduced from ~15 min to ~9 min).
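
The thread-local approach can be sketched in plain Python (hypothetical names; 
the factory stands in for opening a Hive metastore connection): each worker 
thread lazily gets its own client, so calls need no shared lock.

```python
import threading

class PerThreadClient:
    """Lazily create one client per thread instead of lock-guarding one.

    A pure-Python sketch of the thread-local approach described above;
    `factory` is whatever creates the underlying catalog client.
    """
    def __init__(self, factory):
        self._local = threading.local()
        self._factory = factory

    def get(self):
        if not hasattr(self._local, "client"):
            self._local.client = self._factory()
        return self._local.client

clients = PerThreadClient(factory=object)
seen = {}

def worker(name):
    # Two calls on the same thread reuse one client; threads never share.
    seen[name] = (clients.get(), clients.get())

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```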



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44474:
-
Summary: Reenable "Test observe response" at SparkConnectServiceSuite  
(was: Reenable Test observe response at SparkConnectServiceSuite)

> Reenable "Test observe response" at SparkConnectServiceSuite
> 
>
> Key: SPARK-44474
> URL: https://issues.apache.org/jira/browse/SPARK-44474
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> [https://github.com/apache/spark/pull/41443] apparently made the test flaky 
> (or failed). We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44474:
-
Affects Version/s: 3.5.0
   (was: 4.0.0)

> Reenable Test observe response at SparkConnectServiceSuite
> --
>
> Key: SPARK-44474
> URL: https://issues.apache.org/jira/browse/SPARK-44474
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> [https://github.com/apache/spark/pull/41443] apparently made the test flaky 
> (or failed). We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44474:
-
Priority: Blocker  (was: Major)

> Reenable Test observe response at SparkConnectServiceSuite
> --
>
> Key: SPARK-44474
> URL: https://issues.apache.org/jira/browse/SPARK-44474
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> [https://github.com/apache/spark/pull/41443] apparently made the test flaky 
> (or failed). We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite

2023-07-18 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44474:


 Summary: Reenable Test observe response at SparkConnectServiceSuite
 Key: SPARK-44474
 URL: https://issues.apache.org/jira/browse/SPARK-44474
 Project: Spark
  Issue Type: Task
  Components: Connect
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


[https://github.com/apache/spark/pull/41443] apparently made the test flaky (or 
failed). We should reenable it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44468) Add daily test GA task for branch3.5

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44468.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42050
[https://github.com/apache/spark/pull/42050]

> Add daily test GA task for branch3.5
> 
>
> Key: SPARK-44468
> URL: https://issues.apache.org/jira/browse/SPARK-44468
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-44468) Add daily test GA task for branch3.5

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44468:


Assignee: BingKun Pan

> Add daily test GA task for branch3.5
> 
>
> Key: SPARK-44468
> URL: https://issues.apache.org/jira/browse/SPARK-44468
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>







[jira] [Updated] (SPARK-44471) Change branches in build_and_test.yml for master branch

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-44471:
-
Summary: Change branches in build_and_test.yml for master branch  (was: Add 
Github action test job for branch-3.5)

> Change branches in build_and_test.yml for master branch
> ---
>
> Key: SPARK-44471
> URL: https://issues.apache.org/jira/browse/SPARK-44471
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Resolved] (SPARK-44471) Add Github action test job for branch-3.5

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44471.
--
Resolution: Fixed

Issue resolved by pull request 42057
[https://github.com/apache/spark/pull/42057]

> Add Github action test job for branch-3.5
> -
>
> Key: SPARK-44471
> URL: https://issues.apache.org/jira/browse/SPARK-44471
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.5.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.

2023-07-18 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744092#comment-17744092
 ] 

liang yu commented on SPARK-42972:
--

[~tdas] I created PR [42058|https://github.com/apache/spark/pull/42058] on GitHub; could you please help review it?

> ExecutorAllocationManager cannot allocate new instances when all executors 
> down.
> 
>
> Key: SPARK-42972
> URL: https://issues.apache.org/jira/browse/SPARK-42972
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Jiandan Yang 
>Priority: Major
>







[jira] [Created] (SPARK-44473) Overwriting the same partition of a partitioned table multiple times with empty data yields non-idempotent results

2023-07-18 Thread chris Yu (Jira)
chris Yu created SPARK-44473:


 Summary: Overwriting the same partition of a partitioned table 
multiple times with empty data yields non-idempotent results
 Key: SPARK-44473
 URL: https://issues.apache.org/jira/browse/SPARK-44473
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.3.2, 3.2.4, 3.1.3
 Environment: Spark 3.x
Reporter: chris Yu


 

Preparation:
Create a simple partitioned table with Spark 3.x, for example:

 
{code:java}
spark-sql> create table test1 (a int) partitioned by (dt string);
Time taken: 0.219 seconds{code}
 

 
 * Overwrite a new partition with empty data; the partition metadata and the corresponding HDFS path are created, for example:

{code:java}

spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 
1 <> 1;
Time taken: 0.992 seconds
spark-sql> dfs -ls /user/hive/warehouse/test1;
Found 2 items
-rw-r--r-- 2 hadoop hadoop 0 2023-07-18 14:41 
/user/hive/warehouse/test1/_SUCCESS
drwxrwxrwx- hadoop hadoop 0 2023-07-18 14:41 
/user/hive/warehouse/test1/dt=20230702
spark-sql> show partitions test1;
dt=20230702
Time taken: 0.162 seconds, Fetched 1 row(s)
{code}
 * When the insert overwrite statement is re-run, the HDFS path corresponding to this partition no longer exists, although the partition is still listed:

 
{code:java}
spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 
1 <> 1;
Time taken: 0.706 seconds
spark-sql> dfs -ls /user/hive/warehouse/test1;
Found 1 items
-rw-r--r--   2 hadoop hadoop          0 2023-07-18 14:45 
/user/hive/warehouse/test1/_SUCCESS
spark-sql> show partitions test1;
dt=20230702
Time taken: 0.183 seconds, Fetched 1 row(s){code}
Subsequent tasks that depend on this HDFS path then fail with a path-does-not-exist exception, which caused us trouble.

I expect that executing the same statement multiple times yields the same result, i.e. the operation should be {*}idempotent{*}. Thanks.
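The expected idempotent behavior can be illustrated with a small, self-contained sketch. This is plain Python standing in for the table's partition directory on HDFS; the `overwrite_partition` helper is hypothetical and not Spark code, it only shows the invariant the report asks for (repeated overwrites with empty data converge to the same on-disk state):

```python
import shutil
import tempfile
from pathlib import Path


def overwrite_partition(table_dir: Path, partition: str, rows: list) -> None:
    """Idempotent overwrite: always recreate the partition directory,
    even when the new data is empty, so repeated runs end in the same state."""
    part_dir = table_dir / partition
    if part_dir.exists():
        shutil.rmtree(part_dir)       # drop the old contents
    part_dir.mkdir(parents=True)      # recreate the path unconditionally
    for i, row in enumerate(rows):
        (part_dir / f"part-{i}").write_text(str(row))


table = Path(tempfile.mkdtemp())

# Overwriting twice with empty data leaves the same state both times.
overwrite_partition(table, "dt=20230702", [])
first = (table / "dt=20230702").exists()
overwrite_partition(table, "dt=20230702", [])
second = (table / "dt=20230702").exists()
print(first, second)  # True True: the partition path survives re-runs
```

Under these semantics, the second `insert overwrite` in the report would leave `/user/hive/warehouse/test1/dt=20230702` in place, matching the metastore's `show partitions` output.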






[jira] [Resolved] (SPARK-44451) Make built document downloadable

2023-07-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44451.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42028
[https://github.com/apache/spark/pull/42028]

> Make built document downloadable
> 
>
> Key: SPARK-44451
> URL: https://issues.apache.org/jira/browse/SPARK-44451
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-44451) Make built document downloadable

2023-07-18 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44451:
-

Assignee: Ruifeng Zheng

> Make built document downloadable
> 
>
> Key: SPARK-44451
> URL: https://issues.apache.org/jira/browse/SPARK-44451
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Created] (SPARK-44472) change the external catalog thread safety way

2023-07-18 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-44472:
---

 Summary: change the external catalog thread safety way
 Key: SPARK-44472
 URL: https://issues.apache.org/jira/browse/SPARK-44472
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Izek Greenfield


We tested changing the synchronization of the external catalog to use thread-locals instead of synchronized methods.

In our tests, this improved the runtime of parallel actions by about 45% for certain workloads (time reduced from ~15 min to ~9 min).
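A minimal sketch of the trade-off, in plain Python rather than the actual Spark catalog code: the first class serializes every lookup on one lock, while the second gives each thread its own lazily built view so steady-state reads take no shared lock. Both class names and the tiny catalog are illustrative assumptions, not Spark APIs.

```python
import threading


class SynchronizedCatalog:
    """Baseline: every lookup contends on a single lock."""
    def __init__(self, tables):
        self._tables = dict(tables)
        self._lock = threading.Lock()

    def get_table(self, name):
        with self._lock:  # all threads serialize here
            return self._tables[name]


class ThreadLocalCatalog:
    """Sketch of the proposal: each thread keeps its own snapshot,
    so lookups after the first touch take no shared lock."""
    def __init__(self, tables):
        self._tables = dict(tables)
        self._local = threading.local()

    def get_table(self, name):
        cache = getattr(self._local, "cache", None)
        if cache is None:            # first access on this thread
            cache = dict(self._tables)
            self._local.cache = cache
        return cache[name]


tables = {"t1": "schema1"}
results = []

def worker(cat):
    results.append(cat.get_table("t1"))

threads = [threading.Thread(target=worker, args=(cat,))
           for cat in (SynchronizedCatalog(tables), ThreadLocalCatalog(tables))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # eight identical lookups, no lock contention in the second case
```

The obvious cost of the thread-local variant is staleness: a catalog update made on one thread is not visible to another thread's snapshot, so an invalidation scheme would be needed for mutable state.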






[jira] [Created] (SPARK-44471) Add Github action test job for branch-3.5

2023-07-18 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-44471:
---

 Summary: Add Github action test job for branch-3.5
 Key: SPARK-44471
 URL: https://issues.apache.org/jira/browse/SPARK-44471
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.5.0
Reporter: Yuanjian Li
Assignee: Yuanjian Li
 Fix For: 3.5.0









[jira] [Resolved] (SPARK-43967) Support Python UDTFs with empty return values

2023-07-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43967.
--
Fix Version/s: 3.5.0
 Assignee: Allison Wang
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/42044

> Support Python UDTFs with empty return values
> -
>
> Key: SPARK-43967
> URL: https://issues.apache.org/jira/browse/SPARK-43967
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.5.0
>
>
> Support UDTFs with empty returns, for example:
> {code:java}
> @udtf(returnType="a: int")
> class TestUDTF:
> def eval(self, a: int):
> ... {code}
> Currently, this will fail with the exception 
> {code:java}
> TypeError: 'NoneType' object is not iterable {code}
> Another example
> {code:java}
> class TestUDTF:
> def eval(self, a: int):
> yield {code}
> This will fail with the exception
> {code:java}
> java.lang.NullPointerException {code}
> Note, arrow-optimized UDTFs already support this. This error only occurs with 
> regular Python UDTFs.
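The intended semantics can be sketched outside Spark: treating a `None` return from `eval` the same as an empty iterator avoids the `'NoneType' object is not iterable` failure. The `consume_udtf` wrapper below is purely illustrative and is not PySpark's actual execution path:

```python
def consume_udtf(udtf_cls, *args):
    """Illustrative wrapper: normalize eval()'s return value so that
    None (a body with no return, such as `...`) acts as an empty result."""
    rows = udtf_cls().eval(*args)
    return list(rows if rows is not None else ())


class TestUDTF:
    def eval(self, a: int):
        ...  # no explicit return: Python returns None


class RangeUDTF:
    def eval(self, a: int):
        for i in range(a):
            yield (i,)


print(consume_udtf(TestUDTF, 3))   # [] instead of a TypeError
print(consume_udtf(RangeUDTF, 2))  # [(0,), (1,)]
```

This mirrors the behavior the ticket describes for arrow-optimized UDTFs, where an empty return already produces zero rows.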


