[jira] [Assigned] (SPARK-44361) Use PartitionEvaluator API in MapInBatchExec
[ https://issues.apache.org/jira/browse/SPARK-44361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44361: --- Assignee: Vinod KC > Use PartitionEvaluator API in MapInBatchExec > - > > Key: SPARK-44361 > URL: https://issues.apache.org/jira/browse/SPARK-44361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > `MapInBatchExec` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44361) Use PartitionEvaluator API in MapInBatchExec
[ https://issues.apache.org/jira/browse/SPARK-44361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44361. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42024 [https://github.com/apache/spark/pull/42024] > Use PartitionEvaluator API in MapInBatchExec > - > > Key: SPARK-44361 > URL: https://issues.apache.org/jira/browse/SPARK-44361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Major > Fix For: 3.5.0 > > > Use PartitionEvaluator API in > `MapInBatchExec` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
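Several of these sub-tasks migrate physical operators to the PartitionEvaluator API. As a rough, Spark-free illustration of the factory/evaluator split the API is built around (all class and method names below are illustrative Python stand-ins, not Spark's actual Scala interfaces):

```python
# A minimal sketch of the evaluator-factory pattern: the driver ships a small
# factory, and each task asks it for an evaluator for its own partition,
# instead of capturing a closure directly in mapPartitions.

from typing import Callable, Iterable, Iterator, List


class PartitionEvaluator:
    """Evaluates one partition; created on the executor, not the driver."""

    def __init__(self, fn: Callable[[Iterable[int]], Iterable[int]]) -> None:
        self.fn = fn

    def eval(self, partition_index: int, rows: Iterable[int]) -> Iterator[int]:
        yield from self.fn(rows)


class PartitionEvaluatorFactory:
    """The only object serialized to executors; keeps closures out of tasks."""

    def __init__(self, fn: Callable[[Iterable[int]], Iterable[int]]) -> None:
        self.fn = fn

    def create_evaluator(self) -> PartitionEvaluator:
        return PartitionEvaluator(self.fn)


def run_partitions(partitions: List[List[int]],
                   factory: PartitionEvaluatorFactory) -> List[List[int]]:
    # Each "task" builds its own evaluator from the shared factory.
    return [list(factory.create_evaluator().eval(i, part))
            for i, part in enumerate(partitions)]


if __name__ == "__main__":
    factory = PartitionEvaluatorFactory(lambda rows: [r * 2 for r in rows])
    print(run_partitions([[1, 2], [3]], factory))  # [[2, 4], [6]]
```

The point of the split is that the factory is cheap to serialize once, while per-partition state lives in the evaluator created on the executor side.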
[jira] [Resolved] (SPARK-44411) Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec
[ https://issues.apache.org/jira/browse/SPARK-44411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44411. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41998 [https://github.com/apache/spark/pull/41998] > Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec > -- > > Key: SPARK-44411 > URL: https://issues.apache.org/jira/browse/SPARK-44411 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Major > Fix For: 4.0.0 > > > Use PartitionEvaluator API in > `ArrowEvalPythonExec` > `BatchEvalPythonExec` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44411) Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec
[ https://issues.apache.org/jira/browse/SPARK-44411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44411: --- Assignee: Vinod KC > Use PartitionEvaluator API in ArrowEvalPythonExec, BatchEvalPythonExec > -- > > Key: SPARK-44411 > URL: https://issues.apache.org/jira/browse/SPARK-44411 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > `ArrowEvalPythonExec` > `BatchEvalPythonExec` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44375) Use PartitionEvaluator API in DebugExec
[ https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44375: --- Assignee: Jia Fan > Use PartitionEvaluator API in DebugExec > --- > > Key: SPARK-44375 > URL: https://issues.apache.org/jira/browse/SPARK-44375 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jia Fan >Assignee: Jia Fan >Priority: Major > > Use PartitionEvaluator API in DebugExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44375) Use PartitionEvaluator API in DebugExec
[ https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44375. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41949 [https://github.com/apache/spark/pull/41949] > Use PartitionEvaluator API in DebugExec > --- > > Key: SPARK-44375 > URL: https://issues.apache.org/jira/browse/SPARK-44375 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jia Fan >Assignee: Jia Fan >Priority: Major > Fix For: 4.0.0 > > > Use PartitionEvaluator API in DebugExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44474. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42063 [https://github.com/apache/spark/pull/42063] > Reenable "Test observe response" at SparkConnectServiceSuite > > > Key: SPARK-44474 > URL: https://issues.apache.org/jira/browse/SPARK-44474 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.5.0, 4.0.0 > > > [https://github.com/apache/spark/pull/41443] apparently made the test flaky > (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44474: Assignee: Hyukjin Kwon > Reenable "Test observe response" at SparkConnectServiceSuite > > > Key: SPARK-44474 > URL: https://issues.apache.org/jira/browse/SPARK-44474 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > > [https://github.com/apache/spark/pull/41443] apparently made the test flaky > (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44264) DeepSpeed Distributor
[ https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744410#comment-17744410 ] Hudson commented on SPARK-44264: User 'mathewjacob1002' has created a pull request for this issue: https://github.com/apache/spark/pull/42067 > DeepSpeed Distributor > - > > Key: SPARK-44264 > URL: https://issues.apache.org/jira/browse/SPARK-44264 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.4.1 >Reporter: Lu Wang >Priority: Critical > Fix For: 3.5.0 > > Attachments: Trying to Run Deepspeed Funcs.html > > > To make it easier for PySpark users to run distributed training and inference > with DeepSpeed on Spark clusters using PySpark. This was a project determined > by the Databricks ML Training Team. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44264) DeepSpeed Distributor
[ https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rithwik Ediga Lakhamsani updated SPARK-44264: - Attachment: Trying to Run Deepspeed Funcs.html > DeepSpeed Distributor > - > > Key: SPARK-44264 > URL: https://issues.apache.org/jira/browse/SPARK-44264 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.4.1 >Reporter: Lu Wang >Priority: Critical > Fix For: 3.5.0 > > Attachments: Trying to Run Deepspeed Funcs.html > > > To make it easier for PySpark users to run distributed training and inference > with DeepSpeed on Spark clusters using PySpark. This was a project determined > by the Databricks ML Training Team. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44401) Arrow Python UDF Use Guide
[ https://issues.apache.org/jira/browse/SPARK-44401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44401: Assignee: Xinrong Meng > Arrow Python UDF Use Guide > -- > > Key: SPARK-44401 > URL: https://issues.apache.org/jira/browse/SPARK-44401 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44401) Arrow Python UDF Use Guide
[ https://issues.apache.org/jira/browse/SPARK-44401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44401. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 41974 [https://github.com/apache/spark/pull/41974] > Arrow Python UDF Use Guide > -- > > Key: SPARK-44401 > URL: https://issues.apache.org/jira/browse/SPARK-44401 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.5.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.5.0, 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44464) Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value
[ https://issues.apache.org/jira/browse/SPARK-44464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-44464. -- Fix Version/s: 3.5.0 Assignee: Siying Dong Resolution: Fixed Issue resolved via [https://github.com/apache/spark/pull/42046] > Fix applyInPandasWithStatePythonRunner to output rows that have Null as first > column value > -- > > Key: SPARK-44464 > URL: https://issues.apache.org/jira/browse/SPARK-44464 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.3 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Fix For: 3.5.0 > > > The current implementation of {{ApplyInPandasWithStatePythonRunner}} cannot > deal with outputs where the first column of the row is {{{}null{}}}, as it > cannot distinguish between a column that is genuinely null and a field left > as null padding because there are fewer data records than state records. This > causes incorrect results in the former case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
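The ambiguity behind this fix can be modeled without Spark: when data rows and state rows ride in the same batch, the shorter data side is padded with nulls, and a reader that only asks "is the first column null?" cannot tell padding from a genuine null. A hedged sketch (all names are invented; the real runner works on Arrow batches):

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], str]

# What a padding row looks like after the round-trip: all columns null.
PAD: Row = (None, None)


def pack(data: List[Row], state_len: int) -> Tuple[List[Row], int]:
    # Pad the data column out to the state column's length, and carry the
    # true data-row count alongside (the essence of the fix).
    padded = list(data) + [PAD] * (state_len - len(data))
    return padded, len(data)


def unpack_by_null_check(padded: List[Row]) -> List[Row]:
    # Buggy reading: "first column null" is assumed to mean padding,
    # so a genuine (None, ...) data row is silently dropped.
    return [r for r in padded if r[0] is not None]


def unpack_by_count(padded: List[Row], count: int) -> List[Row]:
    # Count-based reading keeps the row whose first column is really null.
    return padded[:count]


data: List[Row] = [(None, "x"), (1, "y")]   # first row legitimately starts with null
padded, count = pack(data, state_len=3)
```

Here `unpack_by_null_check(padded)` loses `(None, "x")`, while `unpack_by_count(padded, count)` returns both original rows.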
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-44448: Fix Version/s: (was: 4.0.0) > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0 > > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
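The reset bug described above can be modeled in a few lines without Spark. The sketch below is a simplified stand-in for the real iterator (names and exact rank bookkeeping are illustrative, and it does not reproduce Spark's exact wrong output), but it shows the mechanism: clearing the rank counter without also clearing the remembered ordering row makes the next partition's first rows inherit a stale comparison.

```python
class DenseRankGroupLimit:
    """Keeps rows whose dense_rank() <= limit within a partition."""

    def __init__(self, limit: int, fixed: bool) -> None:
        self.limit = limit
        self.fixed = fixed            # True models the corrected reset()
        self.rank = 0                 # rank 0 means dense_rank() == 1
        self.current_rank_row = None  # ordering value of the last ranked row

    def reset(self) -> None:
        self.rank = 0
        if self.fixed:
            self.current_rank_row = None  # the missing line from the bug report

    def accept(self, order_value) -> bool:
        if self.current_rank_row is None:
            self.current_rank_row = order_value
        elif order_value != self.current_rank_row:
            self.rank += 1
            self.current_rank_row = order_value
        return self.rank < self.limit


def group_limit(partitions, fixed: bool):
    it = DenseRankGroupLimit(limit=1, fixed=fixed)  # models dense_rank() <= 1
    kept = []
    for part in partitions:
        it.reset()
        kept.extend(v for v in part if it.accept(v))
    return kept


# Ordering values per partition, as in the repro: two partitions of o = 1, 1, 2.
partitions = [[1, 1, 2], [1, 1, 2]]
```

With `fixed=True` all four rank-1 rows survive; with `fixed=False` the second partition's first row is compared against the stale `current_rank_row` from partition one, gets a nonzero rank, and every rank-1 row in that partition is lost.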
[jira] [Resolved] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44448. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42026 [https://github.com/apache/spark/pull/42026] > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44448: --- Assignee: Jack Chen > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44324) Move CaseInsensitiveMap to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44324. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41882 [https://github.com/apache/spark/pull/41882] > Move CaseInsensitiveMap to sql/api > -- > > Key: SPARK-44324 > URL: https://issues.apache.org/jira/browse/SPARK-44324 > Project: Spark > Issue Type: Sub-task > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44480) Add option for thread pool to perform maintenance for RocksDB/HDFS State Store Providers
Eric Marnadi created SPARK-44480: Summary: Add option for thread pool to perform maintenance for RocksDB/HDFS State Store Providers Key: SPARK-44480 URL: https://issues.apache.org/jira/browse/SPARK-44480 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Eric Marnadi Maintenance tasks on StateStore were being done by a single background thread, which is prone to straggling. With this change, the single background thread instead schedules maintenance tasks onto a thread pool. Introduce the {{spark.sql.streaming.stateStore.enableStateStoreMaintenanceThreadPool}} config so that the user can enable a thread pool for maintenance manually, and the {{spark.sql.streaming.stateStore.numStateStoreMaintenanceThreads}} config so that the thread pool size is configurable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
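The scheduling change can be sketched in plain Python (the `Provider` class and `do_maintenance` hook are invented stand-ins; only the two config keys quoted above are Spark's):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class Provider:
    """Stand-in for a RocksDB/HDFS state store provider."""

    def __init__(self, name: str) -> None:
        self.name = name

    def do_maintenance(self) -> str:
        # e.g. snapshotting / cleanup of old state files
        return f"maintained:{self.name}"


def run_maintenance(providers: List[Provider], num_threads: int) -> Dict[str, str]:
    if num_threads <= 1:
        # Old behavior: one background thread visits providers sequentially,
        # so a single straggler delays maintenance for every store.
        return {p.name: p.do_maintenance() for p in providers}
    # New behavior: the background thread only *schedules*; a bounded pool
    # (size analogous to numStateStoreMaintenanceThreads) runs tasks concurrently.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = {p.name: pool.submit(p.do_maintenance) for p in providers}
        return {name: f.result() for name, f in futures.items()}
```

Both paths produce the same results; the pool variant just stops one slow provider from blocking the rest.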
[jira] [Resolved] (SPARK-43755) Spark Connect - decouple query execution from RPC handler
[ https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43755. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42060 [https://github.com/apache/spark/pull/42060] > Spark Connect - decouple query execution from RPC handler > - > > Key: SPARK-43755 > URL: https://issues.apache.org/jira/browse/SPARK-43755 > Project: Spark > Issue Type: Story > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Move actual query execution out of the RPC handler callback. This allows: > * (immediately) better control over query cancellation, by interrupting the > execution thread. > * design changes to the RPC interface to allow different execution models > than stream-push from server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43755) Spark Connect - decouple query execution from RPC handler
[ https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43755: Assignee: Juliusz Sompolski > Spark Connect - decouple query execution from RPC handler > - > > Key: SPARK-43755 > URL: https://issues.apache.org/jira/browse/SPARK-43755 > Project: Spark > Issue Type: Story > Components: Connect >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > > Move actual query execution out of the RPC handler callback. This allows: > * (immediately) better control over query cancellation, by interrupting the > execution thread. > * design changes to the RPC interface to allow different execution models > than stream-push from server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact
[ https://issues.apache.org/jira/browse/SPARK-44476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44476: Assignee: Venkata Sai Akhil Gudesa > JobArtifactSet is populated with all artifacts if it is not associated with > an artifact > --- > > Key: SPARK-44476 > URL: https://issues.apache.org/jira/browse/SPARK-44476 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Consider each artifact type - files/jars/archives. For each artifact type, > the following bug exists: > # Initialise a `JobArtifactState` with no artifacts added to it. > # Create a `JobArtifactSet` from the `JobArtifactState`. > # Add an artifact with the same active `JobArtifactState`. > # Create another `JobArtifactSet` > In the current behaviour, the set created in step 2 contains all the > artifacts (through `sc.allAddedFiles` for example), while the set created in > step 4 contains only the single artifact added in step 3. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
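The four steps can be replayed in a toy model (class and field names below are simplified Python stand-ins for the Scala originals; `global_files` plays the role of `sc.allAddedFiles`):

```python
import uuid
from typing import Dict, Set


class JobArtifactState:
    def __init__(self) -> None:
        self.uuid = str(uuid.uuid4())


# Artifacts tracked per state, plus the global pool of everything ever added.
per_state: Dict[str, Set[str]] = {}
global_files: Set[str] = {"legacy1.jar", "legacy2.jar"}


def add_artifact(state: JobArtifactState, name: str) -> None:
    per_state.setdefault(state.uuid, set()).add(name)
    global_files.add(name)


def job_artifact_set(state: JobArtifactState, fixed: bool) -> Set[str]:
    own = per_state.get(state.uuid, set())
    if not fixed and not own:
        # Buggy fallback: a state with no artifacts leaks every global artifact.
        return set(global_files)
    return set(own)


state = JobArtifactState()                      # step 1: state, no artifacts
set_2 = job_artifact_set(state, fixed=False)    # step 2: contains ALL artifacts (bug)
add_artifact(state, "mine.jar")                 # step 3
set_4 = job_artifact_set(state, fixed=False)    # step 4: only {"mine.jar"}
```

With `fixed=True`, the step-2 snapshot is empty, which is the consistent behavior the fix aims for.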
[jira] [Resolved] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact
[ https://issues.apache.org/jira/browse/SPARK-44476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44476. -- Resolution: Fixed Issue resolved by pull request 42062 [https://github.com/apache/spark/pull/42062] > JobArtifactSet is populated with all artifacts if it is not associated with > an artifact > --- > > Key: SPARK-44476 > URL: https://issues.apache.org/jira/browse/SPARK-44476 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Consider each artifact type - files/jars/archives. For each artifact type, > the following bug exists: > # Initialise a `JobArtifactState` with no artifacts added to it. > # Create a `JobArtifactSet` from the `JobArtifactState`. > # Add an artifact with the same active `JobArtifactState`. > # Create another `JobArtifactSet` > In the current behaviour, the set created in step 2 contains all the > artifacts (through `sc.allAddedFiles` for example), while the set created in > step 4 contains only the single artifact added in step 3. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42944) Support Python foreachBatch() in streaming spark connect
[ https://issues.apache.org/jira/browse/SPARK-42944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42944. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42035 [https://github.com/apache/spark/pull/42035] > Support Python foreachBatch() in streaming spark connect > > > Key: SPARK-42944 > URL: https://issues.apache.org/jira/browse/SPARK-42944 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Add support for foreachBatch() in streaming Spark Connect. This might need a > deep dive into the various complexities of running arbitrary Spark code > inside the foreachBatch block. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42944) Support Python foreachBatch() in streaming spark connect
[ https://issues.apache.org/jira/browse/SPARK-42944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42944: Assignee: Raghu Angadi > Support Python foreachBatch() in streaming spark connect > > > Key: SPARK-42944 > URL: https://issues.apache.org/jira/browse/SPARK-42944 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > > Add support for foreachBatch() in streaming Spark Connect. This might need a > deep dive into the various complexities of running arbitrary Spark code > inside the foreachBatch block. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36392) pandas fixed width file support
[ https://issues.apache.org/jira/browse/SPARK-36392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744375#comment-17744375 ] Haejoon Lee commented on SPARK-36392: - No update here yet. [~gsdionis], are you still interested in working on this ticket? I will just work on it if there is no response by this weekend. > pandas fixed width file support > --- > > Key: SPARK-36392 > URL: https://issues.apache.org/jira/browse/SPARK-36392 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.1.2 >Reporter: John Ayoub >Priority: Minor > > Please add support for the fixed-width API in pandas to Koalas. > [reference|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44464) Fix applyInPandasWithStatePythonRunner to output rows that have Null as first column value
[ https://issues.apache.org/jira/browse/SPARK-44464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744369#comment-17744369 ] Siying Dong commented on SPARK-44464: - PR created: [https://github.com/apache/spark/pull/42046] CC [~kabhwan] > Fix applyInPandasWithStatePythonRunner to output rows that have Null as first > column value > -- > > Key: SPARK-44464 > URL: https://issues.apache.org/jira/browse/SPARK-44464 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.3 >Reporter: Siying Dong >Priority: Major > > The current implementation of {{ApplyInPandasWithStatePythonRunner}} cannot > deal with outputs where the first column of the row is {{{}null{}}}, as it > cannot distinguish between a column that is genuinely null and a field left > as null padding because there are fewer data records than state records. This > causes incorrect results in the former case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44479) Support Python UDTFs with empty schema
[ https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-44479: -- Description: Support UDTFs with empty schema, for example: {code:python} >>> class TestUDTF: ... def eval(self): ... yield tuple() {code} Currently it fails with `useArrow=True`: {code:python} >>> udtf(TestUDTF, returnType=StructType())().collect() Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) {code} whereas without Arrow: {code:python} >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() [Row()] {code} Otherwise, we should raise an error without Arrow, too. was: Support UDTFs with empty schema, for example: {code:python} >>> class TestUDTF: ... def eval(self): ... yield tuple() {code} Currently it fails with `useArrow=True`: {code:python} >>> udtf(TestUDTF, returnType=StructType())().collect() Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) {code} whereas without Arrow: {code:python} >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() [Row()] {code} > Support Python UDTFs with empty schema > -- > > Key: SPARK-44479 > URL: https://issues.apache.org/jira/browse/SPARK-44479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Priority: Major > > Support UDTFs with empty schema, for example: > {code:python} > >>> class TestUDTF: > ... def eval(self): > ... yield tuple() > {code} > Currently it fails with `useArrow=True`: > {code:python} > >>> udtf(TestUDTF, returnType=StructType())().collect() > Traceback (most recent call last): > ... > ValueError: not enough values to unpack (expected 2, got 0) > {code} > whereas without Arrow: > {code:python} > >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() > [Row()] > {code} > Otherwise, we should raise an error without Arrow, too. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44479) Support Python UDTFs with empty schema
[ https://issues.apache.org/jira/browse/SPARK-44479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-44479: -- Description: Support UDTFs with empty schema, for example: {code:python} >>> class TestUDTF: ... def eval(self): ... yield tuple() {code} Currently it fails with `useArrow=True`: {code:python} >>> udtf(TestUDTF, returnType=StructType())().collect() Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) {code} whereas without Arrow: {code:python} >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() [Row()] {code} Otherwise, we should raise an error without Arrow, too, to be consistent. was: Support UDTFs with empty schema, for example: {code:python} >>> class TestUDTF: ... def eval(self): ... yield tuple() {code} Currently it fails with `useArrow=True`: {code:python} >>> udtf(TestUDTF, returnType=StructType())().collect() Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) {code} whereas without Arrow: {code:python} >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() [Row()] {code} Otherwise, we should raise an error without Arrow, too. > Support Python UDTFs with empty schema > -- > > Key: SPARK-44479 > URL: https://issues.apache.org/jira/browse/SPARK-44479 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Priority: Major > > Support UDTFs with empty schema, for example: > {code:python} > >>> class TestUDTF: > ... def eval(self): > ... yield tuple() > {code} > Currently it fails with `useArrow=True`: > {code:python} > >>> udtf(TestUDTF, returnType=StructType())().collect() > Traceback (most recent call last): > ... 
> ValueError: not enough values to unpack (expected 2, got 0) > {code} > whereas without Arrow: > {code:python} > >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() > [Row()] > {code} > Otherwise, we should raise an error without Arrow, too, to be consistent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44479) Support Python UDTFs with empty schema
Takuya Ueshin created SPARK-44479: - Summary: Support Python UDTFs with empty schema Key: SPARK-44479 URL: https://issues.apache.org/jira/browse/SPARK-44479 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Takuya Ueshin Support UDTFs with empty schema, for example: {code:python} >>> class TestUDTF: ... def eval(self): ... yield tuple() {code} Currently it fails with `useArrow=True`: {code:python} >>> udtf(TestUDTF, returnType=StructType())().collect() Traceback (most recent call last): ... ValueError: not enough values to unpack (expected 2, got 0) {code} whereas without Arrow: {code:python} >>> udtf(TestUDTF, returnType=StructType(), useArrow=False)().collect() [Row()] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
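The `ValueError` above can be reproduced in plain Python: unpacking an empty tuple into two targets raises exactly that message. A minimal sketch — the two-value unpack is an assumption about what the Arrow serialization path does internally, not the actual PySpark code:

```python
# Assumption (illustrative, not PySpark code): somewhere in the Arrow path,
# each yielded result is unpacked into two values; an empty tuple then fails.
def unpack_result(item):
    left, right = item  # ValueError when item has fewer than 2 elements
    return left, right

try:
    unpack_result(tuple())
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 0)
```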
[jira] [Commented] (SPARK-40296) Error Class for DISTINCT function not found
[ https://issues.apache.org/jira/browse/SPARK-40296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744346#comment-17744346 ] Ritika Maheshwari commented on SPARK-40296: --- Isn't dropDuplicates taking care of applying distinct to multiple columns? > Error Class for DISTINCT function not found > --- > > Key: SPARK-40296 > URL: https://issues.apache.org/jira/browse/SPARK-40296 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44478) Executor decommission causes stage failure
Dale Huettenmoser created SPARK-44478: - Summary: Executor decommission causes stage failure Key: SPARK-44478 URL: https://issues.apache.org/jira/browse/SPARK-44478 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 3.4.1, 3.4.0 Reporter: Dale Huettenmoser During spark execution, save fails due to executor decommissioning. Issue not present in 3.3.0 Sample error: {code:java} An error occurred while calling o8948.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but task commit success, data duplication may happen. reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is decommissioned.)) at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) 
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at
[jira] [Commented] (SPARK-44477) CheckAnalysis uses error subclass as an error class
[ https://issues.apache.org/jira/browse/SPARK-44477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744314#comment-17744314 ] Bruce Robbins commented on SPARK-44477: --- PR here: https://github.com/apache/spark/pull/42064 > CheckAnalysis uses error subclass as an error class > --- > > Key: SPARK-44477 > URL: https://issues.apache.org/jira/browse/SPARK-44477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Minor > > {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, > but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}. > {noformat} > spark-sql (default)> select bitmap_count(12); > [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' > org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error > class 'TYPE_CHECK_FAILURE_WITH_HINT' > at org.apache.spark.SparkException$.internalError(SparkException.scala:83) > at org.apache.spark.SparkException$.internalError(SparkException.scala:87) > at > org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68) > at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361) > at > scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594) > at > scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589) > at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73) > {noformat} > This issue only occurs when an expression uses > {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. > {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of > {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were > added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and > {{{}BitmapOrAgg{}}}. > {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use > {{{}TypeCheckResult.DataTypeMismatch{}}}. 
Regardless, the code in > {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be > corrected (or removed). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44477) CheckAnalysis uses error subclass as an error class
Bruce Robbins created SPARK-44477: - Summary: CheckAnalysis uses error subclass as an error class Key: SPARK-44477 URL: https://issues.apache.org/jira/browse/SPARK-44477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}. {noformat} spark-sql (default)> select bitmap_count(12); [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' at org.apache.spark.SparkException$.internalError(SparkException.scala:83) at org.apache.spark.SparkException$.internalError(SparkException.scala:87) at org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68) at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361) at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594) at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589) at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73) {noformat} This issue only occurs when an expression uses {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and {{{}BitmapOrAgg{}}}. {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use {{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be corrected (or removed). 
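The lookup failure is easy to model: message templates are keyed by main error class, with subclasses nested underneath, so a bare subclass name used as a top-level key misses. A sketch in Python — the dict layout and message strings are assumptions mirroring the shape of Spark's error-classes JSON, not the actual reader:

```python
# Hypothetical layout mirroring error-classes.json: main classes at the top
# level, subclasses nested under their parent class.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH": {
        "message": "Cannot resolve <sqlExpr> due to data type mismatch",
        "subClass": {
            "TYPE_CHECK_FAILURE_WITH_HINT": "<msg><hint>.",
        },
    },
}

def get_message_template(error_class):
    # Only main error classes are valid top-level keys.
    if error_class not in ERROR_CLASSES:
        raise RuntimeError(
            f"[INTERNAL_ERROR] Cannot find main error class '{error_class}'")
    return ERROR_CLASSES[error_class]["message"]

# Passing the bare subclass name, as CheckAnalysis does here, fails the lookup:
try:
    get_message_template("TYPE_CHECK_FAILURE_WITH_HINT")
except RuntimeError as e:
    print(e)
```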
[jira] [Commented] (SPARK-36392) pandas fixed width file support
[ https://issues.apache.org/jira/browse/SPARK-36392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744297#comment-17744297 ] John Ayoub commented on SPARK-36392: [~itholic] Hello, any update on this ticket? > pandas fixed width file support > --- > > Key: SPARK-36392 > URL: https://issues.apache.org/jira/browse/SPARK-36392 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.1.2 >Reporter: John Ayoub >Priority: Minor > > please add support for the fixed width api in pandas to koalas. > [reference|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
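For context on the request: `read_fwf` parses files whose columns occupy fixed character positions rather than being delimiter-separated. A minimal pure-Python illustration of the core idea — not the pandas implementation, and explicit widths are assumed rather than pandas' width inference:

```python
def parse_fwf_line(line, widths):
    """Split one fixed-width record into fields using explicit column widths."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields

# Columns: name (7 chars), age (3 chars), city (3 chars)
print(parse_fwf_line("Alice   30NYC", [7, 3, 3]))  # ['Alice', '30', 'NYC']
```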
[jira] [Resolved] (SPARK-44465) Upgrade zstd-jni to 1.5.5-5
[ https://issues.apache.org/jira/browse/SPARK-44465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44465. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42047 [https://github.com/apache/spark/pull/42047] > Upgrade zstd-jni to 1.5.5-5 > --- > > Key: SPARK-44465 > URL: https://issues.apache.org/jira/browse/SPARK-44465 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44465) Upgrade zstd-jni to 1.5.5-5
[ https://issues.apache.org/jira/browse/SPARK-44465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44465: - Assignee: BingKun Pan > Upgrade zstd-jni to 1.5.5-5 > --- > > Key: SPARK-44465 > URL: https://issues.apache.org/jira/browse/SPARK-44465 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44476) JobArtifactSet is populated with all artifacts if it is not associated with an artifact
Venkata Sai Akhil Gudesa created SPARK-44476: Summary: JobArtifactSet is populated with all artifacts if it is not associated with an artifact Key: SPARK-44476 URL: https://issues.apache.org/jira/browse/SPARK-44476 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0, 4.0.0 Reporter: Venkata Sai Akhil Gudesa Fix For: 3.5.0, 4.0.0 Consider each artifact type - files/jars/archives. For each artifact type, the following bug exists: # Initialise a `JobArtifactState` with no artifacts added to it. # Create a `JobArtifactSet` from the `JobArtifactState`. # Add an artifact with the same active `JobArtifactState`. # Create another `JobArtifactSet` In the current behaviour, the set created in step 2 contains all the artifacts (through `sc.allAddedFiles` for example) while the set created in step 4 contains only the single artifact added in step 3. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
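A hedged model of the reported behaviour — the names below are illustrative, not the Scala API: a snapshot taken while the active state has no artifacts falls back to everything ever added globally, instead of an empty set.

```python
# Illustrative model only: a per-state snapshot with no artifacts falls
# through to the global view (the reported bug), while a non-empty state
# yields just its own artifacts.
all_added_files = {"globally-added.jar"}   # e.g. sc.allAddedFiles
state_files = {}                           # artifacts per JobArtifactState

def job_artifact_set(state_id):
    files = state_files.get(state_id)
    if not files:            # buggy fallback: empty state -> all artifacts
        return set(all_added_files)
    return set(files)

snapshot_step2 = job_artifact_set("session-1")    # state has no artifacts yet
state_files["session-1"] = {"session-local.jar"}  # step 3: add one artifact
snapshot_step4 = job_artifact_set("session-1")

print(snapshot_step2)  # {'globally-added.jar'}  <- global artifacts leak in
print(snapshot_step4)  # {'session-local.jar'}   <- only the added artifact
```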
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-44448: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. 
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
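The state carry-over can be modelled in a few lines of Python. This is a simplified model of the iterator described in the report, not the actual Scala code: with the carried ordering row it reproduces the wrong result from the repro, and clearing that row on reset restores the correct one.

```python
def dense_rank_limit(rows, k, reset_current_row):
    """Simplified model: rows are (partition, order) pairs, pre-sorted.
    Emits rows whose dense_rank is <= k."""
    out, rank, current, prev_part = [], 0, None, None
    for part, order in rows:
        if part != prev_part:          # partition boundary: reset()
            rank = 0
            if reset_current_row:
                current = None         # the missing reset from the report
            prev_part = part
        else:                          # increaseRank() for subsequent rows
            if current is None:
                current = order
            elif order != current:
                rank += 1
                current = order
        if rank < k:
            out.append([part, order, rank + 1])
    return out

rows = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 2)]
print(dense_rank_limit(rows, 1, reset_current_row=False))  # wrong: 3 rows
print(dense_rank_limit(rows, 1, reset_current_row=True))   # correct: 4 rows
```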
[jira] [Created] (SPARK-44475) Relocate DataType and Parser to sql/api
Rui Wang created SPARK-44475: Summary: Relocate DataType and Parser to sql/api Key: SPARK-44475 URL: https://issues.apache.org/jira/browse/SPARK-44475 Project: Spark Issue Type: Sub-task Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-44467: - Fix Version/s: 4.0.0 (was: 3.5.0) > Setting master version to 4.0.0-SNAPSHOT > > > Key: SPARK-44467 > URL: https://issues.apache.org/jira/browse/SPARK-44467 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44467. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 42048 [https://github.com/apache/spark/pull/42048] > Setting master version to 4.0.0-SNAPSHOT > > > Key: SPARK-44467 > URL: https://issues.apache.org/jira/browse/SPARK-44467 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44467) Setting master version to 4.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-44467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44467: Assignee: Yang Jie > Setting master version to 4.0.0-SNAPSHOT > > > Key: SPARK-44467 > URL: https://issues.apache.org/jira/browse/SPARK-44467 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.
[ https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744133#comment-17744133 ] lvkaihua commented on SPARK-42972: -- I also encountered this issue and tested that the modification was correct > ExecutorAllocationManager cannot allocate new instances when all executors > down. > > > Key: SPARK-42972 > URL: https://issues.apache.org/jira/browse/SPARK-42972 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44396) Add direct Arrow deserialization
[ https://issues.apache.org/jira/browse/SPARK-44396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744127#comment-17744127 ] ASF GitHub Bot commented on SPARK-44396: User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/42011 > Add direct Arrow deserialization > > > Key: SPARK-44396 > URL: https://issues.apache.org/jira/browse/SPARK-44396 > Project: Spark > Issue Type: New Feature > Components: Connect, SQL >Affects Versions: 3.4.1 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44472) change the external catalog thread safety way
[ https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-44472: Attachment: add_hive_concurrent_connections.diff > change the external catalog thread safety way > - > > Key: SPARK-44472 > URL: https://issues.apache.org/jira/browse/SPARK-44472 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Izek Greenfield >Priority: Major > Attachments: add_hive_concurrent_connections.diff > > > We tested changing the synchronization of the external catalog to use > thread-locals instead of the synchronized methods. > In our tests, it improved the runtime of parallel actions by about 45% for > certain workloads (time reduced from ~15 min to ~9 min) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
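The proposed pattern, sketched in Python rather than Scala (illustrative only; the actual change is against Spark's external catalog internals): each thread lazily gets its own catalog client, so calls no longer serialize on a single lock.

```python
import threading

class PerThreadCatalogClient:
    """Sketch: one catalog client per thread, instead of a single shared
    client guarded by synchronized methods."""
    def __init__(self, client_factory):
        self._local = threading.local()
        self._factory = client_factory

    def get(self):
        # Lazily create a client the first time each thread asks for one.
        if not hasattr(self._local, "client"):
            self._local.client = self._factory()
        return self._local.client

# Demo: two threads each see their own client object.
clients = PerThreadCatalogClient(object)
seen = []
def worker():
    seen.append(clients.get())

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(seen[0] is seen[1])  # False: distinct per-thread clients
```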
[jira] [Updated] (SPARK-44474) Reenable "Test observe response" at SparkConnectServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44474: - Summary: Reenable "Test observe response" at SparkConnectServiceSuite (was: Reenable Test observe response at SparkConnectServiceSuite) > Reenable "Test observe response" at SparkConnectServiceSuite > > > Key: SPARK-44474 > URL: https://issues.apache.org/jira/browse/SPARK-44474 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > [https://github.com/apache/spark/pull/41443] apparently made the test flaky > (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44474: - Affects Version/s: 3.5.0 (was: 4.0.0) > Reenable Test observe response at SparkConnectServiceSuite > -- > > Key: SPARK-44474 > URL: https://issues.apache.org/jira/browse/SPARK-44474 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > [https://github.com/apache/spark/pull/41443] apparently made the test flaky > (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite
[ https://issues.apache.org/jira/browse/SPARK-44474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44474: - Priority: Blocker (was: Major) > Reenable Test observe response at SparkConnectServiceSuite > -- > > Key: SPARK-44474 > URL: https://issues.apache.org/jira/browse/SPARK-44474 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > [https://github.com/apache/spark/pull/41443] apparently made the test flaky > (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44474) Reenable Test observe response at SparkConnectServiceSuite
Hyukjin Kwon created SPARK-44474: Summary: Reenable Test observe response at SparkConnectServiceSuite Key: SPARK-44474 URL: https://issues.apache.org/jira/browse/SPARK-44474 Project: Spark Issue Type: Task Components: Connect Affects Versions: 4.0.0 Reporter: Hyukjin Kwon [https://github.com/apache/spark/pull/41443] apparently made the test flaky (or failed). We should reenable it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44468) Add daily test GA task for branch3.5
[ https://issues.apache.org/jira/browse/SPARK-44468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44468. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42050 [https://github.com/apache/spark/pull/42050] > Add daily test GA task for branch3.5 > > > Key: SPARK-44468 > URL: https://issues.apache.org/jira/browse/SPARK-44468 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44468) Add daily test GA task for branch3.5
[ https://issues.apache.org/jira/browse/SPARK-44468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44468: Assignee: BingKun Pan > Add daily test GA task for branch3.5 > > > Key: SPARK-44468 > URL: https://issues.apache.org/jira/browse/SPARK-44468 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44471) Change branches in build_and_test.yml for master branch
[ https://issues.apache.org/jira/browse/SPARK-44471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-44471: - Summary: Change branches in build_and_test.yml for master branch (was: Add Github action test job for branch-3.5) > Change branches in build_and_test.yml for master branch > --- > > Key: SPARK-44471 > URL: https://issues.apache.org/jira/browse/SPARK-44471 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44471) Add Github action test job for branch-3.5
[ https://issues.apache.org/jira/browse/SPARK-44471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44471. -- Resolution: Fixed Issue resolved by pull request 42057 [https://github.com/apache/spark/pull/42057] > Add Github action test job for branch-3.5 > - > > Key: SPARK-44471 > URL: https://issues.apache.org/jira/browse/SPARK-44471 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.5.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.
[ https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744092#comment-17744092 ] liang yu commented on SPARK-42972: -- [~tdas] I created a PR [PR-42058|https://github.com/apache/spark/pull/42058] on github, would you please help me review it? > ExecutorAllocationManager cannot allocate new instances when all executors > down. > > > Key: SPARK-42972 > URL: https://issues.apache.org/jira/browse/SPARK-42972 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.2 >Reporter: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44473) Overwriting the same partition of a partitioned table multiple times with empty data yields non-idempotent results
chris Yu created SPARK-44473: Summary: Overwriting the same partition of a partitioned table multiple times with empty data yields non-idempotent results Key: SPARK-44473 URL: https://issues.apache.org/jira/browse/SPARK-44473 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.3.2, 3.2.4, 3.1.3 Environment: spark : 3.x Reporter: chris Yu Preparation: Create a simple partitioned table using Spark 3.x, for example: {code:java} spark-sql> create table test1 (a int) partitioned by (dt string); Time taken: 0.219 seconds{code} * Overwrite a new partition with empty data; the partition information and the corresponding HDFS path are generated, for example: {code:java} spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 1 <> 1; Time taken: 0.992 seconds spark-sql> dfs -ls /user/hive/warehouse/test1; Found 2 items -rw-r--r-- 2 hadoop hadoop 0 2023-07-18 14:41 /user/hive/warehouse/test1/_SUCCESS drwxrwxrwx- hadoop hadoop 0 2023-07-18 14:41 /user/hive/warehouse/test1/dt=20230702 spark-sql> show partitions test1; dt=20230702 Time taken: 0.162 seconds, Fetched 1 row(s) {code} * When re-running the same insert overwrite statement, the HDFS path corresponding to this partition no longer exists: {code:java} spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 1 <> 1; Time taken: 0.706 seconds spark-sql> dfs -ls /user/hive/warehouse/test1; Found 1 items -rw-r--r-- 2 hadoop hadoop 0 2023-07-18 14:45 /user/hive/warehouse/test1/_SUCCESS spark-sql> show partitions test1; dt=20230702 Time taken: 0.183 seconds, Fetched 1 row(s){code} Subsequent tasks that rely on this HDFS path then throw a path-does-not-exist exception, which caused us trouble. I expected executing the same statement multiple times to yield the same result, i.e. the operation should be *idempotent*. Thanks.
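The reported behavior can be modeled in a few lines of plain Python. This is a toy sketch of overwrite semantics under stated assumptions (delete-then-write output, with a separate directory-creation step when a brand-new partition is registered); it is not Spark's actual code path, and `insert_overwrite` is a hypothetical helper:

```python
import os
import shutil
import tempfile

def insert_overwrite(table_dir, partition, rows):
    """Toy model of INSERT OVERWRITE into one static partition.

    Assumptions (hypothetical, not Spark's real implementation):
    - overwriting an existing partition first deletes its directory;
    - output files (and their parent directory) are only written when
      there are rows to write;
    - a brand-new partition gets its directory created when it is
      registered, even with zero rows.
    Returns True if the partition directory exists afterwards.
    """
    part_dir = os.path.join(table_dir, partition)
    is_new = not os.path.isdir(part_dir)
    if not is_new:
        shutil.rmtree(part_dir)  # overwrite semantics: drop old data first
    for i, row in enumerate(rows):  # empty input -> no files, no mkdir
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, f"part-{i:05d}"), "w") as f:
            f.write(str(row))
    if is_new:
        # registering a new partition creates its location
        os.makedirs(part_dir, exist_ok=True)
    return os.path.isdir(part_dir)

with tempfile.TemporaryDirectory() as warehouse:
    print(insert_overwrite(warehouse, "dt=20230702", []))  # True: first run creates dt=20230702
    print(insert_overwrite(warehouse, "dt=20230702", []))  # False: second run deletes it, writes nothing
```

Under these assumptions the first and second runs of the identical statement leave the filesystem in different states, which is exactly the non-idempotence the report describes.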
[jira] [Resolved] (SPARK-44451) Make built document downloadable
[ https://issues.apache.org/jira/browse/SPARK-44451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44451. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42028 [https://github.com/apache/spark/pull/42028] > Make built document downloadable > > > Key: SPARK-44451 > URL: https://issues.apache.org/jira/browse/SPARK-44451 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-44451) Make built document downloadable
[ https://issues.apache.org/jira/browse/SPARK-44451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44451: - Assignee: Ruifeng Zheng > Make built document downloadable > > > Key: SPARK-44451 > URL: https://issues.apache.org/jira/browse/SPARK-44451 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major >
[jira] [Created] (SPARK-44472) change the external catalog thread safety way
Izek Greenfield created SPARK-44472: --- Summary: change the external catalog thread safety way Key: SPARK-44472 URL: https://issues.apache.org/jira/browse/SPARK-44472 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: Izek Greenfield We tested changing the synchronization of the external catalog to use thread-locals instead of synchronized methods. In our tests, this improved the runtime of parallel actions by about 45% for certain workloads (time reduced from ~15 min to ~9 min).
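The trade-off being proposed can be sketched in plain Python with `threading`. This is a hedged illustration of the general pattern (a single shared lock vs. a per-thread snapshot), not Spark's `ExternalCatalog` code; both class names and the snapshot strategy are assumptions:

```python
import threading

class SynchronizedCatalog:
    """Baseline: every lookup serializes on one shared lock,
    modeling the current `synchronized` catalog methods."""
    def __init__(self, tables):
        self._lock = threading.Lock()
        self._tables = tables

    def get_table(self, name):
        with self._lock:
            return self._tables[name]

class ThreadLocalCatalog:
    """Sketch of the proposed direction: each thread lazily copies a
    thread-local snapshot and serves lookups from it without taking
    the shared lock (hypothetical, not Spark's implementation)."""
    def __init__(self, tables):
        self._shared = tables
        self._local = threading.local()

    def get_table(self, name):
        snapshot = getattr(self._local, "tables", None)
        if snapshot is None:
            snapshot = dict(self._shared)  # copied once per thread
            self._local.tables = snapshot
        return snapshot[name]
```

The speedup comes from removing lock contention on the hot read path; the cost is that each thread's snapshot can go stale, so a real change would also need an invalidation or write-through story.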
[jira] [Created] (SPARK-44471) Add Github action test job for branch-3.5
Yuanjian Li created SPARK-44471: --- Summary: Add Github action test job for branch-3.5 Key: SPARK-44471 URL: https://issues.apache.org/jira/browse/SPARK-44471 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 3.5.0 Reporter: Yuanjian Li Assignee: Yuanjian Li Fix For: 3.5.0
[jira] [Resolved] (SPARK-43967) Support Python UDTFs with empty return values
[ https://issues.apache.org/jira/browse/SPARK-43967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43967. -- Fix Version/s: 3.5.0 Assignee: Allison Wang Resolution: Fixed Fixed in https://github.com/apache/spark/pull/42044 > Support Python UDTFs with empty return values > - > > Key: SPARK-43967 > URL: https://issues.apache.org/jira/browse/SPARK-43967 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.5.0 > > > Support UDTFs with empty returns, for example: > {code:java} > @udtf(returnType="a: int") > class TestUDTF: > def eval(self, a: int): > ... {code} > Currently, this will fail with the exception > {code:java} > TypeError: 'NoneType' object is not iterable {code} > Another example: > {code:java} > class TestUDTF: > def eval(self, a: int): > yield {code} > This will fail with the exception > {code:java} > java.lang.NullPointerException {code} > Note, arrow-optimized UDTFs already support this. This error only occurs with > regular Python UDTFs.
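Both failure modes can be reproduced in plain Python, independent of Spark. The sketch below is a minimal illustration, not the PySpark worker code; `collect_rows` is a hypothetical stand-in for the worker loop that iterates whatever `eval` produced:

```python
class ReturnsNothing:
    def eval(self, a):
        ...  # no return or yield: calling eval returns None

class YieldsBareNone:
    def eval(self, a):
        yield  # a generator that produces a single None "row"

def collect_rows(udtf, a):
    """Worker-style loop: iterate whatever eval produced."""
    return list(udtf.eval(a))

try:
    collect_rows(ReturnsNothing(), 1)
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable

print(collect_rows(YieldsBareNone(), 1))  # [None]
```

The first shape hands the iterator loop `None`, giving the reported `TypeError`; the second yields a `None` row, which plausibly surfaces as the `NullPointerException` once the row reaches the JVM side.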