[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488612#comment-17488612 ]

angerszhu commented on SPARK-35531:
-----------------------------------

Sure.

> Can not insert into hive bucket table if create table with upper case schema
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-35531
>                 URL: https://issues.apache.org/jira/browse/SPARK-35531
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.1, 3.2.0
>            Reporter: Hongyi Zhang
>            Assignee: angerszhu
>            Priority: Major
>             Fix For: 3.3.0
>
> {code:sql}
> create table TEST1(
>   V1 BIGINT,
>   S1 INT)
> partitioned by (PK BIGINT)
> clustered by (V1)
> sorted by (S1)
> into 200 buckets
> STORED AS PARQUET;
>
> insert into test1
> select * from values (1, 1, 1);
> {code}
>
> {noformat}
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null),
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException:
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null),
> FieldSchema(name:s1, type:int, comment:null)]
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
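The error above arises because Spark hands the bucket column name to Hive as written in the DDL (`V1`) while Hive stores the table's FieldSchema names lowercased (`v1`), so a case-sensitive comparison rejects a column that is actually present. A minimal sketch of the case-insensitive match the validation needs (illustrative Python, not Spark's actual fix; the function name is invented):

```python
def bucket_cols_valid(bucket_cols, table_cols):
    """Check bucket columns against table columns ignoring case,
    mirroring the fact that Hive lowercases FieldSchema names."""
    lowered = {c.lower() for c in table_cols}
    return all(c.lower() in lowered for c in bucket_cols)

# 'V1' vs Hive's stored 'v1': a case-sensitive comparison fails,
# a case-insensitive one accepts it.
print(bucket_cols_valid(["V1"], ["v1", "s1"]))   # True
print(bucket_cols_valid(["X9"], ["v1", "s1"]))   # False
```

In Spark itself the equivalent normalization is governed by the analyzer's case-insensitive resolution, so the fix belongs where bucket specs are translated to Hive metadata rather than in user DDL.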
[jira] [Commented] (SPARK-38097) Improve the error for pivoting of unsupported value types
[ https://issues.apache.org/jira/browse/SPARK-38097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488614#comment-17488614 ]

Apache Spark commented on SPARK-38097:
--------------------------------------

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/35433

> Improve the error for pivoting of unsupported value types
> ---------------------------------------------------------
>
>                 Key: SPARK-38097
>                 URL: https://issues.apache.org/jira/browse/SPARK-38097
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> The error message from:
> {code:scala}
> test("Improve the error for pivoting of unsupported value types") {
>   trainingSales
>     .groupBy($"sales.year")
>     .pivot(struct(lower($"sales.course"), $"training"))
>     .agg(sum($"sales.earnings"))
>     .show(false)
> }
> {code}
> can confuse users:
> {code:java}
> The feature is not supported: literal for '[dotnet,Dummies]' of class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal
> for '[dotnet,Dummies]' of class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:245)
>   at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99)
>   at org.apache.spark.sql.RelationalGroupedDataset.$anonfun$pivot$2(RelationalGroupedDataset.scala:455)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> {code}
> The error message needs to be improved and made more precise.
> See https://github.com/apache/spark/pull/35302#discussion_r793629370
[jira] [Created] (SPARK-38135) Introduce `spark.kubernetes.job` scheduling related configurations
Yikun Jiang created SPARK-38135:
-----------------------------------

             Summary: Introduce `spark.kubernetes.job` scheduling related configurations
                 Key: SPARK-38135
                 URL: https://issues.apache.org/jira/browse/SPARK-38135
             Project: Spark
          Issue Type: Sub-task
          Components: Kubernetes
    Affects Versions: 3.3.0
            Reporter: Yikun Jiang


spark.kubernetes.job.minCPU: the minimum cpu resources for running
spark.kubernetes.job.minMemory: the minimum memory resources for running
spark.kubernetes.job.minMember: the minimum number of pods for running
spark.kubernetes.job.priorityClassName: the priority of the running job
spark.kubernetes.job.queue: the queue to which the running job belongs
[jira] [Updated] (SPARK-38135) Introduce `spark.kubernetes.job` scheduling related configurations
[ https://issues.apache.org/jira/browse/SPARK-38135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yikun Jiang updated SPARK-38135:
--------------------------------
    Description: 
spark.kubernetes.job.minCPU: the minimum cpu resources for running job
spark.kubernetes.job.minMemory: the minimum memory resources for running job
spark.kubernetes.job.minMember: the minimum number of pods for running job
spark.kubernetes.job.priorityClassName: the priority of the running job
spark.kubernetes.job.queue: the queue to which the running job belongs

  was:
spark.kubernetes.job.minCPU: the minimum cpu resources for running
spark.kubernetes.job.minMemory: the minimum memory resources for running
spark.kubernetes.job.minMember: the minimum number of pods for running
spark.kubernetes.job.priorityClassName: the priority of the running job
spark.kubernetes.job.queue: the queue to which the running job belongs


> Introduce `spark.kubernetes.job` scheduling related configurations
> ------------------------------------------------------------------
>
>                 Key: SPARK-38135
>                 URL: https://issues.apache.org/jira/browse/SPARK-38135
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: 3.3.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> spark.kubernetes.job.minCPU: the minimum cpu resources for running job
> spark.kubernetes.job.minMemory: the minimum memory resources for running job
> spark.kubernetes.job.minMember: the minimum number of pods for running job
> spark.kubernetes.job.priorityClassName: the priority of the running job
> spark.kubernetes.job.queue: the queue to which the running job belongs
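The five entries above read naturally as a block of Spark configuration properties. A sketch of how they might look once set (note: these keys are only *proposed* in this ticket and do not exist in any released Spark; the values shown are invented for illustration):

```python
# Proposed spark.kubernetes.job.* keys from SPARK-38135. Names and value
# formats are assumptions taken from the ticket description, not a released API.
proposed_job_confs = {
    "spark.kubernetes.job.minCPU": "4",                  # minimum CPU before the job may run
    "spark.kubernetes.job.minMemory": "8g",              # minimum memory before the job may run
    "spark.kubernetes.job.minMember": "5",               # minimum number of pods (gang scheduling)
    "spark.kubernetes.job.priorityClassName": "high",    # priority of the running job
    "spark.kubernetes.job.queue": "default",             # scheduler queue the job belongs to
}

# Like any other Spark conf, they would be passed e.g. as
#   spark-submit --conf spark.kubernetes.job.queue=default ...
for key, value in sorted(proposed_job_confs.items()):
    print(f"--conf {key}={value}")
```

Grouping them under a single `spark.kubernetes.job.` prefix would let a custom Kubernetes scheduler (e.g. a batch/gang scheduler) pick up all job-level scheduling hints in one place.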
[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-38124:
---------------------------------
    Summary: Revive HashClusteredDistribution and apply to stream-stream join  (was: Revive HashClusteredDistribution and apply to all stateful operators)

> Revive HashClusteredDistribution and apply to stream-stream join
> ----------------------------------------------------------------
>
>                 Key: SPARK-38124
>                 URL: https://issues.apache.org/jira/browse/SPARK-38124
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Priority: Blocker
>              Labels: correctness
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate
> distribution requirement for stateful operators, because the requirement of
> ClusteredDistribution is too relaxed while the requirement on the physical
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require child distribution as
> HashClusteredDistribution, with the below major assumptions:
> # HashClusteredDistribution creates HashPartitioning, and we will never
> change it in the future.
> # We will never change the implementation of {{partitionIdExpression}} in
> HashPartitioning in the future, so that the Partitioner behaves consistently
> across Spark versions.
> # No partitioning except HashPartitioning can satisfy
> HashClusteredDistribution.
>
> We should revive HashClusteredDistribution (probably renamed to refer
> specifically to stateful operators) and apply the distribution to all
> stateful operators.
> SPARK-35703 only touched stream-stream join, which means the other stateful
> operators already used ClusteredDistribution and hence have been broken for
> a long time.
[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-38124:
---------------------------------
    Labels:   (was: correctness)

> Revive HashClusteredDistribution and apply to stream-stream join
> ----------------------------------------------------------------
>
>                 Key: SPARK-38124
>                 URL: https://issues.apache.org/jira/browse/SPARK-38124
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Priority: Blocker
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate
> distribution requirement for stateful operators, because the requirement of
> ClusteredDistribution is too relaxed while the requirement on the physical
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require child distribution as
> HashClusteredDistribution, with the below major assumptions:
> # HashClusteredDistribution creates HashPartitioning, and we will never
> change it in the future.
> # We will never change the implementation of {{partitionIdExpression}} in
> HashPartitioning in the future, so that the Partitioner behaves consistently
> across Spark versions.
> # No partitioning except HashPartitioning can satisfy
> HashClusteredDistribution.
>
> We should revive HashClusteredDistribution (probably renamed to refer
> specifically to stateful operators) and apply the distribution to all
> stateful operators.
> SPARK-35703 only touched stream-stream join, which means stream-stream join
> hasn't been broken in actual releases. Let's aim for the partial revert of
> SPARK-35703 in this ticket, and have another ticket to deal with the other
> stateful operators, which have been broken since their introduction (2.2+).
[jira] [Updated] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-38124:
---------------------------------
    Description: 
SPARK-35703 removed HashClusteredDistribution and replaced its usages with ClusteredDistribution.

While this works great for non-stateful operators, we still need a separate distribution requirement for stateful operators, because the requirement of ClusteredDistribution is too relaxed while the requirement on the physical partitioning of stateful operators is quite strict.

In most cases, stateful operators must require child distribution as HashClusteredDistribution, with the below major assumptions:
# HashClusteredDistribution creates HashPartitioning, and we will never change it in the future.
# We will never change the implementation of {{partitionIdExpression}} in HashPartitioning in the future, so that the Partitioner behaves consistently across Spark versions.
# No partitioning except HashPartitioning can satisfy HashClusteredDistribution.

We should revive HashClusteredDistribution (probably renamed to refer specifically to stateful operators) and apply the distribution to all stateful operators.

SPARK-35703 only touched stream-stream join, which means stream-stream join hasn't been broken in actual releases. Let's aim for the partial revert of SPARK-35703 in this ticket, and have another ticket to deal with the other stateful operators, which have been broken since their introduction (2.2+).

  was:
SPARK-35703 removed HashClusteredDistribution and replaced its usages with ClusteredDistribution.

While this works great for non-stateful operators, we still need a separate distribution requirement for stateful operators, because the requirement of ClusteredDistribution is too relaxed while the requirement on the physical partitioning of stateful operators is quite strict.

In most cases, stateful operators must require child distribution as HashClusteredDistribution, with the below major assumptions:
# HashClusteredDistribution creates HashPartitioning, and we will never change it in the future.
# We will never change the implementation of {{partitionIdExpression}} in HashPartitioning in the future, so that the Partitioner behaves consistently across Spark versions.
# No partitioning except HashPartitioning can satisfy HashClusteredDistribution.

We should revive HashClusteredDistribution (probably renamed to refer specifically to stateful operators) and apply the distribution to all stateful operators.

SPARK-35703 only touched stream-stream join, which means the other stateful operators already used ClusteredDistribution and hence have been broken for a long time.


> Revive HashClusteredDistribution and apply to stream-stream join
> ----------------------------------------------------------------
>
>                 Key: SPARK-38124
>                 URL: https://issues.apache.org/jira/browse/SPARK-38124
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Priority: Blocker
>              Labels: correctness
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate
> distribution requirement for stateful operators, because the requirement of
> ClusteredDistribution is too relaxed while the requirement on the physical
> partitioning of stateful operators is quite strict.
> In most cases, stateful operators must require child distribution as
> HashClusteredDistribution, with the below major assumptions:
> # HashClusteredDistribution creates HashPartitioning, and we will never
> change it in the future.
> # We will never change the implementation of {{partitionIdExpression}} in
> HashPartitioning in the future, so that the Partitioner behaves consistently
> across Spark versions.
> # No partitioning except HashPartitioning can satisfy
> HashClusteredDistribution.
>
> We should revive HashClusteredDistribution (probably renamed to refer
> specifically to stateful operators) and apply the distribution to all
> stateful operators.
> SPARK-35703 only touched stream-stream join, which means stream-stream join
> hasn't been broken in actual releases. Let's aim for the partial revert of
> SPARK-35703 in this ticket, and have another ticket to deal with the other
> stateful operators, which have been broken since their introduction (2.2+).
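Assumption 2 in the list above carries the correctness argument: HashPartitioning maps a row to a partition as pmod(hash(bucket keys), numPartitions), and that mapping must stay identical across Spark versions, otherwise state saved by one version would be read back from the wrong partition. A toy illustration of the pmod step only (Python; the real implementation first hashes the keys with Murmur3, which is omitted here):

```python
def partition_id(key_hash: int, num_partitions: int) -> int:
    """Non-negative modulo (pmod), sketching how a key hash becomes a
    partition id. Python's % already yields a non-negative result for a
    positive divisor; the explicit adjustment mirrors JVM pmod semantics,
    where % can return negative values for negative operands."""
    r = key_hash % num_partitions
    return r if r >= 0 else r + num_partitions

# The invariant stateful operators rely on: the same key hash must map to
# the same partition on every run and every Spark version.
print(partition_id(7, 4))    # 3
print(partition_id(-3, 4))   # 1
```

If either the hash function or this mapping changed between versions, a restarted query would find each key's state file on a different partition than the one the shuffle now routes the key to, silently producing wrong results.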
[jira] [Created] (SPARK-38136) Update GitHub Action test image
Dongjoon Hyun created SPARK-38136:
-------------------------------------

             Summary: Update GitHub Action test image
                 Key: SPARK-38136
                 URL: https://issues.apache.org/jira/browse/SPARK-38136
             Project: Spark
          Issue Type: Test
          Components: Project Infra, Tests
    Affects Versions: 3.3.0
            Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-38136) Update GitHub Action test image
[ https://issues.apache.org/jira/browse/SPARK-38136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488638#comment-17488638 ]

Apache Spark commented on SPARK-38136:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35434

> Update GitHub Action test image
> -------------------------------
>
>                 Key: SPARK-38136
>                 URL: https://issues.apache.org/jira/browse/SPARK-38136
>             Project: Spark
>          Issue Type: Test
>          Components: Project Infra, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
[jira] [Assigned] (SPARK-38136) Update GitHub Action test image
[ https://issues.apache.org/jira/browse/SPARK-38136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38136:
------------------------------------
    Assignee: Apache Spark

> Update GitHub Action test image
> -------------------------------
>
>                 Key: SPARK-38136
>                 URL: https://issues.apache.org/jira/browse/SPARK-38136
>             Project: Spark
>          Issue Type: Test
>          Components: Project Infra, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Assigned] (SPARK-38136) Update GitHub Action test image
[ https://issues.apache.org/jira/browse/SPARK-38136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38136:
------------------------------------
    Assignee:   (was: Apache Spark)

> Update GitHub Action test image
> -------------------------------
>
>                 Key: SPARK-38136
>                 URL: https://issues.apache.org/jira/browse/SPARK-38136
>             Project: Spark
>          Issue Type: Test
>          Components: Project Infra, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
[jira] [Commented] (SPARK-37500) Incorrect scope when using named_windows in CTEs
[ https://issues.apache.org/jira/browse/SPARK-37500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488649#comment-17488649 ]

Lauri Koobas commented on SPARK-37500:
--------------------------------------

This particular case is indeed fixed, but I believe it was fixed in a bad way.

This works (notice the JOIN is commented out):
{noformat}
create or replace temporary view test_temp_view as
with step_1 as (
    select *
        , min(a) over w2 as min_a_over_w1
    from (select 1 as a, 2 as b, 3 as c)
    window w2 as (partition by b order by c)
)
, step_2 as (
    select *
    from (select 1 as e, 2 as f, 3 as g)
    --join step_1 on true
    window w1 as (partition by f order by g)
)
select * from step_2
{noformat}

This doesn't:
{noformat}
create or replace temporary view test_temp_view as
with step_1 as (
    select *
        , min(a) over w2 as min_a_over_w1
    from (select 1 as a, 2 as b, 3 as c)
    window w2 as (partition by b order by c)
)
, step_2 as (
    select *
    from (select 1 as e, 2 as f, 3 as g)
    join step_1 on true
    window w1 as (partition by f order by g)
)
select * from step_2
{noformat}

It seems that JOIN-ing one CTE that has a named window to another CTE with a named window will overwrite/clear some scope. Somehow it also ONLY applies when creating a view, but not when just running the query.

> Incorrect scope when using named_windows in CTEs
> ------------------------------------------------
>
>                 Key: SPARK-37500
>                 URL: https://issues.apache.org/jira/browse/SPARK-37500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.0
>         Environment: Databricks Runtime 9.0, 9.1, 10.0
>            Reporter: Lauri Koobas
>            Priority: Major
>
> This works, but shouldn't. The named window is defined outside the CTE that
> uses it.
> {code:sql}
> with step_1 as (
>     select *
>         , min(a) over w1 as min_a_over_w1
>     from (select 1 as a, 2 as b, 3 as c)
> )
> select *
> from step_1
> window w1 as (partition by b order by c)
> {code}
[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488650#comment-17488650 ]

LLiu commented on SPARK-38033:
------------------------------

Hi [~kabhwan], this is a good idea (y). Thanks for your suggestion; I'll try to provide a better error message.

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
> --------------------------------------------------------------------------
>
>                 Key: SPARK-38033
>                 URL: https://issues.apache.org/jira/browse/SPARK-38033
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.6, 3.0.3
>            Reporter: LLiu
>            Priority: Major
>
> Streaming processing could not start due to an unexpected machine shutdown.
> The exception is as follows:
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query streaming_4a026335eafd4bb498ee51752b49f7fb [id = 647ba9e4-16d2-4972-9824-6f9179588806, runId = 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
> I checked the checkpoint files on HDFS and found the latest offset is 113259
> but the latest commit is 113257. The listings follow:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offsets
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259
> {code}
> Finally, I deleted the offset file "/tmp/streaming_/offsets/113259" and the
> program started normally. I think there is a problem here and we should try
> to handle this exception or give some resolution in the log.
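The invariant the reporter tripped over can be stated in a few lines: on restart, the newest entry in the offsets/ log may be at most one batch ahead of the newest entry in commits/; the checkpoint above has offsets up to 113259 but commits only up to 113257 (and offset 113258 is missing entirely). A hypothetical sketch of that consistency check (not Spark's actual code; the function name and return convention are invented for illustration):

```python
def resolve_start_batch(offset_ids, commit_ids):
    """Decide which batch to run on restart from a checkpoint's offset and
    commit logs. Loosely mirrors the expectation in
    MicroBatchExecution.populateStartOffsets: offsets may be at most one
    batch ahead of commits."""
    latest_offset = max(offset_ids)
    latest_commit = max(commit_ids) if commit_ids else -1
    if latest_offset == latest_commit:
        return latest_offset + 1   # last batch fully committed: start the next one
    if latest_offset == latest_commit + 1:
        return latest_offset       # re-run the planned-but-uncommitted batch
    # Anything else means the checkpoint is inconsistent, as in this report:
    # the previous offset file (e.g. 113258) cannot be found.
    raise ValueError(f"batch {latest_offset - 1} doesn't exist")
```

A clearer error at this point could name both log positions and suggest the manual remedy the reporter found (removing the orphaned offset file), instead of the bare "batch N doesn't exist".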