[jira] [Commented] (SPARK-49281) Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost
[ https://issues.apache.org/jira/browse/SPARK-49281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875698#comment-17875698 ] XiDuo You commented on SPARK-49281: --- resolved by https://github.com/apache/spark/pull/47797 > Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost > --- > > Key: SPARK-49281 > URL: https://issues.apache.org/jira/browse/SPARK-49281 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
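The `getBytes()` vs. `getBytesUnsafe()` trade-off this ticket refers to is the classic defensive-copy vs. zero-copy accessor pattern. The sketch below illustrates that pattern only; the `BinaryValue` class is a hypothetical stand-in, not Parquet's actual `Binary` implementation. The safe accessor pays an allocation plus an O(n) copy on every call, while the unsafe one returns the backing array directly and shifts the aliasing responsibility to the caller.

```java
import java.util.Arrays;

// Hypothetical illustration of the copy vs. zero-copy accessor pattern
// behind the getBytes()/getBytesUnsafe() distinction. Not Parquet's code.
final class BinaryValue {
    private final byte[] backing;

    BinaryValue(byte[] backing) {
        this.backing = backing;
    }

    // Safe accessor: returns a defensive copy, so callers may mutate the
    // result freely. Costs an allocation plus an O(n) copy per call.
    byte[] getBytes() {
        return Arrays.copyOf(backing, backing.length);
    }

    // Unsafe accessor: returns the backing array itself. No copy cost, but
    // the caller must not mutate it or hold it past the value's lifetime.
    byte[] getBytesUnsafe() {
        return backing;
    }
}

public class Main {
    public static void main(String[] args) {
        BinaryValue v = new BinaryValue(new byte[] {1, 2, 3});
        byte[] copy = v.getBytes();
        byte[] direct = v.getBytesUnsafe();
        // Same content either way...
        System.out.println(Arrays.equals(copy, direct));  // true
        // ...but only the unsafe accessor aliases the backing array.
        System.out.println(direct == v.getBytesUnsafe()); // true
        System.out.println(copy == v.getBytesUnsafe());   // false
    }
}
```

Avoiding the per-value copy matters most in tight per-row decoding loops, which is presumably the motivation for the linked pull request.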
[jira] [Resolved] (SPARK-49281) Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost
[ https://issues.apache.org/jira/browse/SPARK-49281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-49281. --- Fix Version/s: 4.0.0 Assignee: xy Resolution: Fixed > Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost > --- > > Key: SPARK-49281 > URL: https://issues.apache.org/jira/browse/SPARK-49281 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost
[jira] [Updated] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError
[ https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-49179: -- Affects Version/s: 3.5.2 (was: 3.5.1) > Fix v2 multi bucketed inner joins throw AssertionError > -- > > Key: SPARK-49179 > URL: https://issues.apache.org/jira/browse/SPARK-49179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.3.4, 3.5.2, 3.4.3 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > {code:java} > [info] Cause: java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:264) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385) > [info] at scala.collection.immutable.List.map(List.scala:247) > [info] at scala.collection.immutable.List.map(List.scala:79) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) > [info] at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51) > [info] at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882) > [in > {code}
[jira] [Updated] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError
[ https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-49179: -- Fix Version/s: 3.4.4 3.5.3 > Fix v2 multi bucketed inner joins throw AssertionError > -- > > Key: SPARK-49179 > URL: https://issues.apache.org/jira/browse/SPARK-49179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.4.4, 3.5.3 > > > {code:java} > [info] Cause: java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:264) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385) > [info] at scala.collection.immutable.List.map(List.scala:247) > [info] at scala.collection.immutable.List.map(List.scala:79) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) > [info] at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51) > [info] at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882) > [in > {code}
[jira] [Resolved] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError
[ https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-49179. --- Fix Version/s: 3.4.4 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47683 [https://github.com/apache/spark/pull/47683] > Fix v2 multi bucketed inner joins throw AssertionError > -- > > Key: SPARK-49179 > URL: https://issues.apache.org/jira/browse/SPARK-49179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.4.4, 4.0.0, 3.5.3 > > > {code:java} > [info] Cause: java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:264) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385) > [info] at scala.collection.immutable.List.map(List.scala:247) > [info] at scala.collection.immutable.List.map(List.scala:79) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528) > [info] at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51) > [info] at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882) > [in > {code}
[jira] [Assigned] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError
[ https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-49179: - Assignee: XiDuo You > Fix v2 multi bucketed inner joins throw AssertionError > -- > > Key: SPARK-49179 > URL: https://issues.apache.org/jira/browse/SPARK-49179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > > {code:java} > [info] Cause: java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:264) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385) > [info] at scala.collection.immutable.List.map(List.scala:247) > [info] at scala.collection.immutable.List.map(List.scala:79) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) > [info] at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689) > [info] at > org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51) > [info] at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882) > [in > {code}
[jira] [Resolved] (SPARK-49200) Fix null type non-codegen ordering exception
[ https://issues.apache.org/jira/browse/SPARK-49200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-49200. --- Fix Version/s: 4.0.0 3.5.3 Resolution: Fixed Issue resolved by pull request 47707 [https://github.com/apache/spark/pull/47707] > Fix null type non-codegen ordering exception > > > Key: SPARK-49200 > URL: https://issues.apache.org/jira/browse/SPARK-49200 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.2 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.5.3 > > > {code:java} > set spark.sql.codegen.factoryMode=NO_CODEGEN; > set > spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.EliminateSorts; > select * from range(10) order by array(null); > {code} > {code:java} > org.apache.spark.SparkIllegalArgumentException: Type PhysicalNullType does > not support ordered operations. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:352) > at > org.apache.spark.sql.catalyst.types.PhysicalNullType.ordering(PhysicalDataType.scala:246) > at > org.apache.spark.sql.catalyst.types.PhysicalNullType.ordering(PhysicalDataType.scala:243) > at > org.apache.spark.sql.catalyst.types.PhysicalArrayType$$anon$1.(PhysicalDataType.scala:283) > at > org.apache.spark.sql.catalyst.types.PhysicalArrayType.interpretedOrdering$lzycompute(PhysicalDataType.scala:281) > at > org.apache.spark.sql.catalyst.types.PhysicalArrayType.interpretedOrdering(PhysicalDataType.scala:281) > at > org.apache.spark.sql.catalyst.types.PhysicalArrayType.ordering(PhysicalDataType.scala:277) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:67) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:40) > at > 
org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:70) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:44) > {code}
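Per the SPARK-49200 stack trace above, the interpreted (non-codegen) path fails because requesting an element ordering for the null type throws outright. The sketch below uses simplified, hypothetical stand-ins (not Spark's actual `PhysicalDataType` classes) to show why a null-only type can still supply a valid, trivial total order instead of throwing: every value of the type is NULL, so a constant comparator satisfies the ordering contract.

```java
import java.util.Comparator;

// Hypothetical illustration of the interpreted-ordering failure in
// SPARK-49200. These are simplified stand-ins, not Spark's code.
public class Main {
    // Broken shape: requesting an ordering for the null type fails, which
    // is what the interpreted path hit for `order by array(null)`.
    static Comparator<Object> orderingOrThrow() {
        throw new IllegalArgumentException(
            "Type PhysicalNullType does not support ordered operations.");
    }

    // Fixed shape: all NULL values compare equal, so a constant comparator
    // is a valid total order for a type whose only value is NULL.
    static Comparator<Object> nullSafeOrdering() {
        return (a, b) -> 0;
    }

    public static void main(String[] args) {
        Comparator<Object> ord = nullSafeOrdering();
        System.out.println(ord.compare(null, null)); // 0: sorting succeeds
    }
}
```

The codegen path presumably already short-circuits this case, which is why the repro needs `spark.sql.codegen.factoryMode=NO_CODEGEN`.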
[jira] [Created] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike
XiDuo You created SPARK-49205: - Summary: KeyGroupedPartitioning should inherit HashPartitioningLike Key: SPARK-49205 URL: https://issues.apache.org/jira/browse/SPARK-49205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You
[jira] [Created] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError
XiDuo You created SPARK-49179: - Summary: Fix v2 multi bucketed inner joins throw AssertionError Key: SPARK-49179 URL: https://issues.apache.org/jira/browse/SPARK-49179 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.3, 3.3.4, 3.5.1, 4.0.0 Reporter: XiDuo You {code:java} [info] Cause: java.lang.AssertionError: assertion failed [info] at scala.Predef$.assert(Predef.scala:264) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385) [info] at scala.collection.immutable.List.map(List.scala:247) [info] at scala.collection.immutable.List.map(List.scala:79) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497) [info] at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689) [info] at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51) [info] at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882) [in {code}
[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870374#comment-17870374 ] XiDuo You commented on SPARK-49030: --- [~jira.shegalov] the case you point out makes sense to me; I can reproduce it. > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > Fix For: 3.5.3 > > Attachments: screenshot-1.png > > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], 
PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. > {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > 
[c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_
[jira] [Resolved] (SPARK-49071) Remove ArraySortLike trait
[ https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-49071. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47547 [https://github.com/apache/spark/pull/47547] > Remove ArraySortLike trait > -- > > Key: SPARK-49071 > URL: https://issues.apache.org/jira/browse/SPARK-49071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-49071) Remove ArraySortLike trait
[ https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-49071: - Assignee: XiDuo You > Remove ArraySortLike trait > -- > > Key: SPARK-49071 > URL: https://issues.apache.org/jira/browse/SPARK-49071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available >
[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870322#comment-17870322 ] XiDuo You commented on SPARK-49030: --- > Hi [~ulysses], thank you for having a look. I wonder how you tested it > though. I'm pretty sure that the query plan shown above is the final query > plan. [~jihoonson] That does not seem to be the case... The query plan tree string you put in the JIRA description contains: {code:java} AdaptiveSparkPlan isFinalPlan=false {code} so I think it is actually an initial plan. > I see the same query plan in the Spark web UI after the query completes. The > stage views in the UI show the same query plan. Finally, when I repeatedly > run the query in question, each run returns a different result. I tried Spark 3.5.1 with the following case but cannot reproduce your issue. I always get the same result because the shuffle of the limit is a reused exchange. {code:java} create table t1 using parquet as select /*+ repartition(3) */ rand() as c1, id as c2 from range(1); WITH c AS (SELECT * FROM t1 LIMIT 10) SELECT count(*) FROM c c1, c c2 WHERE c1.c1 > c2.c1 ; {code} !screenshot-1.png! So which Spark version did you test? Have you tried Spark 3.5.1 or 4.0.0? > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > Attachments: screenshot-1.png > > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. 
> {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. 
> {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c
[jira] [Updated] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-49030: -- Attachment: screenshot-1.png > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > Attachments: screenshot-1.png > > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic 
result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. > {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS 
count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14,
[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870144#comment-17870144 ] XiDuo You commented on SPARK-49030: --- [~jihoonson] it seems you are referencing an initial query plan. AQE will reuse the exchange at runtime, so I think the limit operator should be deterministic. For the non-deterministic expression, Spark has already fixed the issue using exchange reuse, see SPARK-36447. Also cc [~yao] > Self join of a CTE seems non-deterministic > -- > > Key: SPARK-49030 > URL: https://issues.apache.org/jira/browse/SPARK-49030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 > Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview. >Reporter: Jihoon Son >Priority: Minor > > {code:java} > WITH c AS (SELECT * FROM customer LIMIT 10) > SELECT count(*) > FROM c c1, c c2 > WHERE c1.c_customer_sk > c2.c_customer_sk{code} > Suppose a self join query on a CTE such as the one above. > Spark generates a physical plan like the one below for this query. 
> {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L]) > +- HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#233L]) > +- Project > +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > > c_customer_sk#214) > :- Filter isnotnull(c_customer_sk#0) > : +- GlobalLimit 10, 0 > : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=256] > : +- LocalLimit 10 > : +- FileScan parquet [c_customer_sk#0] Batched: true, > DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > +- BroadcastExchange IdentityBroadcastMode, [plan_id=263] > +- Filter isnotnull(c_customer_sk#214) > +- GlobalLimit 10, 0 > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, > [plan_id=259] > +- LocalLimit 10 > +- FileScan parquet [c_customer_sk#214] Batched: > true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 > paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > Evaluating this plan produces non-deterministic result because the limit is > independently pushed into the two sides of the join. Each limit can produce > different data, and thus the join can produce results that vary across runs. > I understand that the query in question is not deterministic (and thus not > very practical) as, due to the nature of the limit in distributed engines, it > is not expected to produce the same result anyway across repeated runs. > However, I would still expect that the query plan evaluation remains > deterministic. > Per extended analysis as seen below, it seems that the query plan has changed > at some point during optimization. 
> {code:java} > == Analyzed Logical Plan == > count(1): bigint > WithCTE > :- CTERelationDef 2, false > : +- SubqueryAlias c > : +- GlobalLimit 10 > : +- LocalLimit 10 > : +- Project [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17] > : +- SubqueryAlias customer > : +- View (`customer`, [c_customer_sk#0, c_customer_id#1, > c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, > c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, > c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, > c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, > c_email_address#16, c_last_review_date_sk#17]) > : +- Relation > [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17] > parquet > +- Aggregate [count(1) AS count(1)#194L] > +- Filter (c_customer_sk#0 > c_customer_sk#176) > +- Join Inner > :- SubqueryAlias c1 > : +- SubqueryAlias c > : +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, > c
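The comment above notes that AQE's exchange reuse is what keeps the two CTE branches consistent at runtime. Where that cannot be relied on, one workaround is to materialize the limited rows once before the self join, so both sides read the same snapshot. A sketch, not taken from the ticket (the `c_fixed` name is illustrative):

```sql
-- Materialize the 10 sampled rows a single time, then self-join the cached
-- result; both join sides now scan identical data regardless of plan shape.
CACHE TABLE c_fixed AS SELECT * FROM customer LIMIT 10;

SELECT count(*)
FROM c_fixed c1, c_fixed c2
WHERE c1.c_customer_sk > c2.c_customer_sk;

UNCACHE TABLE c_fixed;
```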
[jira] [Updated] (SPARK-49071) Remove ArraySortLike trait
[ https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-49071: -- Summary: Remove ArraySortLike trait (was: Cleanup SortArray) > Remove ArraySortLike trait > -- > > Key: SPARK-49071 > URL: https://issues.apache.org/jira/browse/SPARK-49071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-49071) Cleanup SortArray
XiDuo You created SPARK-49071: - Summary: Cleanup SortArray Key: SPARK-49071 URL: https://issues.apache.org/jira/browse/SPARK-49071 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You
[jira] [Resolved] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45201. --- Fix Version/s: 3.5.2 Resolution: Fixed > NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0 > > > Key: SPARK-45201 > URL: https://issues.apache.org/jira/browse/SPARK-45201 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 3.5.1 >Reporter: Sebastian Daberdaku >Priority: Major > Fix For: 3.5.2 > > Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch > > > I am trying to compile Spark 3.5.0 and make a distribution that supports > Spark Connect and Kubernetes. The compilation seems to complete correctly, > but when I try to run the Spark Connect server on kubernetes I get a > "NoClassDefFoundError" as follows: > {code:java} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) > at > 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) > at > org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) > at > org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) > at > org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536) > at org.apache.spark.SparkContext.(SparkContext.scala:625) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1099) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.SparkSession$Build
[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868221#comment-17868221 ] XiDuo You commented on SPARK-45201: --- This issue has been fixed by https://github.com/apache/spark/pull/45775 > NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0 > > > Key: SPARK-45201 > URL: https://issues.apache.org/jira/browse/SPARK-45201 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0, 3.5.1 >Reporter: Sebastian Daberdaku >Priority: Major > Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch > > > I am trying to compile Spark 3.5.0 and make a distribution that supports > Spark Connect and Kubernetes. The compilation seems to complete correctly, > but when I try to run the Spark Connect server on kubernetes I get a > "NoClassDefFoundError" as follows: > {code:java} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at java.base/java.lang.ClassLoader.defineClass1(Native Method) > at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017) > at > java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150) > at > java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862) > at > java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681) > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168) > at > org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011) > at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034) > at > 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010) > at > org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146) > at > org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127) > at > org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536) > at org.apache.spark.SparkContext.(SparkContext.scala:625) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1099) > at scala.Option.getOrElse(Option.scala:189) >
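For context on this class of failure: since Guava 27, `InternalFutureFailureAccess` lives in the separate artifact `com.google.guava:failureaccess`, so a build that relocates Guava under `org.sparkproject` must also bundle and relocate that artifact, or the shaded classes fail to link exactly as in the stack trace above. A hedged sketch of the kind of dependency a custom shaded build would need (the version number is illustrative, not taken from the Spark patch):

```xml
<!-- failureaccess carries InternalFutureFailureAccess; it must be shaded
     and relocated alongside guava itself, or the relocated LocalCache
     classes cannot be defined at runtime. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>failureaccess</artifactId>
  <version>1.0.2</version>
</dependency>
```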
[jira] [Updated] (SPARK-41763) Refactor v1 writes
[ https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41763: -- Summary: Refactor v1 writes (was: Improve v1 writes) > Refactor v1 writes > -- > > Key: SPARK-41763 > URL: https://issues.apache.org/jira/browse/SPARK-41763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > This umbrella is used to track all tickets about the changes of v1 writes.
[jira] [Resolved] (SPARK-41763) Improve v1 writes
[ https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-41763. --- Fix Version/s: 3.4.0 Assignee: XiDuo You Resolution: Fixed > Improve v1 writes > - > > Key: SPARK-41763 > URL: https://issues.apache.org/jira/browse/SPARK-41763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > This umbrella is used to track all tickets about the changes of v1 writes.
[jira] [Resolved] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize
[ https://issues.apache.org/jira/browse/SPARK-48880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-48880. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47321 [https://github.com/apache/spark/pull/47321] > Avoid throw NullPointerException if driver plugin fails to initialize > - > > Key: SPARK-48880 > URL: https://issues.apache.org/jira/browse/SPARK-48880 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize
[ https://issues.apache.org/jira/browse/SPARK-48880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-48880: - Assignee: XiDuo You > Avoid throw NullPointerException if driver plugin fails to initialize > - > > Key: SPARK-48880 > URL: https://issues.apache.org/jira/browse/SPARK-48880 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize
XiDuo You created SPARK-48880: - Summary: Avoid throw NullPointerException if driver plugin fails to initialize Key: SPARK-48880 URL: https://issues.apache.org/jira/browse/SPARK-48880 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: XiDuo You
[jira] [Assigned] (SPARK-48817) MultiInsert is split to multiple sql executions, resulting in no exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-48817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-48817: - Assignee: Zhen Wang > MultiInsert is split to multiple sql executions, resulting in no exchange > reuse > --- > > Key: SPARK-48817 > URL: https://issues.apache.org/jira/browse/SPARK-48817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Zhen Wang >Assignee: Zhen Wang >Priority: Major > Labels: pull-request-available > Attachments: image-2024-07-05-14-59-35-340.png, > image-2024-07-05-14-59-55-291.png, image-2024-07-05-15-00-01-805.png, > image-2024-07-05-15-00-09-181.png, image-2024-07-05-15-00-17-693.png, > image-2024-07-05-16-42-01-973.png, image-2024-07-05-16-42-17-817.png, > image-2024-07-05-16-42-27-033.png, image-2024-07-05-16-42-34-738.png, > image-2024-07-05-16-42-46-500.png > > > MultiInsert is split to multiple sql executions, resulting in no exchange > reuse. > > Reproduce sql: > {code:java} > create table wangzhen_t1(c1 int); > create table wangzhen_t2(c1 int); > create table wangzhen_t3(c1 int); > insert into wangzhen_t1 values (1), (2), (3); > from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1) > insert overwrite table wangzhen_t2 select c1 > insert overwrite table wangzhen_t3 select c1; {code} > > In Spark 3.1, there is only one SQL execution and there is a reuse exchange. > !image-2024-07-05-14-59-35-340.png! > > However, in Spark 3.5, it was split to multiple executions and there was no > ReuseExchange. > !image-2024-07-05-16-42-01-973.png! > !image-2024-07-05-16-42-17-817.png!!image-2024-07-05-16-42-34-738.png!!image-2024-07-05-16-42-46-500.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48817) MultiInsert is split to multiple sql executions, resulting in no exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-48817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-48817. --- Fix Version/s: 4.0.0 Resolution: Fixed > MultiInsert is split to multiple sql executions, resulting in no exchange > reuse > --- > > Key: SPARK-48817 > URL: https://issues.apache.org/jira/browse/SPARK-48817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0, 3.5.1 >Reporter: Zhen Wang >Assignee: Zhen Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: image-2024-07-05-14-59-35-340.png, > image-2024-07-05-14-59-55-291.png, image-2024-07-05-15-00-01-805.png, > image-2024-07-05-15-00-09-181.png, image-2024-07-05-15-00-17-693.png, > image-2024-07-05-16-42-01-973.png, image-2024-07-05-16-42-17-817.png, > image-2024-07-05-16-42-27-033.png, image-2024-07-05-16-42-34-738.png, > image-2024-07-05-16-42-46-500.png > > > MultiInsert is split to multiple sql executions, resulting in no exchange > reuse. > > Reproduce sql: > {code:java} > create table wangzhen_t1(c1 int); > create table wangzhen_t2(c1 int); > create table wangzhen_t3(c1 int); > insert into wangzhen_t1 values (1), (2), (3); > from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1) > insert overwrite table wangzhen_t2 select c1 > insert overwrite table wangzhen_t3 select c1; {code} > > In Spark 3.1, there is only one SQL execution and there is a reuse exchange. > !image-2024-07-05-14-59-35-340.png! > > However, in Spark 3.5, it was split to multiple executions and there was no > ReuseExchange. > !image-2024-07-05-16-42-01-973.png! > !image-2024-07-05-16-42-17-817.png!!image-2024-07-05-16-42-34-738.png!!image-2024-07-05-16-42-46-500.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
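A quick way to check whether a given build restores the reuse described above is to inspect the plan of the multi-insert statement itself. A sketch using the ticket's own tables (assuming `EXPLAIN` accepts the multi-insert `FROM` form, as the statement is planned as a single query):

```sql
-- On a fixed build, the physical plan should contain a single Exchange for
-- the REPARTITION hint, with a ReusedExchange on the second insert branch.
EXPLAIN
FROM (SELECT /*+ REPARTITION(3) */ c1 FROM wangzhen_t1)
INSERT OVERWRITE TABLE wangzhen_t2 SELECT c1
INSERT OVERWRITE TABLE wangzhen_t3 SELECT c1;
```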
[jira] [Resolved] (SPARK-48815) Update environment when stopping connect session
[ https://issues.apache.org/jira/browse/SPARK-48815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-48815. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47223 [https://github.com/apache/spark/pull/47223] > Update environment when stopping connect session > --- > > Key: SPARK-48815 > URL: https://issues.apache.org/jira/browse/SPARK-48815 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > We should update the environment if any added files are removed when stopping > connect session.
[jira] [Created] (SPARK-48815) Update environment when stopping connect session
XiDuo You created SPARK-48815: - Summary: Update environment when stopping connect session Key: SPARK-48815 URL: https://issues.apache.org/jira/browse/SPARK-48815 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: XiDuo You We should update the environment if any added files are removed when stopping connect session.
[jira] [Resolved] (SPARK-48168) Add bitwise shifting operators support
[ https://issues.apache.org/jira/browse/SPARK-48168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-48168. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46440 [https://github.com/apache/spark/pull/46440] > Add bitwise shifting operators support > -- > > Key: SPARK-48168 > URL: https://issues.apache.org/jira/browse/SPARK-48168 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
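SPARK-48168 introduces operator syntax for the existing shift functions. A brief sketch of the intended 4.0 behavior, assuming the new operators alias `shiftleft`, `shiftright`, and `shiftrightunsigned` (results follow ordinary two's-complement shift arithmetic):

```sql
SELECT 1 << 3;    -- shiftleft(1, 3)             = 8
SELECT 16 >> 2;   -- shiftright(16, 2)           = 4
SELECT -8 >> 1;   -- arithmetic (sign-extending) = -4
SELECT -1 >>> 28; -- shiftrightunsigned(-1, 28)  = 15
```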
[jira] [Assigned] (SPARK-46090) Support plan fragment level SQL configs in AQE
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-46090: - Assignee: XiDuo You > Support plan fragment level SQL configs in AQE > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs.
[jira] [Resolved] (SPARK-46090) Support plan fragment level SQL configs in AQE
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-46090. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44013 [https://github.com/apache/spark/pull/44013] > Support plan fragment level SQL configs in AQE > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs.
[jira] [Created] (SPARK-48004) Add WriteFilesExecBase trait for v1 write
XiDuo You created SPARK-48004: - Summary: Add WriteFilesExecBase trait for v1 write Key: SPARK-48004 URL: https://issues.apache.org/jira/browse/SPARK-48004 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You
[jira] [Created] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session
XiDuo You created SPARK-47285: - Summary: AdaptiveSparkPlanExec should always use the context.session Key: SPARK-47285 URL: https://issues.apache.org/jira/browse/SPARK-47285 Project: Spark Issue Type: Task Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You
[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-47177: -- Fix Version/s: 3.4.3 > Cached SQL plan do not display final AQE plan in explain string > --- > > Key: SPARK-47177 > URL: https://issues.apache.org/jira/browse/SPARK-47177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Ziqi Liu >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > > AQE plan is expected to display final plan after execution. This is not true > for cached SQL plan: it will show the initial plan instead. This behavior > change is introduced in [https://github.com/apache/spark/pull/40812] it tried > to fix the concurrency issue with cached plan. > *In short, the plan used to executed and the plan used to explain is not the > same instance, thus causing the inconsistency.* > > I don't have a clear idea how yet > * maybe we just a coarse granularity lock in explain? > * make innerChildren a function: clone the initial plan, every time checked > for whether the original AQE plan is finalized (making the final flag atomic > first, of course), if no return the cloned initial plan, if it's finalized, > clone the final plan and return that one. But still this won't be able to > reflect the AQE plan in real time, in a concurrent situation, but at least we > have initial version and final version. 
> > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % > 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), > (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], > functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-47177. --- Fix Version/s: 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 45282 [https://github.com/apache/spark/pull/45282] > Cached SQL plan do not display final AQE plan in explain string > --- > > Key: SPARK-47177 > URL: https://issues.apache.org/jira/browse/SPARK-47177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Ziqi Liu >Priority: Major > Labels: pull-request-available > Fix For: 3.5.2, 4.0.0 > > > The AQE plan is expected to display the final plan after execution. This is not true > for a cached SQL plan: it shows the initial plan instead. This behavior > change was introduced in [https://github.com/apache/spark/pull/40812], which tried > to fix a concurrency issue with the cached plan. > *In short, the plan used to execute and the plan used to explain are not the > same instance, which causes the inconsistency.* > > I don't have a clear idea how to fix this yet: > * maybe we just use a coarse-granularity lock in explain? > * make innerChildren a function: clone the initial plan; every time it is called, > check whether the original AQE plan is finalized (making the final flag atomic > first, of course); if not, return the cloned initial plan, and if it is finalized, > clone the final plan and return that one. This still won't be able to > reflect the AQE plan in real time in a concurrent situation, but at least we > have the initial version and the final version. 
> > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % > 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), > (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], > functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-47177: - Assignee: XiDuo You > Cached SQL plan do not display final AQE plan in explain string > --- > > Key: SPARK-47177 > URL: https://issues.apache.org/jira/browse/SPARK-47177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Ziqi Liu >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > > The AQE plan is expected to display the final plan after execution. This is not true > for a cached SQL plan: it shows the initial plan instead. This behavior > change was introduced in [https://github.com/apache/spark/pull/40812], which tried > to fix a concurrency issue with the cached plan. > *In short, the plan used to execute and the plan used to explain are not the > same instance, which causes the inconsistency.* > > I don't have a clear idea how to fix this yet: > * maybe we just use a coarse-granularity lock in explain? > * make innerChildren a function: clone the initial plan; every time it is called, > check whether the original AQE plan is finalized (making the final flag atomic > first, of course); if not, return the cloned initial plan, and if it is finalized, > clone the final plan and return that one. This still won't be able to > reflect the AQE plan in real time in a concurrent situation, but at least we > have the initial version and the final version. 
> > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % > 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), > (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], > functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47187) Fix hive compress output config does not work
[ https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-47187. --- Fix Version/s: 3.4.3 Resolution: Fixed Issue resolved by pull request 45286 [https://github.com/apache/spark/pull/45286] > Fix hive compress output config does not work > - > > Key: SPARK-47187 > URL: https://issues.apache.org/jira/browse/SPARK-47187 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.2 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47187) Fix hive compress output config does not work
[ https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-47187: - Assignee: XiDuo You > Fix hive compress output config does not work > - > > Key: SPARK-47187 > URL: https://issues.apache.org/jira/browse/SPARK-47187 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.2 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47187) Fix hive compress output config does not work
XiDuo You created SPARK-47187: - Summary: Fix hive compress output config does not work Key: SPARK-47187 URL: https://issues.apache.org/jira/browse/SPARK-47187 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.4.2 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46756) Add rule to rewrite null safe equality join keys
XiDuo You created SPARK-46756: - Summary: Add rule to rewrite null safe equality join keys Key: SPARK-46756 URL: https://issues.apache.org/jira/browse/SPARK-46756 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46480) Fix NPE when table cache task attempt
[ https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46480: -- Fix Version/s: 3.5.1 > Fix NPE when table cache task attempt > - > > Key: SPARK-46480 > URL: https://issues.apache.org/jira/browse/SPARK-46480 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46480) Fix NPE when table cache task attempt
[ https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46480: -- Summary: Fix NPE when table cache task attempt (was: Fix NPE when table cache task do attempt) > Fix NPE when table cache task attempt > - > > Key: SPARK-46480 > URL: https://issues.apache.org/jira/browse/SPARK-46480 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46480) Fix NPE when table cache task do attempt
[ https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46480: -- Component/s: Spark Core > Fix NPE when table cache task do attempt > > > Key: SPARK-46480 > URL: https://issues.apache.org/jira/browse/SPARK-46480 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46480) Fix NPE when table cache task do attempt
[ https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46480: -- Issue Type: Bug (was: Improvement) > Fix NPE when table cache task do attempt > > > Key: SPARK-46480 > URL: https://issues.apache.org/jira/browse/SPARK-46480 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46480) Fix NPE when table cache task do attempt
XiDuo You created SPARK-46480: - Summary: Fix NPE when table cache task do attempt Key: SPARK-46480 URL: https://issues.apache.org/jira/browse/SPARK-46480 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46227) Move `withSQLConf` from SQLHelper trait to `SQLConfHelper` trait
XiDuo You created SPARK-46227: - Summary: Move `withSQLConf` from SQLHelper trait to `SQLConfHelper` trait Key: SPARK-46227 URL: https://issues.apache.org/jira/browse/SPARK-46227 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-46170: - Assignee: XiDuo You > Support inject adaptive query post planner strategy rules in > SparkSessionExtensions > --- > > Key: SPARK-46170 > URL: https://issues.apache.org/jira/browse/SPARK-46170 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-46170. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44074 [https://github.com/apache/spark/pull/44074] > Support inject adaptive query post planner strategy rules in > SparkSessionExtensions > --- > > Key: SPARK-46170 > URL: https://issues.apache.org/jira/browse/SPARK-46170 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions
XiDuo You created SPARK-46170: - Summary: Support inject adaptive query post planner strategy rules in SparkSessionExtensions Key: SPARK-46170 URL: https://issues.apache.org/jira/browse/SPARK-46170 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
[ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789908#comment-17789908 ] XiDuo You commented on SPARK-46105: --- Please see SPARK-39915 > df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above > --- > > Key: SPARK-46105 > URL: https://issues.apache.org/jira/browse/SPARK-46105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.3 > Environment: EKS > EMR >Reporter: dharani_sugumar >Priority: Major > Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png > > > Version: 3.3.3 > > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > res2: Boolean = true > > Version: 3.2.4 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 0 > scala> df.repartition(1).rdd.isEmpty() > res2: Boolean = true > > Version: 3.5.0 > scala> val df = spark.emptyDataFrame > df: org.apache.spark.sql.DataFrame = [] > scala> df.rdd.getNumPartitions > res0: Int = 0 > scala> df.repartition(1).rdd.getNumPartitions > res1: Int = 1 > scala> df.repartition(1).rdd.isEmpty() > res2: Boolean = true > > When we do repartition of 1 on an empty dataframe, the resultant partition is > 1 in version 3.3.x and 3.5.x, whereas when I do the same in version 3.2.x, the > resultant partition is 0. 
May I know why this behaviour changed from 3.2.x > to higher versions? > > The reason for raising this as a bug is that I have a scenario where my final > dataframe returns 0 records in EKS (local Spark) with a single node (driver and > executor on the same node) but it returns 1 in EMR; both use the same Spark > version, 3.3.3. I'm not sure why this behaves differently in the two > environments. As an interim solution, I had to repartition an empty dataframe > if my final dataframe is empty, which returns 1 for 3.3.3. I would like to know > if this is really a bug, or whether this behaviour exists in future versions and > cannot be changed? > > Because, if we go for a Spark upgrade and this behaviour changes, we will > face the issue again. > Please confirm. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs in AQE
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46090: -- Summary: Support plan fragment level SQL configs in AQE (was: Support plan fragment level SQL configs) > Support plan fragment level SQL configs in AQE > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46090: -- Summary: Support plan fragment level SQL configs (was: Support stage level SQL configs) > Support plan fragment level SQL configs > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > > AQE executes query plan stage by stage, so there is a chance to support stage > level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs
[ https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-46090: -- Description: AQE executes query plan stage by stage, so there is a chance to support plan fragment level SQL configs. (was: AQE executes query plan stage by stage, so there is a chance to support stage level SQL configs.) > Support plan fragment level SQL configs > --- > > Key: SPARK-46090 > URL: https://issues.apache.org/jira/browse/SPARK-46090 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Major > > AQE executes query plan stage by stage, so there is a chance to support plan > fragment level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46090) Support stage level SQL configs
XiDuo You created SPARK-46090: - Summary: Support stage level SQL configs Key: SPARK-46090 URL: https://issues.apache.org/jira/browse/SPARK-46090 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You AQE executes query plan stage by stage, so there is a chance to support stage level SQL configs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
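The idea behind plan fragment level configs can be sketched as an overlay of per-fragment settings on top of the session configuration: AQE executes the plan stage by stage, so each stage could carry its own overrides. This is an illustrative model only; `session_conf`, `effective_conf`, and the config values are assumptions, not Spark APIs.

```python
from collections import ChainMap

# Session-wide SQL configs (hypothetical values for illustration).
session_conf = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
}

def effective_conf(fragment_overrides):
    """Configs seen by one plan fragment (query stage): fragment-level
    overrides win, everything else falls through to the session conf."""
    return ChainMap(fragment_overrides, session_conf)

# One stage shrinks its shuffle partitions; another keeps the defaults.
stage1 = effective_conf({"spark.sql.shuffle.partitions": "50"})
stage2 = effective_conf({})
```

The appeal of the overlay design is that a fragment override never mutates the session conf, so concurrently running stages cannot see each other's settings.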
[jira] [Updated] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning
[ https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-45882: -- Fix Version/s: 3.4.2 4.0.0 > BroadcastHashJoinExec propagate partitioning should respect > CoalescedHashPartitioning > - > > Key: SPARK-45882 > URL: https://issues.apache.org/jira/browse/SPARK-45882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning
[ https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45882. --- Fix Version/s: 3.5.1 Resolution: Fixed Issue resolved by pull request 43792 [https://github.com/apache/spark/pull/43792] > BroadcastHashJoinExec propagate partitioning should respect > CoalescedHashPartitioning > - > > Key: SPARK-45882 > URL: https://issues.apache.org/jira/browse/SPARK-45882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning
[ https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-45882: - Assignee: XiDuo You > BroadcastHashJoinExec propagate partitioning should respect > CoalescedHashPartitioning > - > > Key: SPARK-45882 > URL: https://issues.apache.org/jira/browse/SPARK-45882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning
XiDuo You created SPARK-45882: - Summary: BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning Key: SPARK-45882 URL: https://issues.apache.org/jira/browse/SPARK-45882 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan
[ https://issues.apache.org/jira/browse/SPARK-34444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-34444. --- Fix Version/s: 4.0.0 Resolution: Fixed > Pushdown scalar-subquery filter to FileSourceScan > - > > Key: SPARK-34444 > URL: https://issues.apache.org/jira/browse/SPARK-34444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > > We can pushdown {{a < (select max(d) from t2)}} to FileSourceScan: > {code:scala} > sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM > range(5L)") > sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)") > sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45740) Relax the node prefix of SparkPlanGraphCluster
[ https://issues.apache.org/jira/browse/SPARK-45740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-45740: -- Summary: Relax the node prefix of SparkPlanGraphCluster (was: Release the node prefix of SparkPlanGraphCluster) > Relax the node prefix of SparkPlanGraphCluster > -- > > Key: SPARK-45740 > URL: https://issues.apache.org/jira/browse/SPARK-45740 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45740) Release the node prefix of SparkPlanGraphCluster
XiDuo You created SPARK-45740: - Summary: Release the node prefix of SparkPlanGraphCluster Key: SPARK-45740 URL: https://issues.apache.org/jira/browse/SPARK-45740 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45705) Fix flaky test: Status of a failed DDL/DML with no jobs should be FAILED
[ https://issues.apache.org/jira/browse/SPARK-45705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45705. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43554 [https://github.com/apache/spark/pull/43554] > Fix flaky test: Status of a failed DDL/DML with no jobs should be FAILED > - > > Key: SPARK-45705 > URL: https://issues.apache.org/jira/browse/SPARK-45705 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-45632: - Assignee: XiDuo You > Table cache should avoid unnecessary ColumnarToRow when enable AQE > -- > > Key: SPARK-45632 > URL: https://issues.apache.org/jira/browse/SPARK-45632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > > If the cache serializer supports columnar input, then we do not need a > ColumnarToRow before caching data. This PR improves the optimization with AQE > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45632. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43484 [https://github.com/apache/spark/pull/43484] > Table cache should avoid unnecessary ColumnarToRow when enable AQE > -- > > Key: SPARK-45632 > URL: https://issues.apache.org/jira/browse/SPARK-45632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > If the cache serializer supports columnar input, then we do not need a > ColumnarToRow before caching data. This PR improves the optimization with AQE > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE
XiDuo You created SPARK-45632: - Summary: Table cache should avoid unnecessary ColumnarToRow when enable AQE Key: SPARK-45632 URL: https://issues.apache.org/jira/browse/SPARK-45632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You If the cache serializer supports columnar input, then we do not need a ColumnarToRow before caching data. This PR improves the optimization with AQE enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
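The decision described above (skip the conversion when the cache serializer can consume columnar batches directly) can be sketched as a tiny planning function. Everything here is an illustrative model: `supports_columnar_input`, `plan_cache_input`, and the operator names are assumptions standing in for Spark internals.

```python
class RowSerializer:
    """Cache serializer that only accepts row input."""
    supports_columnar_input = False

class ColumnarSerializer:
    """Cache serializer that can consume columnar batches directly."""
    supports_columnar_input = True

def plan_cache_input(serializer, plan_output_is_columnar):
    """Return the operators placed in front of the cache. A ColumnarToRow
    conversion is only needed when the plan produces columnar batches but
    the cache serializer cannot consume them."""
    if plan_output_is_columnar and not serializer.supports_columnar_input:
        return ["ColumnarToRow", "Cache"]
    return ["Cache"]

columnar_plan = plan_cache_input(ColumnarSerializer(), True)   # no conversion
row_only_plan = plan_cache_input(RowSerializer(), True)        # conversion kept
```

The point of the optimization is the first case: when both sides speak columnar, inserting ColumnarToRow only to re-encode rows for the cache is wasted work.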
[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773507#comment-17773507 ] XiDuo You commented on SPARK-45443: --- > But this idea only works for one query Please see the following: `say, if there are multi-queries which reference the same cached RDD (e.g., in thriftserver)`. There are race conditions if multiple queries reference and materialize the same cached RDD. They are in different query executions and different threads. > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning It is unrelated to TableCacheQueryStage. This config is used to make AQE work for the cached plan. > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into the remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting a job per TableCacheQueryStage > instance. For example, if there are 2 TableCacheQueryStage instances > referencing the same IMR instance (cached RDD) and the first InMemoryTableScanExec's > materialization takes longer, the following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when the cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. 
Thanks in advance. > cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) > .count().as("count") > df.persist() > val df2 = df.sort("count").filter('count <= 2) > val df3 = df.sort("count").filter('count >= 3) > val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter") > df4.show() {code} > *Physical Plan:* > {code:java} > == Physical Plan == > AdaptiveSparkPlan (31) > +- == Final Plan == > CollectLimit (21) > +- * Project (20) > +- * SortMergeJoin FullOuter (19) > :- * Sort (10) > : +- * Filter (9) > : +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, > rowCount=5) > : +- InMemoryTableScan (1) > : +- InMemoryRelation (2) > : +- AdaptiveSparkPlan (7) > : +- HashAggregate (6) > : +- Exchange (5) > : +- HashAggregate (4) > : +- LocalTableScan (3) > +- * Sort (18) > +- * Filter (17) > +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, > rowCount=5) > +- InMemoryTableScan (11) > +- InMemoryRelation (12) > +- AdaptiveSparkPlan (15) > +- HashAggregate (14) > +- Exchange (13) > +- HashAggregate (4) > +- LocalTableScan (3) {code} > *Stages DAGs materializing the same IMR instance:* > !IMR Materialization - Stage 2.png|width=303,height=134! > !IMR Materialization - Stage 3.png|width=303,height=134! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773074#comment-17773074 ] XiDuo You commented on SPARK-45443: --- > Can this increase probability of concurrent IMR materialization for same IMR > instance? I think they are the same. TableCacheQueryStage is more like a barrier that reports some metrics to the AQE framework. The gap introduced by materializing `eagerly` is very small. > For queries using AQE, can introducing TableCacheQueryStage into physical > plan once per unique IMR instance help I do not see a difference. I think one idea is that we can introduce something like `ReusedTableCacheQueryStage`, which only holds an empty future that waits for the first TableCacheQueryStage materialization, so that we can make sure the cached RDD is only executed once. But this idea only works for one query; say, if there are multiple queries which reference the same cached RDD (e.g., in Thrift Server), the issue still exists. > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage > instance. 
For example, if there are 2 TableCacheQueryStage instances > referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s > materialization takes longer, following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. > cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) > .count().as("count") > df.persist() > val df2 = df.sort("count").filter('count <= 2) > val df3 = df.sort("count").filter('count >= 3) > val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter") > df4.show() {code} > *Physical Plan:* > {code:java} > == Physical Plan == > AdaptiveSparkPlan (31) > +- == Final Plan == > CollectLimit (21) > +- * Project (20) > +- * SortMergeJoin FullOuter (19) > :- * Sort (10) > : +- * Filter (9) > : +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, > rowCount=5) > : +- InMemoryTableScan (1) > : +- InMemoryRelation (2) > : +- AdaptiveSparkPlan (7) > : +- HashAggregate (6) > : +- Exchange (5) > : +- HashAggregate (4) > : +- LocalTableScan (3) > +- * Sort (18) > +- * Filter (17) > +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, > rowCount=5) > +- InMemoryTableScan (11) > +- InMemoryRelation (12) > +- AdaptiveSparkPlan (15) > +- HashAggregate (14) > +- Exchange (13) > +- HashAggregate (4) > +- LocalTableScan (3) {code} > *Stages DAGs materializing the same IMR 
instance:* > !IMR Materialization - Stage 2.png|width=303,height=134! > !IMR Materialization - Stage 3.png|width=303,height=134! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
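The `ReusedTableCacheQueryStage` idea from the comment above can be sketched in plain Scala: all stages that reference the same cached plan share one materialization future, so the work runs at most once. This is a hypothetical illustration, not Spark's actual query-stage code; the key type and the `Long` result are stand-ins for the cached-plan identity and its runtime stats.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Sketch: concurrent callers asking to materialize the same cached plan
// all receive the same Future, so `compute` executes at most once.
object SharedCacheMaterialization {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Keyed by a stand-in for the cached relation's identity.
  private val inFlight = new ConcurrentHashMap[String, Future[Long]]()

  def materialize(cacheKey: String)(compute: () => Long): Future[Long] =
    // computeIfAbsent is atomic, so only the first caller creates the Future;
    // later callers just wait on it (the "empty future" from the comment).
    inFlight.computeIfAbsent(cacheKey, _ => Future(compute()))
}
```

As the comment notes, a registry like this only helps callers that share it; independent query executions in a multi-query setting (e.g., Thrift Server) would need the registry scoped to the shared cached relation rather than to one query.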
[jira] [Created] (SPARK-45451) Make the default storage level of dataset cache configurable
XiDuo You created SPARK-45451: - Summary: Make the default storage level of dataset cache configurable Key: SPARK-45451 URL: https://issues.apache.org/jira/browse/SPARK-45451 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
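For context on SPARK-45451: today the storage level used by `Dataset.cache()`/`persist()` can only be changed per call, which is what makes a session-wide default attractive. A minimal sketch follows; `df` stands for any Dataset, and the config name shown is an assumption, since the ticket text does not name it.

```scala
import org.apache.spark.storage.StorageLevel

// Current behavior: the only way to deviate from the built-in default
// (MEMORY_AND_DISK) is to pass a storage level explicitly on every call.
df.persist(StorageLevel.MEMORY_ONLY)

// What the ticket asks for, roughly (config name is an assumption):
// spark.conf.set("spark.sql.defaultCacheStorageLevel", "MEMORY_ONLY")
// df.persist()  // would then pick up the configured default
```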
[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization
[ https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772729#comment-17772729 ] XiDuo You commented on SPARK-45443: --- hi [~erenavsarogullari], it seems that it depends on the behavior of the RDD cache. Say, what happens if we materialize a cached RDD twice at the same time? There is a race condition in the block manager per RDD partition, so it makes things slow. BTW, what was the behavior before we had TableCacheQueryStage? Did it not have this issue? > Revisit TableCacheQueryStage to avoid replicated InMemoryRelation > materialization > - > > Key: SPARK-45443 > URL: https://issues.apache.org/jira/browse/SPARK-45443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Eren Avsarogullari >Priority: Major > Attachments: IMR Materialization - Stage 2.png, IMR Materialization - > Stage 3.png > > > TableCacheQueryStage is created per InMemoryTableScanExec by > AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output > (cached RDD) to provide runtime stats in order to apply AQE optimizations > into remaining physical plan stages. TableCacheQueryStage materializes > InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage > instance. For example, if there are 2 TableCacheQueryStage instances > referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s > materialization takes longer, following logic will return false > (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR > materialization. This behavior can be more visible when cached RDD size is > high. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281] > Would like to get community feedback. Thanks in advance. 
> cc [~ulysses] [~cloud_fan] > *Sample Query to simulate the problem:* > // Both join legs uses same IMR instance > {code:java} > import spark.implicits._ > val arr = (1 to 12).map { i => { > val index = i % 5 > (index, s"Employee_$index", s"Department_$index") > } > } > val df = arr.toDF("id", "name", "department") > .filter('id >= 0) > .sort("id") > .groupBy('id, 'name, 'department) > .count().as("count") > df.persist() > val df2 = df.sort("count").filter('count <= 2) > val df3 = df.sort("count").filter('count >= 3) > val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter") > df4.show() {code} > *Physical Plan:* > {code:java} > == Physical Plan == > AdaptiveSparkPlan (31) > +- == Final Plan == > CollectLimit (21) > +- * Project (20) > +- * SortMergeJoin FullOuter (19) > :- * Sort (10) > : +- * Filter (9) > : +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, > rowCount=5) > : +- InMemoryTableScan (1) > : +- InMemoryRelation (2) > : +- AdaptiveSparkPlan (7) > : +- HashAggregate (6) > : +- Exchange (5) > : +- HashAggregate (4) > : +- LocalTableScan (3) > +- * Sort (18) > +- * Filter (17) > +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, > rowCount=5) > +- InMemoryTableScan (11) > +- InMemoryRelation (12) > +- AdaptiveSparkPlan (15) > +- HashAggregate (14) > +- Exchange (13) > +- HashAggregate (4) > +- LocalTableScan (3) {code} > *Stages DAGs materializing the same IMR instance:* > !IMR Materialization - Stage 2.png|width=303,height=134! > !IMR Materialization - Stage 3.png|width=303,height=134! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769383#comment-17769383 ] XiDuo You commented on SPARK-45282: --- I cannot reproduce this issue on the master branch (4.0.0). [~koert], have you tried the master branch? > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Major > Labels: CorrectnessBug, correctness > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however i did get it to show up on hadoop cluster, kubernetes, and on > databricks 13.3 > the issue is that records are dropped when two cached dataframes are joined. > it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an > optimization while in spark 3.3.1 these Exchanges are still present. it seems > to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. > to reproduce on distributed cluster these settings needed are: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > code using scala to reproduce is: > {code:java} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 100).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! 
> println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() > ) > val left1 = left > .toDF("key", "value1") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "value2") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > this produces the following output: > {code:java} > number of left 100 > number of right 100 > number of (left join right) 100 > number of left1 100 > number of right1 100 > number of (left1 join right1) 859531 {code} > note that the last number (the incorrect one) actually varies depending on > settings and cluster size etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45244) Correct spelling in VolcanoTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-45244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-45244: - Assignee: Binjie Yang > Correct spelling in VolcanoTestsSuite > - > > Key: SPARK-45244 > URL: https://issues.apache.org/jira/browse/SPARK-45244 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.5.0 >Reporter: Binjie Yang >Assignee: Binjie Yang >Priority: Minor > Labels: pull-request-available > > Typo in method naming checkAnnotaion -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45244) Correct spelling in VolcanoTestsSuite
[ https://issues.apache.org/jira/browse/SPARK-45244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45244. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43026 [https://github.com/apache/spark/pull/43026] > Correct spelling in VolcanoTestsSuite > - > > Key: SPARK-45244 > URL: https://issues.apache.org/jira/browse/SPARK-45244 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.5.0 >Reporter: Binjie Yang >Assignee: Binjie Yang >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Typo in method naming checkAnnotaion -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45191) InMemoryTableScanExec simpleStringWithNodeId adds columnar info
[ https://issues.apache.org/jira/browse/SPARK-45191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-45191. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42967 [https://github.com/apache/spark/pull/42967] > InMemoryTableScanExec simpleStringWithNodeId adds columnar info > --- > > Key: SPARK-45191 > URL: https://issues.apache.org/jira/browse/SPARK-45191 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: XiDuo You >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > InMemoryTableScanExec supports both row-based and columnar input and output, > which is based on the cache serializer. It would be friendlier for users if > we provided the columnar info to show whether it is columnar in/out. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45191) InMemoryTableScanExec simpleStringWithNodeId adds columnar info
XiDuo You created SPARK-45191: - Summary: InMemoryTableScanExec simpleStringWithNodeId adds columnar info Key: SPARK-45191 URL: https://issues.apache.org/jira/browse/SPARK-45191 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: XiDuo You InMemoryTableScanExec supports both row-based and columnar input and output, which is based on the cache serializer. It would be friendlier for users if we provided the columnar info to show whether it is columnar in/out. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749421#comment-17749421 ] XiDuo You commented on SPARK-44598: --- please try `--conf spark.hadoopRDD.ignoreEmptySplits=false` > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We use spark to read a hive table with hbase serde. We found that when the > hbase table data is relatively small (hbase StorefileSize is 0), the data > read by spark 3.2 or 3.5 is empty, and there is no error message. > But when using spark 2.4 or hive to read, the data can be read normally. Other > information shows that spark 3.1 can also read data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
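The workaround suggested in the comment can also be applied from application code by setting the config before the session is created, since `spark.hadoopRDD.ignoreEmptySplits` affects how HadoopRDD plans input splits. A sketch follows; the table name is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Keep empty HadoopRDD splits (Spark 3.2+ prunes them by default), so the
// HBase serde input format is still invoked when StorefileSize is 0.
val spark = SparkSession.builder()
  .appName("hbase-serde-read")
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM hbase_backed_table LIMIT 10").show() // hypothetical table
```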
[jira] [Assigned] (SPARK-44579) Support Interrupt On Cancel in SQLExecution
[ https://issues.apache.org/jira/browse/SPARK-44579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-44579: - Assignee: Kent Yao > Support Interrupt On Cancel in SQLExecution > --- > > Key: SPARK-44579 > URL: https://issues.apache.org/jira/browse/SPARK-44579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0 > > > Currently, we support interrupting task threads for users by 1) APIs of the > spark core module, 2) a thrift config for the SQL module. Other Spark SQL > Apps are limited to use this functionality. Specifically, the built-in > spark-sql-shell lacks a user-controlled knob for interrupting task threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44579) Support Interrupt On Cancel in SQLExecution
[ https://issues.apache.org/jira/browse/SPARK-44579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-44579. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42199 [https://github.com/apache/spark/pull/42199] > Support Interrupt On Cancel in SQLExecution > --- > > Key: SPARK-44579 > URL: https://issues.apache.org/jira/browse/SPARK-44579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Kent Yao >Priority: Major > Fix For: 4.0.0 > > > Currently, we support interrupting task threads for users by 1) APIs of the > spark core module, 2) a thrift config for the SQL module. Other Spark SQL > Apps are limited to use this functionality. Specifically, the built-in > spark-sql-shell lacks a user-controlled knob for interrupting task threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
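For context on SPARK-44579: the core-module API the description refers to already allows opting into interruption per job group. A minimal sketch, assuming `spark` is an active SparkSession; the group name and query are placeholders.

```scala
// Opt in to interrupting running task threads (third argument) when this
// job group is cancelled, instead of only setting the kill flag.
spark.sparkContext.setJobGroup("adhoc-1", "ad-hoc query", interruptOnCancel = true)
try {
  spark.sql("SELECT count(*) FROM range(10)").collect()
} finally {
  spark.sparkContext.clearJobGroup()
}

// From another thread, this would now interrupt the group's running tasks:
// spark.sparkContext.cancelJobGroup("adhoc-1")
```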
[jira] [Resolved] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You resolved SPARK-43402. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 41088 [https://github.com/apache/spark/pull/41088] > FileSourceScanExec supports push down data filter with scalar subquery > -- > > Key: SPARK-43402 > URL: https://issues.apache.org/jira/browse/SPARK-43402 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 4.0.0 > > > Scalar subquery can be pushed down as data filter at runtime, since we always > execute subquery first. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-43402: - Assignee: XiDuo You > FileSourceScanExec supports push down data filter with scalar subquery > -- > > Key: SPARK-43402 > URL: https://issues.apache.org/jira/browse/SPARK-43402 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > > Scalar subquery can be pushed down as data filter at runtime, since we always > execute subquery first. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43777) Coalescing partitions in AQE returns different results with row_number windows.
[ https://issues.apache.org/jira/browse/SPARK-43777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726025#comment-17726025 ] XiDuo You commented on SPARK-43777: --- It actually is a random event that all the row counts of id 2 are 1, so sorting by id takes no effect. The ordering of the shuffle result is not deterministic, and neither is AQE. If you really need a deterministic result, change the window spec to `orderBy($"count".desc, $"place")` > Coalescing partitions in AQE returns different results with row_number > windows. > > > Key: SPARK-43777 > URL: https://issues.apache.org/jira/browse/SPARK-43777 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.3.2 > Environment: SBT based Spark project with unit tests running on > spark-testing-base. Tested with Spark 3.1.3 and 3.3.2 >Reporter: Kristopher Kane >Priority: Major > > While updating our code base from 3.1 to 3.3, I had a test fail due to wrong > results. With 3.1, we did not proactively turn on AQE in sbt based tests and > noticed the failure due to AQE enabled by default between 3.1 and 3.3 > An easily reproducible test: > {code:java} > val testDataDf = Seq( > (1, 1, 0, 0), > (1, 1, 0, 0), > (1, 1, 0, 0), > (1, 0, 0, 1), > (1, 0, 0, 1), > (2, 0, 0, 0), > (2, 0, 1, 0), > (2, 1, 0, 0), > (3, 0, 1, 0), > (3, 0, 1, 0), > ).toDF("id", "is_attribute1", "is_attribute2", "is_attribute3") > val placeWindowSpec = Window > .partitionBy("id") > .orderBy($"count".desc) > val resultDf: DataFrame = testDataDf > .select("id", "is_attribute1", "is_attribute2", "is_attribute3") > .withColumn( > "place", > when($"is_attribute1" === 1, "attribute1") > .when($"is_attribute2" === 1, "attribute2") > .when($"is_attribute3" === 1, "attribute3") > .otherwise("other") > ) > .groupBy("id", "place") > .agg( > functions.count("*").as("count") > ) > .withColumn( > "rank", > row_number().over(placeWindowSpec) > ) > resultDf.orderBy("id", "place", "rank").show() {code} > > Various results 
based on Spark version and AQE settings: > {code:java} > Spark 3.1 > Without AQE > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 2| > | 2|attribute2| 1| 1| > | 2| other| 1| 3| > | 3|attribute2| 2| 1| > +---+--+-++ > AQE with defaults > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 2| > | 2|attribute2| 1| 1| > | 2| other| 1| 3| > | 3|attribute2| 2| 1| > +---+--+-++ > AQE with .set("spark.sql.adaptive.coalescePartitions.enabled", "false") > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 2| > | 2|attribute2| 1| 1| > | 2| other| 1| 3| > | 3|attribute2| 2| 1| > +---+--+-++ > AQE with .set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", > "1") - Like Spark 3.3 with AQE defaults > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 3| > | 2|attribute2| 1| 2| > | 2| other| 1| 1| > | 3|attribute2| 2| 1| > > Spark 3.3.2 > > AQE with defaults > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 3| > | 2|attribute2| 1| 2| > | 2| other| 1| 1| > | 3|attribute2| 2| 1| > +---+--+-++ > > AQE with .set("spark.sql.adaptive.coalescePartitions.enabled", "false") - > This matches Spark 3.1 > +---+--+-++ > | id| place|count|rank| > +---+--+-++ > | 1|attribute1| 3| 1| > | 1|attribute3| 2| 2| > | 2|attribute1| 1| 2| > | 2|attribute2| 1| 1| > | 2| other| 1| 3| > | 3|attribute2| 2| 1| > +---+--+-++ {code} > As you can see, the 'rank' column of row_number(partition by, order by) > returns a different rank for id value 2's three attributes based on how AQE > coalesces partitions. -- This message was sent by Atla
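The fix suggested in the comment is to make the window ordering a total order by adding a tie-breaker column, so ranks no longer depend on shuffle or AQE partition layout. Applied to the reporter's code above (assumes the usual `spark.implicits._` and `Window` imports are in scope):

```scala
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // assumes `spark` is the active SparkSession

// Deterministic: ties on `count` are broken by `place`, so row_number()
// returns the same rank regardless of how partitions are coalesced.
val placeWindowSpec = Window
  .partitionBy("id")
  .orderBy($"count".desc, $"place")
```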
[jira] [Created] (SPARK-43420) Make DisableUnnecessaryBucketedScan smart with table cache
XiDuo You created SPARK-43420: - Summary: Make DisableUnnecessaryBucketedScan smart with table cache Key: SPARK-43420 URL: https://issues.apache.org/jira/browse/SPARK-43420 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If a bucket scan has no interesting partition or contains a shuffle exchange, then we would disable it. But if the bucket scan is inside a table cache, the cached plan would be accessed multiple times, so we should not disable it, as it preserves the output partitioning and is more likely to be reused. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery
XiDuo You created SPARK-43402: - Summary: FileSourceScanExec supports push down data filter with scalar subquery Key: SPARK-43402 URL: https://issues.apache.org/jira/browse/SPARK-43402 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You Scalar subquery can be pushed down as data filter at runtime, since we always execute subquery first. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
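A hedged example of the query shape SPARK-43402 targets (table and column names are made up): the scalar subquery yields a single value before the main scan runs, so at runtime its result can be pushed into the file source scan as a data filter instead of being applied row by row afterward.

```scala
// Assumes `spark` is an active SparkSession and `sales`/`sales_history`
// are file-source (e.g., Parquet) tables; names are illustrative only.
val filtered = spark.sql(
  """
    |SELECT id, amount
    |FROM sales
    |WHERE amount > (SELECT avg(amount) FROM sales_history)
  """.stripMargin)
```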
[jira] [Updated] (SPARK-43377) Enable spark.sql.thriftServer.interruptOnCancel by default
[ https://issues.apache.org/jira/browse/SPARK-43377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43377: -- Summary: Enable spark.sql.thriftServer.interruptOnCancel by default (was: Enable spark.sql.thriftServer.interruptOnCancel by defauly) > Enable spark.sql.thriftServer.interruptOnCancel by default > -- > > Key: SPARK-43377 > URL: https://issues.apache.org/jira/browse/SPARK-43377 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43377) Enable spark.sql.thriftServer.interruptOnCancel by defauly
XiDuo You created SPARK-43377: - Summary: Enable spark.sql.thriftServer.interruptOnCancel by defauly Key: SPARK-43377 URL: https://issues.apache.org/jira/browse/SPARK-43377 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43376) Improve reuse subquery with table cache
XiDuo You created SPARK-43376: - Summary: Improve reuse subquery with table cache Key: SPARK-43376 URL: https://issues.apache.org/jira/browse/SPARK-43376 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You AQE can not reuse subquery if it is pushed into InMemoryTableScan. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43317) Support combine adjacent aggregation
XiDuo You created SPARK-43317: - Summary: Support combine adjacent aggregation Key: SPARK-43317 URL: https://issues.apache.org/jira/browse/SPARK-43317 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If there are adjacent aggregations with Partial and Final modes, we can combine them into Complete mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43281) Fix concurrent writer does not update file metrics
XiDuo You created SPARK-43281: - Summary: Fix concurrent writer does not update file metrics Key: SPARK-43281 URL: https://issues.apache.org/jira/browse/SPARK-43281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You The concurrent writer uses the temp file path to get the file status after the task commit. However, the temp file has already been moved to its new path during the task commit. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Description: The `ObjectHashAggregateExec` has three performance issues: - heavy overhead of scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after fallback to sort based aggregator - the aggregation buffer in sort based aggregator is not reused for all the remaining rows was: The `ObjectHashAggregateExec` has three performance issues: - heavy overhead of scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after fallback to sort based aggregator - the aggregation buffer in sort based aggregator is not actually reused > Improve ObjectHashAggregateExec performance for high cardinality > > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has three performance issues: > - heavy overhead of scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after fallback to sort based > aggregator > - the aggregation buffer in sort based aggregator is not reused for all the > remaining rows > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Summary: Improve ObjectHashAggregateExec performance for high cardinality (was: Improve ObjectHashAggregateExec performance with high cardinality) > Improve ObjectHashAggregateExec performance for high cardinality > > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has three performance issues: > - heavy overhead of Scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after falling back to the sort-based > aggregator > - the aggregation buffer in the sort-based aggregator is not actually reused >
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance with high cardinality
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Summary: Improve ObjectHashAggregateExec performance with high cardinality (was: Improve ObjectHashAggregateExec performance) > Improve ObjectHashAggregateExec performance with high cardinality > - > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has three performance issues: > - heavy overhead of Scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after falling back to the sort-based > aggregator > - the aggregation buffer in the sort-based aggregator is not actually reused >
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Description: The `ObjectHashAggregateExec` has three performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after falling back to the sort-based aggregator - the aggregation buffer in the sort-based aggregator is not actually reused was: The `ObjectHashAggregateExec` has three performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after falling back to the sort-based aggregator - the aggregation buffer in the sort-based aggregator is not actually reused > Improve ObjectHashAggregateExec performance > --- > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has three performance issues: > - heavy overhead of Scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after falling back to the sort-based > aggregator > - the aggregation buffer in the sort-based aggregator is not actually reused >
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Description: The `ObjectHashAggregateExec` has three performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after falling back to the sort-based aggregator - the aggregation buffer in the sort-based aggregator is not actually reused was: The `ObjectHashAggregateExec` has two performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after falling back to the sort-based aggregator > Improve ObjectHashAggregateExec performance > --- > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has three performance issues: > - heavy overhead of Scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after falling back to the sort-based > aggregator > - the aggregation buffer in the sort-based aggregator is not actually reused >
[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance
[ https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-43232: -- Description: The `ObjectHashAggregateExec` has two performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison after falling back to the sort-based aggregator was: The `ObjectHashAggregateExec` has two performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison when falling back to the sort-based aggregator > Improve ObjectHashAggregateExec performance > --- > > Key: SPARK-43232 > URL: https://issues.apache.org/jira/browse/SPARK-43232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > The `ObjectHashAggregateExec` has two performance issues: > - heavy overhead of Scala sugar in `createNewAggregationBuffer` > - unnecessary grouping key comparison after falling back to the sort-based > aggregator >
[jira] [Created] (SPARK-43232) Improve ObjectHashAggregateExec performance
XiDuo You created SPARK-43232: - Summary: Improve ObjectHashAggregateExec performance Key: SPARK-43232 URL: https://issues.apache.org/jira/browse/SPARK-43232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You The `ObjectHashAggregateExec` has two performance issues: - heavy overhead of Scala sugar in `createNewAggregationBuffer` - unnecessary grouping key comparison when falling back to the sort-based aggregator
[jira] [Created] (SPARK-43026) Apply AQE with non-exchange table cache
XiDuo You created SPARK-43026: - Summary: Apply AQE with non-exchange table cache Key: SPARK-43026 URL: https://issues.apache.org/jira/browse/SPARK-43026 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You TableCacheQueryStageExec supports reporting runtime statistics, so it is possible for AQE to plan a better executed plan during re-optimization.
[jira] [Created] (SPARK-42963) Extend SparkSessionExtensions to inject rules into AQE query stage optimizer
XiDuo You created SPARK-42963: - Summary: Extend SparkSessionExtensions to inject rules into AQE query stage optimizer Key: SPARK-42963 URL: https://issues.apache.org/jira/browse/SPARK-42963 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You
[jira] [Created] (SPARK-42942) Support coalesce table cache stage partitions
XiDuo You created SPARK-42942: - Summary: Support coalesce table cache stage partitions Key: SPARK-42942 URL: https://issues.apache.org/jira/browse/SPARK-42942 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If a cached plan holds some small partitions, we can coalesce them just as we do for shuffle partitions.
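The coalescing idea referenced above can be sketched conceptually (Python; `coalesce_partitions` and the greedy grouping are illustrative, not Spark's actual coalescing rule): walk the partition sizes in order and pack adjacent partitions into one group until a target size is reached, the same shape of logic AQE applies to small shuffle partitions.

```python
def coalesce_partitions(sizes, target):
    """Greedily group adjacent partitions until each group reaches
    `target` bytes; returns lists of original partition indices."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        current_size += size
        if current_size >= target:  # group is big enough, close it
            groups.append(current)
            current, current_size = [], 0
    if current:                     # flush any trailing small group
        groups.append(current)
    return groups
```

For example, partitions of sizes `[10, 20, 5, 100, 3]` with a 64-byte target collapse into two groups instead of five tasks.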
[jira] [Updated] (SPARK-42815) Subexpression elimination support shortcut expression
[ https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42815: -- Summary: Subexpression elimination support shortcut expression (was: Subexpression elimination support shortcut conditional expression) > Subexpression elimination support shortcut expression > - > > Key: SPARK-42815 > URL: https://issues.apache.org/jira/browse/SPARK-42815 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Minor > > A subexpression may not need to be evaluated even if it appears more than > once. e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be > skipped if {{a}} is true.
[jira] [Updated] (SPARK-42815) Subexpression elimination support shortcut conditional expression
[ https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42815: -- Description: A subexpression may not need to be evaluated even if it appears more than once. e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be skipped if {{a}} is true. was: A subexpression in a conditional expression may not need to be evaluated even if it appears more than once. e.g., `if(or(a, and(b, b)))`, the expression `b` would be skipped if `a` is true. > Subexpression elimination support shortcut conditional expression > - > > Key: SPARK-42815 > URL: https://issues.apache.org/jira/browse/SPARK-42815 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Minor > > A subexpression may not need to be evaluated even if it appears more than > once. e.g., {{{}if(or(a, and(b, b))){}}}, the expression {{b}} would be > skipped if {{a}} is true.
[jira] [Created] (SPARK-42815) Subexpression elimination support shortcut conditional expression
XiDuo You created SPARK-42815: - Summary: Subexpression elimination support shortcut conditional expression Key: SPARK-42815 URL: https://issues.apache.org/jira/browse/SPARK-42815 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You A subexpression in a conditional expression may not need to be evaluated even if it appears more than once. e.g., `if(or(a, and(b, b)))`, the expression `b` would be skipped if `a` is true.
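The `if(or(a, and(b, b)))` example can be illustrated with a small conceptual sketch (Python; the thunk-based evaluation is illustrative, not Spark's codegen): eagerly hoisting the common subexpression `b` wastes one evaluation whenever `a` is true, while shortcut-aware elimination keeps `b` lazy so short-circuiting can skip it entirely.

```python
calls = {"b": 0}

def b():
    """A 'subexpression' whose evaluations we count."""
    calls["b"] += 1
    return True

def expr(a, b_thunk):
    # if(or(a, and(b, b))): with short-circuit evaluation, the thunk
    # is never invoked when a is true
    return a or (b_thunk() and b_thunk())

# Eager subexpression elimination evaluates b up front -- wasted work
# when a turns out to be true:
hoisted = b()                       # 1 evaluation, possibly unnecessary
assert expr(True, lambda: hoisted) is True

# Shortcut-aware elimination keeps b lazy; a=True skips it entirely:
calls["b"] = 0
assert expr(True, b) is True
assert calls["b"] == 0              # b was never evaluated
```

When `a` is false the lazy form still evaluates `b` twice, which is exactly where caching the first lazy evaluation (rather than hoisting it unconditionally) pays off.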