[jira] [Commented] (SPARK-49281) Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost

2024-08-21 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875698#comment-17875698
 ] 

XiDuo You commented on SPARK-49281:
---

Resolved by https://github.com/apache/spark/pull/47797
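
For context, the change swaps Parquet's Binary.getBytes(), which may allocate and copy the backing bytes, for Binary.getBytesUnsafe(), which can return the backing (or already materialized) array directly. A minimal sketch of the idea in Scala, not the actual patch:

{code:scala}
// Minimal sketch of the idea, not the actual Spark patch. Parquet's
// Binary.getBytes() may allocate and copy the underlying bytes, while
// getBytesUnsafe() can return the backing (or already materialized) array
// directly, skipping that copy on the hot read path.
import org.apache.parquet.io.api.Binary

object BinaryReadSketch {
  // Only safe when the caller neither mutates nor retains the returned array
  // beyond the lifetime of the underlying buffer.
  def readWithoutCopy(value: Binary): Array[Byte] = value.getBytesUnsafe

  def readWithCopy(value: Binary): Array[Byte] = value.getBytes
}
{code}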

> Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost 
> ---
>
> Key: SPARK-49281
> URL: https://issues.apache.org/jira/browse/SPARK-49281
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost 






[jira] [Resolved] (SPARK-49281) Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost

2024-08-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-49281.
---
Fix Version/s: 4.0.0
 Assignee: xy
   Resolution: Fixed

> Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost 
> ---
>
> Key: SPARK-49281
> URL: https://issues.apache.org/jira/browse/SPARK-49281
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Optimize parquet binary getBytes with getBytesUnsafe to avoid copy cost 






[jira] [Updated] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError

2024-08-13 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-49179:
--
Affects Version/s: 3.5.2
   (was: 3.5.1)

> Fix v2 multi bucketed inner joins throw AssertionError
> --
>
> Key: SPARK-49179
> URL: https://issues.apache.org/jira/browse/SPARK-49179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.3.4, 3.5.2, 3.4.3
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> {code:java}
> [info]   Cause: java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:264)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
> [info]   at scala.collection.immutable.List.map(List.scala:247)
> [info]   at scala.collection.immutable.List.map(List.scala:79)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
> [info]   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)
> [in
> {code}






[jira] [Updated] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError

2024-08-13 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-49179:
--
Fix Version/s: 3.4.4
   3.5.3

> Fix v2 multi bucketed inner joins throw AssertionError
> --
>
> Key: SPARK-49179
> URL: https://issues.apache.org/jira/browse/SPARK-49179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.4.4, 3.5.3
>
>
> {code:java}
> [info]   Cause: java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:264)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
> [info]   at scala.collection.immutable.List.map(List.scala:247)
> [info]   at scala.collection.immutable.List.map(List.scala:79)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
> [info]   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)
> [in
> {code}






[jira] [Resolved] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError

2024-08-12 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-49179.
---
Fix Version/s: 3.4.4
   4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47683
[https://github.com/apache/spark/pull/47683]

> Fix v2 multi bucketed inner joins throw AssertionError
> --
>
> Key: SPARK-49179
> URL: https://issues.apache.org/jira/browse/SPARK-49179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.4, 4.0.0, 3.5.3
>
>
> {code:java}
> [info]   Cause: java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:264)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
> [info]   at scala.collection.immutable.List.map(List.scala:247)
> [info]   at scala.collection.immutable.List.map(List.scala:79)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
> [info]   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)
> [in
> {code}






[jira] [Assigned] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError

2024-08-12 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-49179:
-

Assignee: XiDuo You

> Fix v2 multi bucketed inner joins throw AssertionError
> --
>
> Key: SPARK-49179
> URL: https://issues.apache.org/jira/browse/SPARK-49179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1, 3.3.4, 3.4.3
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [info]   Cause: java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:264)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
> [info]   at scala.collection.immutable.List.map(List.scala:247)
> [info]   at scala.collection.immutable.List.map(List.scala:79)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
> [info]   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
> [info]   at 
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)
> [in
> {code}






[jira] [Resolved] (SPARK-49200) Fix null type non-codegen ordering exception

2024-08-12 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-49200.
---
Fix Version/s: 4.0.0
   3.5.3
   Resolution: Fixed

Issue resolved by pull request 47707
[https://github.com/apache/spark/pull/47707]

> Fix null type non-codegen ordering exception
> 
>
> Key: SPARK-49200
> URL: https://issues.apache.org/jira/browse/SPARK-49200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.2
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.3
>
>
> {code:java}
> set spark.sql.codegen.factoryMode=NO_CODEGEN;
> set 
> spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.EliminateSorts;
> select * from range(10) order by array(null);
> {code}
> {code:java}
> org.apache.spark.SparkIllegalArgumentException: Type PhysicalNullType does 
> not support ordered operations.
> at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:352)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalNullType.ordering(PhysicalDataType.scala:246)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalNullType.ordering(PhysicalDataType.scala:243)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalArrayType$$anon$1.<init>(PhysicalDataType.scala:283)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalArrayType.interpretedOrdering$lzycompute(PhysicalDataType.scala:281)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalArrayType.interpretedOrdering(PhysicalDataType.scala:281)
> at 
> org.apache.spark.sql.catalyst.types.PhysicalArrayType.ordering(PhysicalDataType.scala:277)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:67)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:40)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:70)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:44)
> {code}






[jira] [Created] (SPARK-49205) KeyGroupedPartitioning should inherit HashPartitioningLike

2024-08-12 Thread XiDuo You (Jira)
XiDuo You created SPARK-49205:
-

 Summary: KeyGroupedPartitioning should inherit HashPartitioningLike
 Key: SPARK-49205
 URL: https://issues.apache.org/jira/browse/SPARK-49205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You









[jira] [Created] (SPARK-49179) Fix v2 multi bucketed inner joins throw AssertionError

2024-08-09 Thread XiDuo You (Jira)
XiDuo You created SPARK-49179:
-

 Summary: Fix v2 multi bucketed inner joins throw AssertionError
 Key: SPARK-49179
 URL: https://issues.apache.org/jira/browse/SPARK-49179
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.3, 3.3.4, 3.5.1, 4.0.0
Reporter: XiDuo You



{code:java}
[info]   Cause: java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:264)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
[info]   at scala.collection.immutable.List.map(List.scala:247)
[info]   at scala.collection.immutable.List.map(List.scala:79)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
[info]   at 
org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
[info]   at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)
[in
{code}
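
A hedged sketch of the query shape implied by the title, assuming a SparkSession named spark and bucketed tables registered in a v2 catalog; the catalog and table names are hypothetical and the exact reproduction is in pull request 47683, referenced above. The shape is several v2 tables sharing a key-grouped (bucketed) partitioning, inner-joined in one query with storage-partitioned joins enabled:

{code:scala}
// Hedged sketch only: catalog/table names are hypothetical and the real
// reproduction lives in the pull request linked above. The assertion fires
// inside EnsureRequirements.createKeyGroupedShuffleSpec while planning the
// joins over key-grouped (bucketed) v2 tables.
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")

val df = spark.sql(
  """SELECT *
    |FROM testcat.ns.t1 a
    |JOIN testcat.ns.t2 b ON a.id = b.id
    |JOIN testcat.ns.t3 c ON a.id = c.id
    |""".stripMargin)
df.collect()
{code}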







[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870374#comment-17870374
 ] 

XiDuo You commented on SPARK-49030:
---

[~jira.shegalov] the case you point out makes sense to me; I can reproduce it.

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
> Fix For: 3.5.3
>
> Attachments: screenshot-1.png
>
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_

[jira] [Resolved] (SPARK-49071) Remove ArraySortLike trait

2024-08-01 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-49071.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47547
[https://github.com/apache/spark/pull/47547]

> Remove ArraySortLike trait
> --
>
> Key: SPARK-49071
> URL: https://issues.apache.org/jira/browse/SPARK-49071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-49071) Remove ArraySortLike trait

2024-08-01 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-49071:
-

Assignee: XiDuo You

> Remove ArraySortLike trait
> --
>
> Key: SPARK-49071
> URL: https://issues.apache.org/jira/browse/SPARK-49071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870322#comment-17870322
 ] 

XiDuo You commented on SPARK-49030:
---

> Hi [~ulysses], thank you for having a look. I wonder how you tested it 
> though. I'm pretty sure that the query plan shown above is the final query 
> plan.

[~jihoonson] That does not seem to be the case... The query plan tree string you put in 
the JIRA description contains:
{code:java}
AdaptiveSparkPlan isFinalPlan=false
{code}
so I think it is actually an initial plan.

> I see the same query plan in the Spark web UI after the query completes. The 
> stage views in the UI show the same query plan. Finally, when I repeatedly 
> run the query in question, each run returns a different result.

I tried the following case with Spark 3.5.1 but cannot reproduce your issue. 
I always get the same result because the shuffle introduced by the limit is a 
reused exchange.
{code:java}
create table t1 using parquet as select /*+ repartition(3) */ rand() as c1, id 
as c2 from range(1);

WITH c AS (SELECT * FROM t1 LIMIT 10)
SELECT count(*)
FROM c c1, c c2
WHERE c1.c1 > c2.c1
;
{code}
!screenshot-1.png!
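
One way to confirm which plan you are looking at (a sketch, assuming the SparkSession spark and the t1 table created above): trigger execution first, then print the executed plan; with AQE the plan string only shows isFinalPlan=true and any ReusedExchange nodes after the query has actually run.

{code:scala}
// Sketch assuming the SparkSession `spark` and the table t1 created above.
// Run the query first so AQE can finalize the plan, then inspect it: the
// plan string now shows isFinalPlan=true and any ReusedExchange nodes.
val df = spark.sql(
  """WITH c AS (SELECT * FROM t1 LIMIT 10)
    |SELECT count(*)
    |FROM c c1, c c2
    |WHERE c1.c1 > c2.c1
    |""".stripMargin)
df.collect()
println(df.queryExecution.executedPlan.toString)
{code}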

So which Spark version did you test? Have you tried Spark 3.5.1 or 4.0.0?

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c

[jira] [Updated] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-49030:
--
Attachment: screenshot-1.png

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, 

[jira] [Commented] (SPARK-49030) Self join of a CTE seems non-deterministic

2024-08-01 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-49030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870144#comment-17870144
 ] 

XiDuo You commented on SPARK-49030:
---

[~jihoonson] it seems you are referencing an initial query plan. AQE will reuse 
the exchange at runtime, so I think the limit operator should be deterministic. 
For non-deterministic expressions, Spark has already fixed this issue using 
exchange reuse; see SPARK-36447. Also cc [~yao]
 

> Self join of a CTE seems non-deterministic
> --
>
> Key: SPARK-49030
> URL: https://issues.apache.org/jira/browse/SPARK-49030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
> Environment: Tested with Spark 3.4.1, 3.5.1, and 4.0.0-preview.
>Reporter: Jihoon Son
>Priority: Minor
>
> {code:java}
> WITH c AS (SELECT * FROM customer LIMIT 10)
> SELECT count(*)
> FROM c c1, c c2
> WHERE c1.c_customer_sk > c2.c_customer_sk{code}
> Suppose a self join query on a CTE such as the one above.
> Spark generates a physical plan like the one below for this query.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(1)], output=[count(1)#194L])
>    +- HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#233L])
>       +- Project
>          +- BroadcastNestedLoopJoin BuildRight, Inner, (c_customer_sk#0 > 
> c_customer_sk#214)
>             :- Filter isnotnull(c_customer_sk#0)
>             :  +- GlobalLimit 10, 0
>             :     +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=256]
>             :        +- LocalLimit 10
>             :           +- FileScan parquet [c_customer_sk#0] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
>             +- BroadcastExchange IdentityBroadcastMode, [plan_id=263]
>                +- Filter isnotnull(c_customer_sk#214)
>                   +- GlobalLimit 10, 0
>                      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, 
> [plan_id=259]
>                         +- LocalLimit 10
>                            +- FileScan parquet [c_customer_sk#214] Batched: 
> true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 
> paths)[file:/some/path/customer], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> Evaluating this plan produces non-deterministic result because the limit is 
> independently pushed into the two sides of the join. Each limit can produce 
> different data, and thus the join can produce results that vary across runs.
> I understand that the query in question is not deterministic (and thus not 
> very practical) as, due to the nature of the limit in distributed engines, it 
> is not expected to produce the same result anyway across repeated runs. 
> However, I would still expect that the query plan evaluation remains 
> deterministic.
> Per extended analysis as seen below, it seems that the query plan has changed 
> at some point during optimization.
> {code:java}
> == Analyzed Logical Plan ==
> count(1): bigint
> WithCTE
> :- CTERelationDef 2, false
> :  +- SubqueryAlias c
> :     +- GlobalLimit 10
> :        +- LocalLimit 10
> :           +- Project [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17]
> :              +- SubqueryAlias customer
> :                 +- View (`customer`, [c_customer_sk#0, c_customer_id#1, 
> c_current_cdemo_sk#2, c_current_hdemo_sk#3, c_current_addr_sk#4, 
> c_first_shipto_date_sk#5, c_first_sales_date_sk#6, c_salutation#7, 
> c_first_name#8, c_last_name#9, c_preferred_cust_flag#10, c_birth_day#11L, 
> c_birth_month#12L, c_birth_year#13L, c_birth_country#14, c_login#15, 
> c_email_address#16, c_last_review_date_sk#17])
> :                    +- Relation 
> [c_customer_sk#0,c_customer_id#1,c_current_cdemo_sk#2,c_current_hdemo_sk#3,c_current_addr_sk#4,c_first_shipto_date_sk#5,c_first_sales_date_sk#6,c_salutation#7,c_first_name#8,c_last_name#9,c_preferred_cust_flag#10,c_birth_day#11L,c_birth_month#12L,c_birth_year#13L,c_birth_country#14,c_login#15,c_email_address#16,c_last_review_date_sk#17]
>  parquet
> +- Aggregate [count(1) AS count(1)#194L]
>    +- Filter (c_customer_sk#0 > c_customer_sk#176)
>       +- Join Inner
>          :- SubqueryAlias c1
>          :  +- SubqueryAlias c
>          :     +- CTERelationRef 2, true, [c_customer_sk#0, c_customer_id#1, 
> c

[jira] [Updated] (SPARK-49071) Remove ArraySortLike trait

2024-07-31 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-49071:
--
Summary: Remove ArraySortLike trait  (was: Cleanup SortArray)

> Remove ArraySortLike trait
> --
>
> Key: SPARK-49071
> URL: https://issues.apache.org/jira/browse/SPARK-49071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-49071) Cleanup SortArray

2024-07-30 Thread XiDuo You (Jira)
XiDuo You created SPARK-49071:
-

 Summary: Cleanup SortArray
 Key: SPARK-49071
 URL: https://issues.apache.org/jira/browse/SPARK-49071
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You









[jira] [Resolved] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2024-07-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45201.
---
Fix Version/s: 3.5.2
   Resolution: Fixed

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Sebastian Daberdaku
>Priority: Major
> Fix For: 3.5.2
>
> Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
>     at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:625)
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888)
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1099)
>     at scala.Option.getOrElse(Option.scala:189)
>     at 
> org.apache.spark.sql.SparkSession$Build

[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2024-07-23 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868221#comment-17868221
 ] 

XiDuo You commented on SPARK-45201:
---

This issue has been fixed by https://github.com/apache/spark/pull/45775

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Sebastian Daberdaku
>Priority: Major
> Attachments: Dockerfile, spark-3.5.0.patch, spark-3.5.1.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
>     at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:625)
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2888)
>     at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1099)
>     at scala.Option.getOrElse(Option.scala:189)
>   

[jira] [Updated] (SPARK-41763) Refactor v1 writes

2024-07-15 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41763:
--
Summary: Refactor v1 writes  (was: Improve v1 writes)

> Refactor v1 writes
> --
>
> Key: SPARK-41763
> URL: https://issues.apache.org/jira/browse/SPARK-41763
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> This umbrella is used to track all tickets about the changes of v1 writes.






[jira] [Resolved] (SPARK-41763) Improve v1 writes

2024-07-15 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-41763.
---
Fix Version/s: 3.4.0
 Assignee: XiDuo You
   Resolution: Fixed

> Improve v1 writes
> -
>
> Key: SPARK-41763
> URL: https://issues.apache.org/jira/browse/SPARK-41763
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> This umbrella is used to track all tickets about the changes of v1 writes.






[jira] [Resolved] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize

2024-07-14 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-48880.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47321
[https://github.com/apache/spark/pull/47321]

> Avoid throw NullPointerException if driver plugin fails to initialize
> -
>
> Key: SPARK-48880
> URL: https://issues.apache.org/jira/browse/SPARK-48880
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize

2024-07-14 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-48880:
-

Assignee: XiDuo You

> Avoid throw NullPointerException if driver plugin fails to initialize
> -
>
> Key: SPARK-48880
> URL: https://issues.apache.org/jira/browse/SPARK-48880
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48880) Avoid throw NullPointerException if driver plugin fails to initialize

2024-07-12 Thread XiDuo You (Jira)
XiDuo You created SPARK-48880:
-

 Summary: Avoid throw NullPointerException if driver plugin fails 
to initialize
 Key: SPARK-48880
 URL: https://issues.apache.org/jira/browse/SPARK-48880
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: XiDuo You









[jira] [Assigned] (SPARK-48817) MultiInsert is split to multiple sql executions, resulting in no exchange reuse

2024-07-09 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-48817:
-

Assignee: Zhen Wang

> MultiInsert is split to multiple sql executions, resulting in no exchange 
> reuse
> ---
>
> Key: SPARK-48817
> URL: https://issues.apache.org/jira/browse/SPARK-48817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Zhen Wang
>Assignee: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-07-05-14-59-35-340.png, 
> image-2024-07-05-14-59-55-291.png, image-2024-07-05-15-00-01-805.png, 
> image-2024-07-05-15-00-09-181.png, image-2024-07-05-15-00-17-693.png, 
> image-2024-07-05-16-42-01-973.png, image-2024-07-05-16-42-17-817.png, 
> image-2024-07-05-16-42-27-033.png, image-2024-07-05-16-42-34-738.png, 
> image-2024-07-05-16-42-46-500.png
>
>
> MultiInsert is split into multiple SQL executions, resulting in no exchange 
> reuse.
>  
> Reproduce sql:
> {code:java}
> create table wangzhen_t1(c1 int);
> create table wangzhen_t2(c1 int);
> create table wangzhen_t3(c1 int);
> insert into wangzhen_t1 values (1), (2), (3);
> from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1)
> insert overwrite table wangzhen_t2 select c1
> insert overwrite table wangzhen_t3 select c1; {code}
>  
> In Spark 3.1, there is only one SQL execution and there is a reuse exchange.
> !image-2024-07-05-14-59-35-340.png!
>  
> However, in Spark 3.5, it was split into multiple executions and there was no 
> ReuseExchange.
> !image-2024-07-05-16-42-01-973.png!
> !image-2024-07-05-16-42-17-817.png!!image-2024-07-05-16-42-34-738.png!!image-2024-07-05-16-42-46-500.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48817) MultiInsert is split to multiple sql executions, resulting in no exchange reuse

2024-07-09 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-48817.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> MultiInsert is split to multiple sql executions, resulting in no exchange 
> reuse
> ---
>
> Key: SPARK-48817
> URL: https://issues.apache.org/jira/browse/SPARK-48817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Zhen Wang
>Assignee: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2024-07-05-14-59-35-340.png, 
> image-2024-07-05-14-59-55-291.png, image-2024-07-05-15-00-01-805.png, 
> image-2024-07-05-15-00-09-181.png, image-2024-07-05-15-00-17-693.png, 
> image-2024-07-05-16-42-01-973.png, image-2024-07-05-16-42-17-817.png, 
> image-2024-07-05-16-42-27-033.png, image-2024-07-05-16-42-34-738.png, 
> image-2024-07-05-16-42-46-500.png
>
>
> MultiInsert is split into multiple SQL executions, resulting in no exchange 
> reuse.
>  
> Reproduce sql:
> {code:java}
> create table wangzhen_t1(c1 int);
> create table wangzhen_t2(c1 int);
> create table wangzhen_t3(c1 int);
> insert into wangzhen_t1 values (1), (2), (3);
> from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1)
> insert overwrite table wangzhen_t2 select c1
> insert overwrite table wangzhen_t3 select c1; {code}
>  
> In Spark 3.1, there is only one SQL execution and there is a reuse exchange.
> !image-2024-07-05-14-59-35-340.png!
>  
> However, in Spark 3.5, it was split into multiple executions and there was no 
> ReuseExchange.
> !image-2024-07-05-16-42-01-973.png!
> !image-2024-07-05-16-42-17-817.png!!image-2024-07-05-16-42-34-738.png!!image-2024-07-05-16-42-46-500.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48815) Update environment when stopping connect session

2024-07-05 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-48815.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47223
[https://github.com/apache/spark/pull/47223]

> Update environment when stopping connect session
> ---
>
> Key: SPARK-48815
> URL: https://issues.apache.org/jira/browse/SPARK-48815
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We should update the environment if any added files are removed when stopping 
> the Connect session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48815) Update environment when stopping connect session

2024-07-04 Thread XiDuo You (Jira)
XiDuo You created SPARK-48815:
-

 Summary: Update environment when stopping connect session
 Key: SPARK-48815
 URL: https://issues.apache.org/jira/browse/SPARK-48815
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: XiDuo You


We should update the environment if any added files are removed when stopping 
the Connect session.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-48168) Add bitwise shifting operators support

2024-05-24 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-48168.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46440
[https://github.com/apache/spark/pull/46440]

> Add bitwise shifting operators support
> --
>
> Key: SPARK-48168
> URL: https://issues.apache.org/jira/browse/SPARK-48168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
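For context (the ticket itself carries no description), a minimal spark-shell sketch of what
bitwise shifting operator support looks like, assuming the operator syntax this ticket introduces;
the shiftleft/shiftright/shiftrightunsigned function forms are the long-standing equivalents:

{code:scala}
// Assumes Spark 4.0.0 with the <<, >> and >>> operators available (this ticket);
// the function-based forms below also work on earlier releases.
spark.sql("SELECT 4 << 1 AS shl, 16 >> 2 AS shr, (-8) >>> 1 AS ushr").show()
// equivalent function forms: shiftleft(4, 1) = 8, shiftright(16, 2) = 4
spark.sql("SELECT shiftleft(4, 1), shiftright(16, 2), shiftrightunsigned(-8, 1)").show()
{code}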




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46090) Support plan fragment level SQL configs in AQE

2024-05-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-46090:
-

Assignee: XiDuo You

> Support plan fragment level SQL configs  in AQE
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46090) Support plan fragment level SQL configs in AQE

2024-05-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-46090.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44013
[https://github.com/apache/spark/pull/44013]

> Support plan fragment level SQL configs  in AQE
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-48004) Add WriteFilesExecBase trait for v1 write

2024-04-26 Thread XiDuo You (Jira)
XiDuo You created SPARK-48004:
-

 Summary: Add WriteFilesExecBase trait for v1 write
 Key: SPARK-48004
 URL: https://issues.apache.org/jira/browse/SPARK-48004
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47285) AdaptiveSparkPlanExec should always use the context.session

2024-03-05 Thread XiDuo You (Jira)
XiDuo You created SPARK-47285:
-

 Summary: AdaptiveSparkPlanExec should always use the 
context.session
 Key: SPARK-47285
 URL: https://issues.apache.org/jira/browse/SPARK-47285
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-05 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-47177:
--
Fix Version/s: 3.4.3

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> An AQE plan is expected to display the final plan after execution. This is not true 
> for a cached SQL plan: it shows the initial plan instead. This behavior 
> change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
> to fix the concurrency issue with cached plans. 
> *In short, the plan used to execute and the plan used to explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix it yet:
>  * maybe we just add a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and on every call check 
> whether the original AQE plan is finalized (making the final flag atomic 
> first, of course); if it is not, return the cloned initial plan, and if it is finalized, 
> clone the final plan and return that one. This still won't be able to 
> reflect the AQE plan in real time in a concurrent situation, but at least we 
> have an initial version and a final version.
>  
> A simple repro:
> {code:java}
> from pyspark.sql.functions import expr  # import needed for expr below
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-04 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-47177.
---
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45282
[https://github.com/apache/spark/pull/45282]

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> An AQE plan is expected to display the final plan after execution. This is not true 
> for a cached SQL plan: it shows the initial plan instead. This behavior 
> change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
> to fix the concurrency issue with cached plans. 
> *In short, the plan used to execute and the plan used to explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix it yet:
>  * maybe we just add a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and on every call check 
> whether the original AQE plan is finalized (making the final flag atomic 
> first, of course); if it is not, return the cloned initial plan, and if it is finalized, 
> clone the final plan and return that one. This still won't be able to 
> reflect the AQE plan in real time in a concurrent situation, but at least we 
> have an initial version and a final version.
>  
> A simple repro:
> {code:java}
> from pyspark.sql.functions import expr  # import needed for expr below
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-04 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-47177:
-

Assignee: XiDuo You

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> An AQE plan is expected to display the final plan after execution. This is not true 
> for a cached SQL plan: it shows the initial plan instead. This behavior 
> change was introduced in [https://github.com/apache/spark/pull/40812], which tried 
> to fix the concurrency issue with cached plans. 
> *In short, the plan used to execute and the plan used to explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix it yet:
>  * maybe we just add a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and on every call check 
> whether the original AQE plan is finalized (making the final flag atomic 
> first, of course); if it is not, return the cloned initial plan, and if it is finalized, 
> clone the final plan and return that one. This still won't be able to 
> reflect the AQE plan in real time in a concurrent situation, but at least we 
> have an initial version and a final version.
>  
> A simple repro:
> {code:java}
> from pyspark.sql.functions import expr  # import needed for expr below
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47187) Fix hive compress output config does not work

2024-02-27 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-47187.
---
Fix Version/s: 3.4.3
   Resolution: Fixed

Issue resolved by pull request 45286
[https://github.com/apache/spark/pull/45286]

> Fix hive compress output config does not work
> -
>
> Key: SPARK-47187
> URL: https://issues.apache.org/jira/browse/SPARK-47187
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47187) Fix hive compress output config does not work

2024-02-27 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-47187:
-

Assignee: XiDuo You

> Fix hive compress output config does not work
> -
>
> Key: SPARK-47187
> URL: https://issues.apache.org/jira/browse/SPARK-47187
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.2
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47187) Fix hive compress output config does not work

2024-02-27 Thread XiDuo You (Jira)
XiDuo You created SPARK-47187:
-

 Summary: Fix hive compress output config does not work
 Key: SPARK-47187
 URL: https://issues.apache.org/jira/browse/SPARK-47187
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.2
Reporter: XiDuo You
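The ticket has no description; as an assumption-based sketch of the scenario the title points at
(Hive-style output compression when writing through the Hive path; the table names are
hypothetical):

{code:scala}
// spark-shell sketch. hive.exec.compress.output and the mapreduce codec key are the
// standard Hive/Hadoop settings; whether the write path honors them is what this fix
// is about (assumption based on the title only).
spark.sql("SET hive.exec.compress.output=true")
spark.sql("SET mapreduce.output.fileoutputformat.compress.codec=" +
  "org.apache.hadoop.io.compress.GzipCodec")
spark.sql("INSERT OVERWRITE TABLE target_tbl SELECT * FROM src_tbl")  // hypothetical tables
{code}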






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46756) Add rule to rewrite null safe equality join keys

2024-01-17 Thread XiDuo You (Jira)
XiDuo You created SPARK-46756:
-

 Summary: Add rule to rewrite null safe equality join keys
 Key: SPARK-46756
 URL: https://issues.apache.org/jira/browse/SPARK-46756
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You
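The ticket has no description; as an illustration of the general idea behind rewriting null-safe
equality join keys (not necessarily the exact rule added here): a null-safe key can be turned into
null-free conjuncts, which exposes plain equality keys to the planner, because for any constant d,
a <=> b is equivalent to coalesce(a, d) = coalesce(b, d) AND isnull(a) = isnull(b).

{code:scala}
// spark-shell sketch; t1/t2 are throwaway example tables.
spark.sql("CREATE TABLE t1(a INT) USING parquet")
spark.sql("CREATE TABLE t2(b INT) USING parquet")

val nullSafe  = spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a <=> t2.b")
val rewritten = spark.sql(
  "SELECT * FROM t1 JOIN t2 " +
  "ON coalesce(t1.a, 0) = coalesce(t2.b, 0) AND isnull(t1.a) = isnull(t2.b)")
// Both queries return the same rows; the rewritten form uses ordinary equality join keys.
{code}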






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46480) Fix NPE when table cache task attempt

2023-12-22 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46480:
--
Fix Version/s: 3.5.1

> Fix NPE when table cache task attempt
> -
>
> Key: SPARK-46480
> URL: https://issues.apache.org/jira/browse/SPARK-46480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46480) Fix NPE when table cache task attempt

2023-12-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46480:
--
Summary: Fix NPE when table cache task attempt  (was: Fix NPE when table 
cache task do attempt)

> Fix NPE when table cache task attempt
> -
>
> Key: SPARK-46480
> URL: https://issues.apache.org/jira/browse/SPARK-46480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46480) Fix NPE when table cache task do attempt

2023-12-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46480:
--
Component/s: Spark Core

> Fix NPE when table cache task do attempt
> 
>
> Key: SPARK-46480
> URL: https://issues.apache.org/jira/browse/SPARK-46480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46480) Fix NPE when table cache task do attempt

2023-12-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46480:
--
Issue Type: Bug  (was: Improvement)

> Fix NPE when table cache task do attempt
> 
>
> Key: SPARK-46480
> URL: https://issues.apache.org/jira/browse/SPARK-46480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46480) Fix NPE when table cache task do attempt

2023-12-21 Thread XiDuo You (Jira)
XiDuo You created SPARK-46480:
-

 Summary: Fix NPE when table cache task do attempt
 Key: SPARK-46480
 URL: https://issues.apache.org/jira/browse/SPARK-46480
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46227) Move `withSQLConf` from SQLHelper trait to `SQLConfHelper` trait

2023-12-03 Thread XiDuo You (Jira)
XiDuo You created SPARK-46227:
-

 Summary: Move `withSQLConf` from SQLHelper trait to 
`SQLConfHelper` trait
 Key: SPARK-46227
 URL: https://issues.apache.org/jira/browse/SPARK-46227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You
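The ticket has no description; for context, a minimal standalone sketch of the withSQLConf pattern
being moved (simplified — the real method lives on the SQLHelper/SQLConfHelper traits, this version
only shows the set/run/restore behaviour):

{code:scala}
import org.apache.spark.sql.internal.SQLConf

object SQLConfSketch {
  // Set the given SQL confs, run the body, then restore whatever was there before.
  def withSQLConf[T](pairs: (String, String)*)(body: => T): T = {
    val conf = SQLConf.get
    val previous = pairs.map { case (k, _) =>
      k -> (if (conf.contains(k)) Some(conf.getConfString(k)) else None)
    }
    pairs.foreach { case (k, v) => conf.setConfString(k, v) }
    try body finally {
      previous.foreach {
        case (k, Some(v)) => conf.setConfString(k, v)
        case (k, None)    => conf.unsetConf(k)
      }
    }
  }
}

// Usage: SQLConfSketch.withSQLConf("spark.sql.shuffle.partitions" -> "5") { /* run a query */ }
{code}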






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions

2023-11-30 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-46170:
-

Assignee: XiDuo You

> Support inject adaptive query post planner strategy rules in 
> SparkSessionExtensions
> ---
>
> Key: SPARK-46170
> URL: https://issues.apache.org/jira/browse/SPARK-46170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions

2023-11-30 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-46170.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44074
[https://github.com/apache/spark/pull/44074]

> Support inject adaptive query post planner strategy rules in 
> SparkSessionExtensions
> ---
>
> Key: SPARK-46170
> URL: https://issues.apache.org/jira/browse/SPARK-46170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions

2023-11-28 Thread XiDuo You (Jira)
XiDuo You created SPARK-46170:
-

 Summary: Support inject adaptive query post planner strategy rules 
in SparkSessionExtensions
 Key: SPARK-46170
 URL: https://issues.apache.org/jira/browse/SPARK-46170
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You
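The ticket has no description; a hedged sketch of what injecting such a rule could look like
through SparkSessionExtensions (the extension-point name follows the ticket title — treat the
exact method name as an assumption):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

// A no-op rule; a real rule would inspect or rewrite the physical plan produced
// after planning but before AQE executes the query stages.
case class NoopPostPlannerRule(session: SparkSession) extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectQueryPostPlannerStrategyRule(s => NoopPostPlannerRule(s)))
  .getOrCreate()
{code}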






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46105) df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above

2023-11-26 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789908#comment-17789908
 ] 

XiDuo You commented on SPARK-46105:
---

Please see SPARK-39915
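For anyone hitting this: a minimal spark-shell sketch of an emptiness check that does not depend on
how many partitions repartition(1) yields on an empty DataFrame, which is the part that differs
across releases:

{code:scala}
val df = spark.emptyDataFrame
val isEmpty = df.isEmpty                          // Dataset.isEmpty, available since Spark 2.4
val out = if (isEmpty) df else df.repartition(1)  // only coalesce to one partition when there is data
{code}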

> df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
> ---
>
> Key: SPARK-46105
> URL: https://issues.apache.org/jira/browse/SPARK-46105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.3
> Environment: EKS
> EMR
>Reporter: dharani_sugumar
>Priority: Major
> Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png
>
>
> {color:#FF}Version: 3.3.3{color}
>  
> {color:#FF}scala> val df = spark.emptyDataFrame{color}
> {color:#FF}df: org.apache.spark.sql.DataFrame = []{color}
> {color:#FF}scala> df.rdd.getNumPartitions{color}
> {color:#FF}res0: Int = 0{color}
> {color:#FF}scala> df.repartition(1).rdd.getNumPartitions{color}
> {color:#FF}res1: Int = 1{color}
> {color:#FF}scala> df.repartition(1).rdd.isEmpty(){color}
> {color:#FF}res2: Boolean = true{color}
> Version: 3.2.4
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 0
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>  
> {color:#FF}Version: 3.5.0{color}
> {color:#FF}scala> val df = spark.emptyDataFrame{color}
> {color:#FF}df: org.apache.spark.sql.DataFrame = []{color}
> {color:#FF}scala> df.rdd.getNumPartitions{color}
> {color:#FF}res0: Int = 0{color}
> {color:#FF}scala> df.repartition(1).rdd.getNumPartitions{color}
> {color:#FF}res1: Int = 1{color}
> {color:#FF}scala> df.repartition(1).rdd.isEmpty(){color}
> {color:#FF}res2: Boolean = true{color}
>  
> When we repartition an empty dataframe to 1 partition, the resulting partition count is 
> 1 in versions 3.3.x and 3.5.x, whereas when I do the same in version 3.2.x, the 
> resulting partition count is 0. May I know why this behaviour changed from 3.2.x 
> to higher versions?
>  
> The reason for raising this as a bug is that I have a scenario where my final 
> dataframe returns 0 records in EKS (local Spark) with a single node (driver and 
> executor on the same node) but it returns 1 in EMR; both use the same Spark 
> version, 3.3.3. I'm not sure why this behaves differently in the two 
> environments. As an interim solution, I had to repartition an empty dataframe 
> if my final dataframe is empty, which returns 1 on 3.3.3. I would like to know 
> whether this is really a bug, or whether this behaviour exists in future versions and 
> cannot be changed.
>  
> If we go for a Spark upgrade and this behaviour changes, we will 
> face the issue again. 
> Please confirm.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs in AQE

2023-11-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46090:
--
Summary: Support plan fragment level SQL configs  in AQE  (was: Support 
plan fragment level SQL configs)

> Support plan fragment level SQL configs  in AQE
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs

2023-11-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46090:
--
Summary: Support plan fragment level SQL configs  (was: Support stage level 
SQL configs)

> Support plan fragment level SQL configs
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>
> AQE executes query plan stage by stage, so there is a chance to support stage 
> level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46090) Support plan fragment level SQL configs

2023-11-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-46090:
--
Description: AQE executes query plan stage by stage, so there is a chance 
to support plan fragment level SQL configs.  (was: AQE executes query plan 
stage by stage, so there is a chance to support stage level SQL configs.)

> Support plan fragment level SQL configs
> ---
>
> Key: SPARK-46090
> URL: https://issues.apache.org/jira/browse/SPARK-46090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Major
>
> AQE executes query plan stage by stage, so there is a chance to support plan 
> fragment level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46090) Support stage level SQL configs

2023-11-24 Thread XiDuo You (Jira)
XiDuo You created SPARK-46090:
-

 Summary: Support stage level SQL configs
 Key: SPARK-46090
 URL: https://issues.apache.org/jira/browse/SPARK-46090
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You


AQE executes query plan stage by stage, so there is a chance to support stage 
level SQL configs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-14 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-45882:
--
Fix Version/s: 3.4.2
   4.0.0

> BroadcastHashJoinExec propagate partitioning should respect 
> CoalescedHashPartitioning
> -
>
> Key: SPARK-45882
> URL: https://issues.apache.org/jira/browse/SPARK-45882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-14 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45882.
---
Fix Version/s: 3.5.1
   Resolution: Fixed

Issue resolved by pull request 43792
[https://github.com/apache/spark/pull/43792]

> BroadcastHashJoinExec propagate partitioning should respect 
> CoalescedHashPartitioning
> -
>
> Key: SPARK-45882
> URL: https://issues.apache.org/jira/browse/SPARK-45882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-14 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-45882:
-

Assignee: XiDuo You

> BroadcastHashJoinExec propagate partitioning should respect 
> CoalescedHashPartitioning
> -
>
> Key: SPARK-45882
> URL: https://issues.apache.org/jira/browse/SPARK-45882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45882) BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning

2023-11-10 Thread XiDuo You (Jira)
XiDuo You created SPARK-45882:
-

 Summary: BroadcastHashJoinExec propagate partitioning should 
respect CoalescedHashPartitioning
 Key: SPARK-45882
 URL: https://issues.apache.org/jira/browse/SPARK-45882
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34444) Pushdown scalar-subquery filter to FileSourceScan

2023-11-06 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-3.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Pushdown scalar-subquery filter to FileSourceScan
> -
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 4.0.0
>
>
> We can push down {{a < (select max(d) from t2)}} to FileSourceScan:
> {code:scala}
> sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM 
> range(5L)")
> sql("CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(20)")
> sql("SELECT * FROM t1 WHERE b = (select max(d) from t2)").show
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45740) Relax the node prefix of SparkPlanGraphCluster

2023-10-31 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-45740:
--
Summary: Relax the node prefix of SparkPlanGraphCluster  (was: Release the 
node prefix of SparkPlanGraphCluster)

> Relax the node prefix of SparkPlanGraphCluster
> --
>
> Key: SPARK-45740
> URL: https://issues.apache.org/jira/browse/SPARK-45740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45740) Release the node prefix of SparkPlanGraphCluster

2023-10-31 Thread XiDuo You (Jira)
XiDuo You created SPARK-45740:
-

 Summary: Release the node prefix of SparkPlanGraphCluster
 Key: SPARK-45740
 URL: https://issues.apache.org/jira/browse/SPARK-45740
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45705) Fix flaky test: Status of a failed DDL/DML with no jobs should be FAILED

2023-10-27 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45705.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43554
[https://github.com/apache/spark/pull/43554]

> Fix flaky test: Status of a failed DDL/DML with no jobs should be FAILED 
> -
>
> Key: SPARK-45705
> URL: https://issues.apache.org/jira/browse/SPARK-45705
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE

2023-10-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-45632:
-

Assignee: XiDuo You

> Table cache should avoid unnecessary ColumnarToRow when enable AQE
> --
>
> Key: SPARK-45632
> URL: https://issues.apache.org/jira/browse/SPARK-45632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
>
> If the cache serializer supports columnar input, then we do not need a 
> ColumnarToRow before caching the data. This PR improves the optimization with AQE 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE

2023-10-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45632.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43484
[https://github.com/apache/spark/pull/43484]

> Table cache should avoid unnecessary ColumnarToRow when enable AQE
> --
>
> Key: SPARK-45632
> URL: https://issues.apache.org/jira/browse/SPARK-45632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> If the cache serializer supports columnar input, then we do not need a 
> ColumnarToRow before caching the data. This PR improves the optimization with AQE 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45632) Table cache should avoid unnecessary ColumnarToRow when enable AQE

2023-10-23 Thread XiDuo You (Jira)
XiDuo You created SPARK-45632:
-

 Summary: Table cache should avoid unnecessary ColumnarToRow when 
enable AQE
 Key: SPARK-45632
 URL: https://issues.apache.org/jira/browse/SPARK-45632
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You


If the cache serializer supports columnar input, then we do not need a 
ColumnarToRow before caching the data. This PR improves the optimization with AQE 
enabled.
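For context, a minimal spark-shell sketch of the setup this touches: the table cache serializer is
chosen via the static conf spark.sql.cache.serializer, and the improvement applies when that
serializer accepts columnar input (the serializer class name below is illustrative, not a real
class):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")   // AQE on
  // static conf; any CachedBatchSerializer whose supportsColumnarInput(schema) returns true
  // can take the scan's columnar batches directly, skipping ColumnarToRow before caching.
  .config("spark.sql.cache.serializer", "org.example.ColumnarCachedBatchSerializer")
  .getOrCreate()

val df = spark.read.parquet("/path/to/table")   // a columnar (parquet) scan
df.cache()
df.count()                                      // materializes the cache
println(df.queryExecution.executedPlan)         // expected: no ColumnarToRow above the scan
{code}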



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-09 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773507#comment-17773507
 ] 

XiDuo You commented on SPARK-45443:
---

> But this idea only works for one query

Please see the following: `say, if there are multiple queries which reference the 
same cached RDD (e.g., in the Thrift server)`. There is a race condition if 
multiple queries reference and materialize the same cached RDD: they run in 
different query executions and different threads.

> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning

It is entirely unrelated to TableCacheQueryStage. This config is used to 
make AQE work for the cached plan.

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec, and it materializes the InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats, in order to apply AQE optimizations 
> to the remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting a job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing the same IMR instance (cached RDD) and the first InMemoryTableScanExec's 
> materialization takes longer, the following logic will return false 
> (inMemoryTableScan.isMaterialized => false), and this may cause replicated IMR 
> materialization. This behavior can be more visible when the cached RDD size is 
> large.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs use the same IMR instance
> {code:java}
> import spark.implicits._
> val arr = (1 to 12).map { i => {
> val index = i % 5
> (index, s"Employee_$index", s"Department_$index")
>   }
> }
> val df = arr.toDF("id", "name", "department")
>   .filter('id >= 0)
>   .sort("id")
>   .groupBy('id, 'name, 'department)
>   .count().as("count")
> df.persist()
> val df2 = df.sort("count").filter('count <= 2)
> val df3 = df.sort("count").filter('count >= 3)
> val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter")
> df4.show() {code}
> *Physical Plan:*
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (31)
> +- == Final Plan ==
>    CollectLimit (21)
>    +- * Project (20)
>       +- * SortMergeJoin FullOuter (19)
>          :- * Sort (10)
>          :  +- * Filter (9)
>          :     +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>          :        +- InMemoryTableScan (1)
>          :              +- InMemoryRelation (2)
>          :                    +- AdaptiveSparkPlan (7)
>          :                       +- HashAggregate (6)
>          :                          +- Exchange (5)
>          :                             +- HashAggregate (4)
>          :                                +- LocalTableScan (3)
>          +- * Sort (18)
>             +- * Filter (17)
>                +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>                   +- InMemoryTableScan (11)
>                         +- InMemoryRelation (12)
>                               +- AdaptiveSparkPlan (15)
>                                  +- HashAggregate (14)
>                                     +- Exchange (13)
>                                        +- HashAggregate (4)
>                                           +- LocalTableScan (3) {code}
> *Stages DAGs materializing the same IMR instance:*
> !IMR Materialization - Stage 2.png|width=303,height=134!
> !IMR Materialization - Stage 3.png|width=303,height=134!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-08 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773074#comment-17773074
 ] 

XiDuo You commented on SPARK-45443:
---

> Can this increase probability of concurrent IMR materialization for same IMR 
> instance?

I think they are the same. TableCacheQueryStage is more like a barrier that 
reports some metrics to the AQE framework. The gap introduced by `eagerly` is very small.

> For queries using AQE, can introducing TableCacheQueryStage into physical 
> plan once per unique IMR instance help

I did not see the difference. One idea is that we can introduce something 
like `ReusedTableCacheQueryStage`. The `ReusedTableCacheQueryStage` only holds 
an empty future which waits for the first TableCacheQueryStage materialization, 
so that we can make sure the cached RDD is only executed once. But this idea 
only works for one query; say, if there are multiple queries which reference the 
same cached RDD (e.g., in the Thrift server), the issue still exists.
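A minimal sketch of the "wait on the first materialization" idea described above (hypothetical
helper, not the real QueryStageExec API): the first stage to arrive for a given cache registers its
future, and later stages for the same cache simply wait on it.

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Future, Promise}

object CacheMaterializationRegistry {
  // keyed by a cache/IMR identifier (hypothetical)
  private val inflight = new ConcurrentHashMap[Long, Future[Unit]]()

  def materializeOnce(cacheId: Long)(doMaterialize: => Future[Unit]): Future[Unit] = {
    val promise = Promise[Unit]()
    val existing = inflight.putIfAbsent(cacheId, promise.future)
    if (existing == null) {
      promise.completeWith(doMaterialize)  // we won the race: actually submit the job
      promise.future
    } else {
      existing                             // someone else is already materializing this cache
    }
  }
}
{code}

As noted above, this only helps within one query; multiple queries referencing the same cached RDD
would still race unless such a registry were shared across query executions.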


> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec, and it materializes the InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats, in order to apply AQE optimizations 
> to the remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting a job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing the same IMR instance (cached RDD) and the first InMemoryTableScanExec's 
> materialization takes longer, the following logic will return false 
> (inMemoryTableScan.isMaterialized => false), and this may cause replicated IMR 
> materialization. This behavior can be more visible when the cached RDD size is 
> large.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs use the same IMR instance
> {code:java}
> import spark.implicits._
> val arr = (1 to 12).map { i => {
> val index = i % 5
> (index, s"Employee_$index", s"Department_$index")
>   }
> }
> val df = arr.toDF("id", "name", "department")
>   .filter('id >= 0)
>   .sort("id")
>   .groupBy('id, 'name, 'department)
>   .count().as("count")
> df.persist()
> val df2 = df.sort("count").filter('count <= 2)
> val df3 = df.sort("count").filter('count >= 3)
> val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter")
> df4.show() {code}
> *Physical Plan:*
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (31)
> +- == Final Plan ==
>    CollectLimit (21)
>    +- * Project (20)
>       +- * SortMergeJoin FullOuter (19)
>          :- * Sort (10)
>          :  +- * Filter (9)
>          :     +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>          :        +- InMemoryTableScan (1)
>          :              +- InMemoryRelation (2)
>          :                    +- AdaptiveSparkPlan (7)
>          :                       +- HashAggregate (6)
>          :                          +- Exchange (5)
>          :                             +- HashAggregate (4)
>          :                                +- LocalTableScan (3)
>          +- * Sort (18)
>             +- * Filter (17)
>                +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>                   +- InMemoryTableScan (11)
>                         +- InMemoryRelation (12)
>                               +- AdaptiveSparkPlan (15)
>                                  +- HashAggregate (14)
>                                     +- Exchange (13)
>                                        +- HashAggregate (4)
>                                           +- LocalTableScan (3) {code}
> *Stages DAGs materializing the same IMR instance:*
> !IMR Materialization - Stage 2.png|width=303,height=134!
> !IMR Materialization - Stage 3.png|width=303,height=134!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45451) Make the default storage level of dataset cache configurable

2023-10-06 Thread XiDuo You (Jira)
XiDuo You created SPARK-45451:
-

 Summary: Make the default storage level of dataset cache 
configurable
 Key: SPARK-45451
 URL: https://issues.apache.org/jira/browse/SPARK-45451
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You
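The ticket has no description; for background, Dataset.cache() currently defaults to
StorageLevel.MEMORY_AND_DISK, so choosing another level means calling persist explicitly at every
call site. A sketch of the difference (the config name in the comment is an assumption, not
confirmed by this ticket's text):

{code:scala}
import org.apache.spark.storage.StorageLevel

// spark-shell sketch
val df = spark.range(1000).toDF("id")

df.persist(StorageLevel.MEMORY_ONLY)   // explicit storage level, needed today at each call site
df.count()

// With a configurable default, plain .cache() could pick the level up from a session conf,
// e.g. something like: spark.conf.set("spark.sql.defaultCacheStorageLevel", "MEMORY_ONLY")
{code}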






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-06 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772729#comment-17772729
 ] 

XiDuo You commented on SPARK-45443:
---

hi [~erenavsarogullari], it seems this depends on the behavior of the RDD 
cache. Say, what happens if we materialize a cached RDD twice at the same time? 
There is some race condition in the block manager per RDD partition, which 
makes things slow. BTW, what was the behavior before we had 
TableCacheQueryStage? Did it not have this issue?
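
A minimal sketch (hypothetical sizes) of the experiment behind that question: 
fire two actions on the same cached Dataset concurrently and check whether the 
cache blocks are computed once or twice.

{code:java}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val cached = spark.range(0, 10000000L).cache()

// Two concurrent actions materialize the same cached RDD at the same time;
// the block manager resolves the per-partition race.
val jobs = Seq(Future(cached.count()), Future(cached.count()))
Await.result(Future.sequence(jobs), 10.minutes)
{code}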

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats in order to apply AQE optimizations 
> into remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting a job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing the same IMR instance (cached RDD) and the first 
> InMemoryTableScanExec's materialization takes longer, the following logic will 
> return false (inMemoryTableScan.isMaterialized => false), which may cause 
> replicated IMR materialization. This behavior is more visible when the cached 
> RDD size is large.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs use the same IMR instance
> {code:java}
> import spark.implicits._
> val arr = (1 to 12).map { i => {
> val index = i % 5
> (index, s"Employee_$index", s"Department_$index")
>   }
> }
> val df = arr.toDF("id", "name", "department")
>   .filter('id >= 0)
>   .sort("id")
>   .groupBy('id, 'name, 'department)
>   .count().as("count")
> df.persist()
> val df2 = df.sort("count").filter('count <= 2)
> val df3 = df.sort("count").filter('count >= 3)
> val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter")
> df4.show() {code}
> *Physical Plan:*
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (31)
> +- == Final Plan ==
>    CollectLimit (21)
>    +- * Project (20)
>       +- * SortMergeJoin FullOuter (19)
>          :- * Sort (10)
>          :  +- * Filter (9)
>          :     +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>          :        +- InMemoryTableScan (1)
>          :              +- InMemoryRelation (2)
>          :                    +- AdaptiveSparkPlan (7)
>          :                       +- HashAggregate (6)
>          :                          +- Exchange (5)
>          :                             +- HashAggregate (4)
>          :                                +- LocalTableScan (3)
>          +- * Sort (18)
>             +- * Filter (17)
>                +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>                   +- InMemoryTableScan (11)
>                         +- InMemoryRelation (12)
>                               +- AdaptiveSparkPlan (15)
>                                  +- HashAggregate (14)
>                                     +- Exchange (13)
>                                        +- HashAggregate (4)
>                                           +- LocalTableScan (3) {code}
> *Stages DAGs materializing the same IMR instance:*
> !IMR Materialization - Stage 2.png|width=303,height=134!
> !IMR Materialization - Stage 3.png|width=303,height=134!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45282) Join loses records for cached datasets

2023-09-26 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769383#comment-17769383
 ] 

XiDuo You commented on SPARK-45282:
---

I cannot reproduce this issue on the master branch (4.0.0). [~koert], have you 
tried the master branch?

> Join loses records for cached datasets
> --
>
> Key: SPARK-45282
> URL: https://issues.apache.org/jira/browse/SPARK-45282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
> Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or 
> databricks 13.3
>Reporter: koert kuipers
>Priority: Major
>  Labels: CorrectnessBug, correctness
>
> We observed this issue on Spark 3.4.1 but it is also present on 3.5.0. It is 
> not present on Spark 3.3.1.
> It only shows up in a distributed environment. I cannot replicate it in a unit 
> test; however, I did get it to show up on a Hadoop cluster, Kubernetes, and on 
> Databricks 13.3.
> The issue is that records are dropped when two cached dataframes are joined. 
> It seems that in Spark 3.4.1 some Exchanges are dropped from the query plan as 
> an optimization, while in Spark 3.3.1 these Exchanges are still present. It 
> seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true.
> to reproduce on distributed cluster these settings needed are:
> {code:java}
> spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
> spark.sql.adaptive.coalescePartitions.parallelismFirst false
> spark.sql.adaptive.enabled true
> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code}
> code using scala to reproduce is:
> {code:java}
> import java.util.UUID
> import org.apache.spark.sql.functions.col
> import spark.implicits._
> val data = (1 to 100).toDS().map(i => 
> UUID.randomUUID().toString).persist()
> val left = data.map(k => (k, 1))
> val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!
> println("number of left " + left.count())
> println("number of right " + right.count())
> println("number of (left join right) " +
>   left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count()
> )
> val left1 = left
>   .toDF("key", "value1")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of left1 " + left1.count())
> val right1 = right
>   .toDF("key", "value2")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of right1 " + right1.count())
> println("number of (left1 join right1) " +  left1.join(right1, 
> "key").count()) // this gives incorrect result{code}
> this produces the following output:
> {code:java}
> number of left 100
> number of right 100
> number of (left join right) 100
> number of left1 100
> number of right1 100
> number of (left1 join right1) 859531 {code}
> Note that the last number (the incorrect one) actually varies depending on 
> settings, cluster size, etc.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45244) Correct spelling in VolcanoTestsSuite

2023-09-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-45244:
-

Assignee: Binjie Yang

> Correct spelling in VolcanoTestsSuite
> -
>
> Key: SPARK-45244
> URL: https://issues.apache.org/jira/browse/SPARK-45244
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.5.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Minor
>  Labels: pull-request-available
>
> Typo in method naming checkAnnotaion



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45244) Correct spelling in VolcanoTestsSuite

2023-09-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45244.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43026
[https://github.com/apache/spark/pull/43026]

> Correct spelling in VolcanoTestsSuite
> -
>
> Key: SPARK-45244
> URL: https://issues.apache.org/jira/browse/SPARK-45244
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.5.0
>Reporter: Binjie Yang
>Assignee: Binjie Yang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Typo in method naming checkAnnotaion



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45191) InMemoryTableScanExec simpleStringWithNodeId adds columnar info

2023-09-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-45191.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42967
[https://github.com/apache/spark/pull/42967]

> InMemoryTableScanExec simpleStringWithNodeId adds columnar info
> ---
>
> Key: SPARK-45191
> URL: https://issues.apache.org/jira/browse/SPARK-45191
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> InMemoryTableScanExec supports both row-based and columnar input and output, 
> depending on the cache serializer. It would be more user-friendly to provide 
> columnar info showing whether the input/output is columnar.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45191) InMemoryTableScanExec simpleStringWithNodeId adds columnar info

2023-09-17 Thread XiDuo You (Jira)
XiDuo You created SPARK-45191:
-

 Summary: InMemoryTableScanExec simpleStringWithNodeId adds 
columnar info
 Key: SPARK-45191
 URL: https://issues.apache.org/jira/browse/SPARK-45191
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: XiDuo You


InMemoryTableScanExec supports both row-based and columnar input and output, 
depending on the cache serializer. It would be more user-friendly to provide 
columnar info showing whether the input/output is columnar.
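
A minimal sketch of where the extra info would show up: cache a DataFrame and 
print the formatted plan; the InMemoryTableScan node string is the place that 
would carry the columnar in/out hint.

{code:java}
val df = spark.range(0, 1000).toDF("id")
df.cache()
df.count()                 // materialize the cache
df.explain("formatted")    // look for the InMemoryTableScan node string
{code}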



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0

2023-07-31 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749421#comment-17749421
 ] 

XiDuo You commented on SPARK-44598:
---

please try `--conf spark.hadoopRDD.ignoreEmptySplits=false`
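
A minimal sketch (hypothetical table name) of setting that config when reading 
the Hive-on-HBase table:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-hbase-serde-table")
  // keep empty Hadoop input splits instead of skipping them
  .config("spark.hadoopRDD.ignoreEmptySplits", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.table("hbase_backed_table").show()   // hypothetical Hive table with HBase serde
{code}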

> spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize  
> is 0
> --
>
> Key: SPARK-44598
> URL: https://issues.apache.org/jira/browse/SPARK-44598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.3
>Reporter: ming95
>Priority: Major
>
> We are using Spark to read a Hive table with the HBase serde. We found that 
> when the HBase table data is relatively small (HBase StorefileSize is 0), the 
> data read by Spark 3.2 or 3.5 is empty, and there is no error message.
> But when reading with Spark 2.4 or Hive, the data can be read normally. Other 
> information shows that Spark 3.1 can also read the data normally. Can anyone 
> provide some ideas?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44579) Support Interrupt On Cancel in SQLExecution

2023-07-31 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-44579:
-

Assignee: Kent Yao

> Support Interrupt On Cancel in SQLExecution
> ---
>
> Key: SPARK-44579
> URL: https://issues.apache.org/jira/browse/SPARK-44579
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently, we support interrupting task threads for users via 1) APIs of the 
> Spark core module, and 2) a thrift config for the SQL module. Other Spark SQL 
> apps have limited access to this functionality. Specifically, the built-in 
> spark-sql-shell lacks a user-controlled knob for interrupting task threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44579) Support Interrupt On Cancel in SQLExecution

2023-07-31 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-44579.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42199
[https://github.com/apache/spark/pull/42199]

> Support Interrupt On Cancel in SQLExecution
> ---
>
> Key: SPARK-44579
> URL: https://issues.apache.org/jira/browse/SPARK-44579
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently, we support interrupting task threads for users via 1) APIs of the 
> Spark core module, and 2) a thrift config for the SQL module. Other Spark SQL 
> apps have limited access to this functionality. Specifically, the built-in 
> spark-sql-shell lacks a user-controlled knob for interrupting task threads.
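
As an illustration (not from the ticket), the core-module knob mentioned in 1) 
looks roughly like this; the job group is thread-local, so it is set on the 
thread that runs the job:

{code:java}
import scala.concurrent.{ExecutionContext, Future}
implicit val ec: ExecutionContext = ExecutionContext.global

val f = Future {
  // interruptOnCancel = true lets cancelJobGroup interrupt running task threads
  spark.sparkContext.setJobGroup("demo-group", "cancellable job group", interruptOnCancel = true)
  spark.range(0, Long.MaxValue).count()
}

Thread.sleep(2000)
spark.sparkContext.cancelJobGroup("demo-group")
{code}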



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery

2023-07-30 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-43402.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 41088
[https://github.com/apache/spark/pull/41088]

> FileSourceScanExec supports push down data filter with scalar subquery
> --
>
> Key: SPARK-43402
> URL: https://issues.apache.org/jira/browse/SPARK-43402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 4.0.0
>
>
> A scalar subquery can be pushed down as a data filter at runtime, since we 
> always execute the subquery first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery

2023-07-30 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-43402:
-

Assignee: XiDuo You

> FileSourceScanExec supports push down data filter with scalar subquery
> --
>
> Key: SPARK-43402
> URL: https://issues.apache.org/jira/browse/SPARK-43402
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>
> A scalar subquery can be pushed down as a data filter at runtime, since we 
> always execute the subquery first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43777) Coalescing partitions in AQE returns different results with row_number windows.

2023-05-24 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726025#comment-17726025
 ] 

XiDuo You commented on SPARK-43777:
---

It is actually a matter of chance: all the row counts for id 2 are 1, so the 
window ordering within that partition has no effect. The ordering of shuffle 
results is not deterministic, and neither is AQE.

If you really need a deterministic result, change the window spec to 
`orderBy($"count".desc, $"place")`.
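
A minimal sketch of that change, assuming {{resultDf}} is the aggregated frame 
from the description before the rank column is added:

{code:java}
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// tie-break on "place" so ranks within an id no longer depend on the
// non-deterministic ordering of shuffle output
val placeWindowSpec = Window
  .partitionBy("id")
  .orderBy($"count".desc, $"place")

val rankedDf = resultDf.withColumn("rank", row_number().over(placeWindowSpec))
{code}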

> Coalescing partitions in AQE returns different results with row_number 
> windows. 
> 
>
> Key: SPARK-43777
> URL: https://issues.apache.org/jira/browse/SPARK-43777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.3.2
> Environment: SBT based Spark project with unit tests running on 
> spark-testing-base. Tested with Spark 3.1.3 and 3.3.2
>Reporter: Kristopher Kane
>Priority: Major
>
> While updating our code base from 3.1 to 3.3, I had a test fail due to wrong 
> results. With 3.1, we did not proactively turn on AQE in sbt-based tests, and 
> the failure surfaced because AQE became enabled by default between 3.1 and 3.3.
> An easily reproducible test: 
> {code:java}
>     val testDataDf = Seq(
>       (1, 1, 0, 0),
>       (1, 1, 0, 0),
>       (1, 1, 0, 0),
>       (1, 0, 0, 1),
>       (1, 0, 0, 1),
>       (2, 0, 0, 0),
>       (2, 0, 1, 0),
>       (2, 1, 0, 0),
>       (3, 0, 1, 0),
>       (3, 0, 1, 0),
>     ).toDF("id", "is_attribute1", "is_attribute2", "is_attribute3")    
> val placeWindowSpec = Window
>       .partitionBy("id")
>       .orderBy($"count".desc)    
> val resultDf: DataFrame = testDataDf
>       .select("id", "is_attribute1", "is_attribute2", "is_attribute3")
>       .withColumn(
>         "place",
>         when($"is_attribute1" === 1, "attribute1")
>           .when($"is_attribute2" === 1, "attribute2")
>           .when($"is_attribute3" === 1, "attribute3")
>           .otherwise("other")
>       )
>       .groupBy("id", "place")
>       .agg(
>         functions.count("*").as("count")
>       )
>       .withColumn(
>         "rank",
>         row_number().over(placeWindowSpec)
>       )    
> resultDf.orderBy("id", "place", "rank").show() {code}
>  
> Various results based on Spark version and AQE settings: 
> {code:java}
> Spark 3.1
> Without AQE
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   2|
> |  2|attribute2|    1|   1|
> |  2|     other|    1|   3|
> |  3|attribute2|    2|   1|
> +---+--+-++
> AQE with defaults
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   2|
> |  2|attribute2|    1|   1|
> |  2|     other|    1|   3|
> |  3|attribute2|    2|   1|
> +---+--+-++
> AQE with .set("spark.sql.adaptive.coalescePartitions.enabled", "false")
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   2|
> |  2|attribute2|    1|   1|
> |  2|     other|    1|   3|
> |  3|attribute2|    2|   1|
> +---+--+-++
> AQE with .set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 
> "1") - Like Spark 3.3 with AQE defaults
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   3|
> |  2|attribute2|    1|   2|
> |  2|     other|    1|   1|
> |  3|attribute2|    2|   1|
> 
> Spark 3.3.2
> 
> AQE with defaults
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   3|
> |  2|attribute2|    1|   2|
> |  2|     other|    1|   1|
> |  3|attribute2|    2|   1|
> +---+--+-++
> 
> AQE with .set("spark.sql.adaptive.coalescePartitions.enabled", "false") - 
> This matches Spark 3.1
> +---+--+-++
> | id|     place|count|rank|
> +---+--+-++
> |  1|attribute1|    3|   1|
> |  1|attribute3|    2|   2|
> |  2|attribute1|    1|   2|
> |  2|attribute2|    1|   1|
> |  2|     other|    1|   3|
> |  3|attribute2|    2|   1|
> +---+--+-++ {code}
> As you can see, the 'rank' column of row_number(partition by, order by) 
> returns a different rank for id value 2's three attributes based on how AQE 
> coalesces partitions. 



--
This message was sent by Atla

[jira] [Created] (SPARK-43420) Make DisableUnnecessaryBucketedScan smart with table cache

2023-05-08 Thread XiDuo You (Jira)
XiDuo You created SPARK-43420:
-

 Summary: Make DisableUnnecessaryBucketedScan smart with table cache
 Key: SPARK-43420
 URL: https://issues.apache.org/jira/browse/SPARK-43420
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


If a bucketed scan has no interesting partition, or the plan contains a shuffle 
exchange, we disable it. But if the bucketed scan is inside a table cache, the 
cached plan may be accessed multiple times, so we should not disable it: keeping 
the bucketed scan preserves the output partitioning and makes the cached plan 
more likely to be reused.
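
A minimal sketch (hypothetical table names) of the scenario: the cached plan 
itself has no interesting partition, but both consumers join on the bucket 
column, so keeping the bucketed scan pays off.

{code:java}
import spark.implicits._

spark.range(0, 1000).withColumnRenamed("id", "key")
  .write.bucketBy(8, "key").sortBy("key")
  .saveAsTable("bucketed_fact")                      // hypothetical bucketed table

// filter-only cached plan: no join/aggregate on "key" inside the cache itself
val cached = spark.table("bucketed_fact").filter($"key" > 10).cache()

val dim = spark.range(0, 100).withColumnRenamed("id", "key")
cached.join(dim, "key").count()                      // first reuse of the cached plan
cached.join(dim, "key").count()                      // second reuse benefits from bucketing
{code}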



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43402) FileSourceScanExec supports push down data filter with scalar subquery

2023-05-07 Thread XiDuo You (Jira)
XiDuo You created SPARK-43402:
-

 Summary: FileSourceScanExec supports push down data filter with 
scalar subquery
 Key: SPARK-43402
 URL: https://issues.apache.org/jira/browse/SPARK-43402
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


A scalar subquery can be pushed down as a data filter at runtime, since we 
always execute the subquery first.
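
A minimal sketch (hypothetical tables) of the shape of query this targets; the 
scalar subquery is evaluated first at runtime, so its result can be turned into 
a pushed-down data filter for the file scan:

{code:java}
val q = spark.sql(
  """
    |SELECT *
    |FROM sales                                        -- hypothetical parquet table
    |WHERE amount > (SELECT max(min_amount) FROM thresholds)
  """.stripMargin)
q.explain()
{code}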



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43377) Enable spark.sql.thriftServer.interruptOnCancel by default

2023-05-04 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43377:
--
Summary: Enable spark.sql.thriftServer.interruptOnCancel by default  (was: 
Enable spark.sql.thriftServer.interruptOnCancel by defauly)

> Enable spark.sql.thriftServer.interruptOnCancel by default
> --
>
> Key: SPARK-43377
> URL: https://issues.apache.org/jira/browse/SPARK-43377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43377) Enable spark.sql.thriftServer.interruptOnCancel by defauly

2023-05-04 Thread XiDuo You (Jira)
XiDuo You created SPARK-43377:
-

 Summary: Enable spark.sql.thriftServer.interruptOnCancel by defauly
 Key: SPARK-43377
 URL: https://issues.apache.org/jira/browse/SPARK-43377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43376) Improve reuse subquery with table cache

2023-05-04 Thread XiDuo You (Jira)
XiDuo You created SPARK-43376:
-

 Summary: Improve reuse subquery with table cache
 Key: SPARK-43376
 URL: https://issues.apache.org/jira/browse/SPARK-43376
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


AQE cannot reuse a subquery if it is pushed into InMemoryTableScan.
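
A minimal sketch of the pattern, where the same scalar-subquery filter is 
applied twice on a cached relation and would ideally be executed only once:

{code:java}
import spark.implicits._

val base = spark.range(0, 1000).withColumn("bucket", $"id" % 10).cache()
base.count()                         // materialize the cache
base.createOrReplaceTempView("base_cached")

spark.sql(
  """
    |SELECT * FROM base_cached WHERE bucket > (SELECT avg(bucket) FROM base_cached)
    |UNION ALL
    |SELECT * FROM base_cached WHERE bucket > (SELECT avg(bucket) FROM base_cached)
  """.stripMargin).explain()
{code}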



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43317) Support combine adjacent aggregation

2023-04-27 Thread XiDuo You (Jira)
XiDuo You created SPARK-43317:
-

 Summary: Support combine adjacent aggregation
 Key: SPARK-43317
 URL: https://issues.apache.org/jira/browse/SPARK-43317
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


If there are adjacent aggregations in Partial and Final mode, we can combine 
them into a single Complete-mode aggregation.
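
A minimal sketch (hypothetical table) of one way Partial and Final aggregations 
end up adjacent: grouping on the bucket column needs no exchange between the 
two phases, so they could be merged into a single Complete-mode aggregation.

{code:java}
spark.range(0, 1000).write.bucketBy(8, "id").saveAsTable("bucketed_ids")   // hypothetical

// no Exchange between the partial and final HashAggregate nodes here
spark.sql("SELECT id, count(*) AS cnt FROM bucketed_ids GROUP BY id").explain()
{code}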



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43281) Fix concurrent writer does not update file metrics

2023-04-25 Thread XiDuo You (Jira)
XiDuo You created SPARK-43281:
-

 Summary: Fix concurrent writer does not update file metrics
 Key: SPARK-43281
 URL: https://issues.apache.org/jira/browse/SPARK-43281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


The concurrent writer uses the temp file path to get the file status after the 
task commits. However, the temp file has already been moved to its new path 
during task commit, so the file metrics are not updated.
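
For reference, a minimal sketch (hypothetical output path) of a write that goes 
through the concurrent-writer path, which is enabled when 
spark.sql.maxConcurrentOutputFileWriters is set above 0:

{code:java}
import spark.implicits._

spark.conf.set("spark.sql.maxConcurrentOutputFileWriters", "4")

spark.range(0, 10000)
  .withColumn("part", ($"id" % 20).cast("int"))
  .write
  .partitionBy("part")
  .mode("overwrite")
  .parquet("/tmp/concurrent_writer_demo")   // hypothetical path; check written-file metrics in the UI
{code}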



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Description: 
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`
 - unnecessary grouping key comparison after fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not reused for all 
remaining rows

 

  was:
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`
 - unnecessary grouping key comparison after fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not actually reused

 


> Improve ObjectHashAggregateExec performance for high cardinality
> 
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not reused for all 
> remaining rows
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance for high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Summary: Improve ObjectHashAggregateExec performance for high cardinality  
(was: Improve ObjectHashAggregateExec performance with high cardinality)

> Improve ObjectHashAggregateExec performance for high cardinality
> 
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance with high cardinality

2023-04-23 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Summary: Improve ObjectHashAggregateExec performance with high cardinality  
(was: Improve ObjectHashAggregateExec performance)

> Improve ObjectHashAggregateExec performance with high cardinality
> -
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance

2023-04-22 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Description: 
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`
 - unnecessary grouping key comparison after fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not actually reused

 

  was:
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`

 - unnecessary grouping key comparison after fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not actually reused

 


> Improve ObjectHashAggregateExec performance
> ---
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance

2023-04-22 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Description: 
The `ObjectHashAggregateExec` has three performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`

 - unnecessary grouping key comparison after fallback to the sort-based aggregator
 - the aggregation buffer in the sort-based aggregator is not actually reused

 

  was:
The `ObjectHashAggregateExec` has two performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`

 - unnecessary grouping key comparison after fallback to the sort-based aggregator

 


> Improve ObjectHashAggregateExec performance
> ---
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has three performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  - the aggregation buffer in the sort-based aggregator is not actually reused
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43232) Improve ObjectHashAggregateExec performance

2023-04-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-43232:
--
Description: 
The `ObjectHashAggregateExec` has two performance issues:
 - heavy overhead of Scala sugar in `createNewAggregationBuffer`

 - unnecessary grouping key comparison after fallback to the sort-based aggregator

 

  was:
The `ObjectHashAggregateExec` has two performance issues:

- heavy overhead of Scala sugar in `createNewAggregationBuffer`

- unnecessary grouping key comparison if falling back to the sort-based aggregator

 


> Improve ObjectHashAggregateExec performance
> ---
>
> Key: SPARK-43232
> URL: https://issues.apache.org/jira/browse/SPARK-43232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> The `ObjectHashAggregateExec` has two performance issues:
>  - heavy overhead of Scala sugar in `createNewAggregationBuffer`
>  - unnecessary grouping key comparison after fallback to the sort-based 
> aggregator
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43232) Improve ObjectHashAggregateExec performance

2023-04-21 Thread XiDuo You (Jira)
XiDuo You created SPARK-43232:
-

 Summary: Improve ObjectHashAggregateExec performance
 Key: SPARK-43232
 URL: https://issues.apache.org/jira/browse/SPARK-43232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


The `ObjectHashAggregateExec` has two performance issues:

- heavy overhead of Scala sugar in `createNewAggregationBuffer`

- unnecessary grouping key comparison if falling back to the sort-based aggregator
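
A minimal sketch of a query that exercises this operator: TypedImperativeAggregate 
functions such as collect_list go through ObjectHashAggregateExec, and lowering 
the fallback threshold forces the sort-based path mentioned above.

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.collect_list

// small threshold to trigger the sort-based fallback early
spark.conf.set("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", "128")

spark.range(0, 100000)
  .withColumn("key", $"id" % 10000)          // high-cardinality grouping key
  .groupBy("key")
  .agg(collect_list($"id").as("ids"))
  .count()
{code}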

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43026) Apply AQE with non-exchange table cache

2023-04-03 Thread XiDuo You (Jira)
XiDuo You created SPARK-43026:
-

 Summary: Apply AQE with non-exchange table cache
 Key: SPARK-43026
 URL: https://issues.apache.org/jira/browse/SPARK-43026
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


TableCacheQueryStageExec supports reporting runtime statistics, so it is 
possible for AQE to plan a better executed plan during re-optimization.
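
A minimal sketch of where this helps: once the (non-exchange) table cache stage 
below is materialized, its runtime statistics could let AQE re-plan the join, 
e.g. switch a sort-merge join to a broadcast join.

{code:java}
val small = spark.range(0, 100).withColumnRenamed("id", "key").cache()
small.count()                                 // materialize the table cache

val big = spark.range(0, 1000000L).withColumnRenamed("id", "key")
big.join(small, "key").count()                // AQE can use the cache's runtime size here
{code}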



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42963) Extend SparkSessionExtensions to inject rules into AQE query stage optimizer

2023-03-29 Thread XiDuo You (Jira)
XiDuo You created SPARK-42963:
-

 Summary: Extend SparkSessionExtensions to inject rules into AQE 
query stage optimizer
 Key: SPARK-42963
 URL: https://issues.apache.org/jira/browse/SPARK-42963
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42942) Support coalesce table cache stage partitions

2023-03-27 Thread XiDuo You (Jira)
XiDuo You created SPARK-42942:
-

 Summary: Support coalesce table cache stage partitions
 Key: SPARK-42942
 URL: https://issues.apache.org/jira/browse/SPARK-42942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


If a cached plan holds many small partitions, we can coalesce them just like 
we do for shuffle partitions.
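
A minimal sketch of the case in mind: a cached plan that holds many tiny 
partitions, which downstream scans would read more efficiently after coalescing.

{code:java}
val cached = spark.range(0, 1000).repartition(200).cache()
cached.count()                       // 200 very small cached partitions
cached.filter("id > 10").count()     // downstream scan reads all the tiny partitions
{code}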



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42815) Subexpression elimination support shortcut expression

2023-03-15 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-42815:
--
Summary: Subexpression elimination support shortcut expression  (was: 
Subexpression elimination support shortcut conditional expression)

> Subexpression elimination support shortcut expression
> -
>
> Key: SPARK-42815
> URL: https://issues.apache.org/jira/browse/SPARK-42815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Minor
>
> The subexpression may not need to be evaluated even if it appears more than 
> once.
> e.g., for {{if(or(a, and(b, b)))}}, the expression {{b}} would be skipped if 
> {{a}} is true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42815) Subexpression elimination support shortcut conditional expression

2023-03-15 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-42815:
--
Description: 
The subexpression may not need to be evaluated even if it appears more than once.
e.g., for {{if(or(a, and(b, b)))}}, the expression {{b}} would be skipped if 
{{a}} is true.

  was:
The subexpression in a conditional expression may not need to be evaluated even 
if it appears more than once.
e.g., for `if(or(a, and(b, b)))`, the expression `b` would be skipped if `a` is 
true.


> Subexpression elimination support shortcut conditional expression
> -
>
> Key: SPARK-42815
> URL: https://issues.apache.org/jira/browse/SPARK-42815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Minor
>
> The subexpression may not need to be evaluated even if it appears more than 
> once.
> e.g., for {{if(or(a, and(b, b)))}}, the expression {{b}} would be skipped if 
> {{a}} is true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42815) Subexpression elimination support shortcut conditional expression

2023-03-15 Thread XiDuo You (Jira)
XiDuo You created SPARK-42815:
-

 Summary: Subexpression elimination support shortcut conditional 
expression
 Key: SPARK-42815
 URL: https://issues.apache.org/jira/browse/SPARK-42815
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


The subexpression in a conditional expression may not need to be evaluated even 
if it appears more than once.
e.g., for `if(or(a, and(b, b)))`, the expression `b` would be skipped if `a` is 
true.
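
A minimal sketch of the pattern, with a hypothetical costly UDF standing in for 
the repeated subexpression; when the first disjunct is true, the OR 
short-circuits and the repeated subexpression never needs to run:

{code:java}
import org.apache.spark.sql.functions.{col, udf, when}

val costly = udf((x: Long) => { Thread.sleep(1); x * 2 })   // hypothetical expensive work

val df = spark.range(0, 100).select(
  when(col("id") < 50 || (costly(col("id")) > 10 && costly(col("id")) < 90), "hit")
    .otherwise("miss")
    .as("flag"))
df.count()
{code}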



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


