[jira] [Created] (SPARK-41048) Improve output partitioning and ordering with AQE cache
XiDuo You created SPARK-41048: - Summary: Improve output partitioning and ordering with AQE cache Key: SPARK-41048 URL: https://issues.apache.org/jira/browse/SPARK-41048 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You The cached plan in InMemoryRelation can be an AdaptiveSparkPlanExec; however, AdaptiveSparkPlanExec does not specify its output partitioning and ordering. This causes unnecessary shuffle and local sort for downstream operations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
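The effect described above can be sketched in plain Scala, independent of Spark. All names here (`OpaqueWrapper`, `DelegatingWrapper`, `Exchange`) are hypothetical stand-ins, not Spark's actual classes: a wrapper that reports unknown partitioning forces a re-shuffle downstream, while one that forwards its final plan's partitioning would not.

```scala
// Hypothetical sketch (not Spark's real classes): why a wrapper that hides
// its child's partitioning forces an unnecessary shuffle.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning
final case class HashPartitioning(cols: Seq[String], numParts: Int) extends Partitioning

trait PlanNode { def outputPartitioning: Partitioning }

// A node that already hash-partitioned its output (e.g. a cached shuffle).
final case class Exchange(cols: Seq[String], numParts: Int) extends PlanNode {
  def outputPartitioning: Partitioning = HashPartitioning(cols, numParts)
}

// Current behavior: the adaptive wrapper reports UnknownPartitioning,
// so a downstream join cannot reuse the cached distribution.
final case class OpaqueWrapper(finalPlan: PlanNode) extends PlanNode {
  def outputPartitioning: Partitioning = UnknownPartitioning
}

// Proposed behavior: delegate to the final plan's partitioning.
final case class DelegatingWrapper(finalPlan: PlanNode) extends PlanNode {
  def outputPartitioning: Partitioning = finalPlan.outputPartitioning
}

// A downstream operator inserts a shuffle whenever the child's
// reported partitioning does not match what it requires.
def needsShuffle(required: Partitioning, child: PlanNode): Boolean =
  child.outputPartitioning != required
```

With the opaque wrapper, `needsShuffle` is true even though the data is already correctly partitioned; with the delegating wrapper it is false.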
[jira] [Created] (SPARK-41112) RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter
XiDuo You created SPARK-41112: - Summary: RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter Key: SPARK-41112 URL: https://issues.apache.org/jira/browse/SPARK-41112 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You The inferred in-subquery filter should apply ColumnPruning before getting plan statistics and checking whether the plan can be broadcast. Otherwise, the final physical plan will differ from what is expected.
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632413#comment-17632413 ] XiDuo You commented on SPARK-40798: --- [~rangareddy.av...@gmail.com] sure, just go ahead > Alter partition should verify value > --- > > Key: SPARK-40798 > URL: https://issues.apache.org/jira/browse/SPARK-40798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > > {code:java} > CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int); > -- This DDL should fail, but it succeeds: > ALTER TABLE t ADD PARTITION(p='aaa'); {code}
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633429#comment-17633429 ] XiDuo You commented on SPARK-40798: --- I think the name is fine without change. You are effectively altering a partition when you insert into a partitioned table, right?
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633486#comment-17633486 ] XiDuo You commented on SPARK-40798: --- I think it's fine; the semantics of insert are `insert data` + `add partition`.
[jira] [Commented] (SPARK-40798) Alter partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633489#comment-17633489 ] XiDuo You commented on SPARK-40798: --- thank you [~rangareddy.av...@gmail.com]
[jira] [Created] (SPARK-41144) UnresolvedHint should not cause query failure
XiDuo You created SPARK-41144: - Summary: UnresolvedHint should not cause query failure Key: SPARK-41144 URL: https://issues.apache.org/jira/browse/SPARK-41144 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You {code:java} CREATE TABLE t1(c1 bigint) USING PARQUET; CREATE TABLE t2(c2 bigint) USING PARQUET; SELECT /*+ hash(t2) */ * FROM t1 join t2 on c1 = c2;{code} This query fails with: {code:java} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:147) at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4(Analyzer.scala:1005) at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4$adapted(Analyzer.scala:1005) at scala.collection.Iterator.exists(Iterator.scala:969) at scala.collection.Iterator.exists$(Iterator.scala:967) at scala.collection.AbstractIterator.exists(Iterator.scala:1431) at scala.collection.IterableLike.exists(IterableLike.scala:79) at scala.collection.IterableLike.exists$(IterableLike.scala:78) at scala.collection.AbstractIterable.exists(Iterable.scala:56) at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:1005) at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:1005) {code}
[jira] [Commented] (SPARK-41207) Regression in IntegralDivide
[ https://issues.apache.org/jira/browse/SPARK-41207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636385#comment-17636385 ] XiDuo You commented on SPARK-41207: --- Hi [~razajafri] , I cannot reproduce it with versions 3.4/3.3/3.2 {code:java} scala> val simpleSchema = StructType(Array( | | StructField("a", DecimalType(6,5),true), | | StructField("b", DecimalType(36,-5), true))) org.apache.spark.sql.AnalysisException: Negative scale is not allowed: -5. You can use spark.sql.legacy.allowNegativeScaleOfDecimal=true to enable legacy mode to allow it. at org.apache.spark.sql.errors.QueryCompilationErrors$.negativeScaleNotAllowedError(QueryCompilationErrors.scala:1679) at org.apache.spark.sql.types.DecimalType$.checkNegativeScale(DecimalType.scala:160) at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:45) ... 47 elided {code} Spark has not supported creating a decimal type with a negative scale for a long time; see SPARK-30252. BTW, the example in the description works if you set spark.sql.legacy.allowNegativeScaleOfDecimal=true. > Regression in IntegralDivide > > > Key: SPARK-41207 > URL: https://issues.apache.org/jira/browse/SPARK-41207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Minor > > There has been a regression in Integral Divide after the removal of > PromotePrecision from Spark 3.4.0. > {code:java} > scala> val data = Seq(Row(BigDecimal("-7.70892"), > BigDecimal("4.27138661282262736522411173299611831E+40"))) > scala> val simpleSchema = StructType(Array( > | StructField("a", DecimalType(6,5),true), > | StructField("b", DecimalType(36,-5), true))) > scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), > simpleSchema) > {code} > The above statements result in an AnalysisException thrown > > {code:java} > org.apache.spark.sql.AnalysisException: Decimal scale (0) cannot be greater > than precision (-4). > at > org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:2237) > at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:49) > at org.apache.spark.sql.types.DecimalType$.bounded(DecimalType.scala:164) > at > org.apache.spark.sql.catalyst.expressions.IntegralDivide.resultDecimalType(arithmetic.scala:868) > at > org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.dataType(arithmetic.scala:238) > at > org.apache.spark.sql.catalyst.expressions.IntegralDivide.org$apache$spark$sql$catalyst$expressions$DivModLike$$super$dataType(arithmetic.scala:842) > {code} > I believe this is happening because we aren't promoting the precision like we > were before this > [PR|https://github.com/apache/spark/commit/301a13963808d1ad44be5cacf0a20f65b853d5a2] > went in. Without promoting precision, the resultDecimalType in the example > above tries to return a Decimal with a precision of -4 and a scale of 0, which is > invalid
[jira] [Commented] (SPARK-41207) Regression in IntegralDivide
[ https://issues.apache.org/jira/browse/SPARK-41207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636406#comment-17636406 ] XiDuo You commented on SPARK-41207: --- I see it. If we then run `df.selectExpr("a div b").collect`, the failure happens (3.3 works). I also tried `df.selectExpr("a / b").collect`; it threw a different exception (3.3/3.2 do not work either). It seems negative-scale decimal types have some problems with DivModLike.
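The invalid type in the stack trace can be reproduced with the precision arithmetic alone. The sketch below is an assumption: it mirrors the `intDig = p1 - s1 + s2` formula that `IntegralDivide.resultDecimalType` appears to use (an integral-divide result always has scale 0), simplified to pure Scala.

```scala
// Sketch of the result-type arithmetic (assumed to mirror Spark 3.4's
// IntegralDivide.resultDecimalType; simplified, not Spark's real code).
def integralDivideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val intDig = p1 - s1 + s2 // integral digits the quotient can have
  (intDig, 0)               // `div` always yields scale 0
}

// decimal(6,5) div decimal(36,-5) => precision -4, scale 0,
// which DecimalType rejects: scale (0) cannot be greater than precision (-4).
val (prec, scale) = integralDivideResultType(6, 5, 36, -5)
```

With the old PromotePrecision path the negative scale would have been absorbed before this computation, which is why the same query type-checked in 3.3.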
[jira] [Created] (SPARK-41220) Range partitioner sample supports column pruning
XiDuo You created SPARK-41220: - Summary: Range partitioner sample supports column pruning Key: SPARK-41220 URL: https://issues.apache.org/jira/browse/SPARK-41220 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You When doing a global sort, we first sample to compute the range bounds, then use the range partitioner for the shuffle exchange. The issue is that the sample plan is coupled with the shuffle plan, so we cannot optimize the sample plan independently. The sample plan only needs the sort-order columns, but the shuffle plan contains all data columns. So, at a minimum, we can apply column pruning to the sample plan so it fetches only the ordering columns. A common example is: `OPTIMIZE table ZORDER BY columns`
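The idea can be illustrated without Spark. In this hypothetical sketch (names `Record` and `rangeBounds` are illustrative, not Spark's RangePartitioner), the sample side projects out everything except the sort key before computing bounds, which is exactly the column pruning the ticket proposes:

```scala
// Hypothetical sketch: range-bound sampling only needs the sort key,
// so the sampled side can be column-pruned.
final case class Record(key: Int, payload: String) // payload stands in for all other data columns

def rangeBounds(rows: Seq[Record], numPartitions: Int): Seq[Int] = {
  // Pruned sample: keep only the ordering column, drop the payload.
  val keys = rows.map(_.key).sorted
  val step = math.max(1, keys.size / numPartitions)
  // Pick numPartitions - 1 split points from the sorted sample.
  (1 until numPartitions).map(i => keys(math.min(i * step, keys.size - 1)))
}
```

The full rows are only needed by the shuffle itself; the sampling pass can read a much narrower projection, which matters for wide tables in workloads like `OPTIMIZE ... ZORDER BY`.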
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637201#comment-17637201 ] XiDuo You commented on SPARK-41219: --- I'm looking at this > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} >
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637216#comment-17637216 ] XiDuo You commented on SPARK-41219: --- It seems the root cause is that Decimal.toPrecision breaks when changing to decimal(0, 0) {code:java} val df = Seq(0).toDF("a") // return 0 df.selectExpr("cast(0 as decimal(0,0))").show // return null df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show {code}
[jira] [Comment Edited] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637216#comment-17637216 ] XiDuo You edited comment on SPARK-41219 at 11/22/22 11:59 AM: -- It seems the root cause is that Decimal.toPrecision breaks when changing to decimal(0, 0) {code:java} val df = Seq(0).toDF("a") // return 0 df.selectExpr("cast(0 as decimal(0,0))").show // return 0 df.select(lit(BigDecimal(0)) as "c").show // return null df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show {code}
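One detail worth noting here, offered as an observation about `java.math` rather than a confirmed root cause: a `MathContext` constructed with precision 0 means *unlimited* precision, so any rounding path that derives its context from a `decimal(0,0)` target would silently become a no-op instead of rounding or failing.

```scala
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

// Per the java.math.MathContext contract, precision 0 means "unlimited",
// so "rounding to precision 0" leaves the value untouched.
val mc = new MathContext(0, RoundingMode.HALF_UP)
val rounded = new JBigDecimal("123.456").round(mc)
// rounded still carries every digit of the input
```

If Spark's `Decimal.toPrecision` builds a similar context for the `decimal(0,0)` target, that would explain a value surviving the "cast" and then failing a later validity check; this is speculative and would need checking against the actual implementation.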
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637705#comment-17637705 ] XiDuo You commented on SPARK-41219: --- [~razajafri] it would use `resultDecimalType`. Before converting to Long, the result is a Decimal; see the code in `DivModLike`.
[jira] [Created] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default
XiDuo You created SPARK-41262: - Summary: Enable canChangeCachedPlanOutputPartitioning by default Key: SPARK-41262 URL: https://issues.apache.org/jira/browse/SPARK-41262 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Tune spark.sql.optimizer.canChangeCachedPlanOutputPartitioning from false to true by default to make AQE work with cached plans.
[jira] [Updated] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default
[ https://issues.apache.org/jira/browse/SPARK-41262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41262: -- Description: Remove the `internal` tag of `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from false to true by default to make AQE work with cached plan. (was: Tune spark.sql.optimizer.canChangeCachedPlanOutputPartitioning from false to true by default to make AQE work with cached plan.) > Enable canChangeCachedPlanOutputPartitioning by default > --- > > Key: SPARK-41262 > URL: https://issues.apache.org/jira/browse/SPARK-41262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Remove the `internal` tag of > `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from > false to true by default to make AQE work with cached plan.
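On releases where the default is still false, the behavior can be opted into per session. A minimal spark-shell fragment (configuration only; assumes a running `spark` session; the config key is the one named in this ticket):

```scala
// Let AQE change the output partitioning of cached plans explicitly;
// this ticket proposes making true the default.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
```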
[jira] [Created] (SPARK-41407) Pull out v1 write to WriteFiles
XiDuo You created SPARK-41407: - Summary: Pull out v1 write to WriteFiles Key: SPARK-41407 URL: https://issues.apache.org/jira/browse/SPARK-41407 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Add a new plan, WriteFiles, to perform the file writing for v1 writes. This will let v1 writes support whole-stage codegen in the future.
[jira] [Created] (SPARK-41511) LongToUnsafeRowMap support ignoresDuplicatedKey
XiDuo You created SPARK-41511: - Summary: LongToUnsafeRowMap support ignoresDuplicatedKey Key: SPARK-41511 URL: https://issues.apache.org/jira/browse/SPARK-41511 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You For left semi and left anti hash joins, duplicated keys on the build side have no meaning. Previously, we supported ignoring duplicated keys for UnsafeHashedRelation. We can apply the same optimization to LongHashedRelation.
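The optimization can be sketched independently of Spark's hashed relations (all names below are hypothetical, not Spark's APIs): a left semi or left anti join only asks whether a key exists on the build side, so the build map may keep a single row per key and drop duplicates.

```scala
// Hypothetical sketch: a build-side map that ignores duplicated keys,
// which is sufficient for existence-only (semi/anti) joins.
def buildIgnoringDuplicates(rows: Seq[(Long, String)]): Map[Long, String] =
  rows.foldLeft(Map.empty[Long, String]) { case (m, (k, v)) =>
    if (m.contains(k)) m else m + (k -> v) // keep first row, drop duplicated key
  }

// Left semi join only probes for key existence; the dropped duplicate
// rows could never change the result.
def leftSemi(probe: Seq[Long], build: Map[Long, String]): Seq[Long] =
  probe.filter(build.contains)
```

Keeping one row per key shrinks the build side and speeds up probing, with no effect on semi/anti join results; for other join types the duplicates would of course be required.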
[jira] [Created] (SPARK-41708) Pull v1write information to write file node
XiDuo You created SPARK-41708: - Summary: Pull v1write information to write file node Key: SPARK-41708 URL: https://issues.apache.org/jira/browse/SPARK-41708 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Make WriteFiles hold v1 write information
[jira] [Created] (SPARK-41713) Make CTAS hold a nested execution for data writing
XiDuo You created SPARK-41713: - Summary: Make CTAS hold a nested execution for data writing Key: SPARK-41713 URL: https://issues.apache.org/jira/browse/SPARK-41713 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Decouple the create-table command from the data-writing command.
[jira] [Created] (SPARK-41726) Remove OptimizedCreateHiveTableAsSelectCommand
XiDuo You created SPARK-41726: - Summary: Remove OptimizedCreateHiveTableAsSelectCommand Key: SPARK-41726 URL: https://issues.apache.org/jira/browse/SPARK-41726 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You CTAS now uses a nested execution to do the data writing, so OptimizedCreateHiveTableAsSelectCommand is unnecessary. The inner InsertIntoHiveTable is converted to InsertIntoHadoopFsRelationCommand where possible.
[jira] [Created] (SPARK-41763) Improve v1 writes
XiDuo You created SPARK-41763: - Summary: Improve v1 writes Key: SPARK-41763 URL: https://issues.apache.org/jira/browse/SPARK-41763 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You This umbrella is used to track all tickets about the changes of v1 writes.
[jira] [Updated] (SPARK-37287) Pull out dynamic partition and bucket sort from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-37287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37287: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Pull out dynamic partition and bucket sort from FileFormatWriter > > > Key: SPARK-37287 > URL: https://issues.apache.org/jira/browse/SPARK-37287 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > `FileFormatWriter.write` is now used by all v1 writes, which include > datasource and Hive tables. However, it contains a sort based on > dynamic partition and bucket columns that cannot be seen in the plan directly. > V2 writes have a better approach: they satisfy the required ordering or even > distribution by using the rule `V2Writes`. > V1 writes should do the same thing as v2 writes.
[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37194: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition > > > Key: SPARK-37194 > URL: https://issues.apache.org/jira/browse/SPARK-37194 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > `FileFormatWriter.write` sorts on the partition and bucket columns before > writing. I think this code path assumed the input `partitionColumns` are > dynamic, but actually they are not. It is now used by three code paths: > - `FileStreamSink`; its partitions should always be dynamic > - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has > removed the static partitions and `InsertIntoHiveDirCommand` has no partitions > - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into > `FileFormatWriter.write` without removing static partitions, because we need them > to generate the partition path in `DynamicPartitionDataWriter` > This shows that the unnecessary sort only affects > `InsertIntoHadoopFsRelationCommand` when we write data with static partitions.
[jira] [Updated] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-40107: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Pull out empty2null conversion from FileFormatWriter > > > Key: SPARK-40107 > URL: https://issues.apache.org/jira/browse/SPARK-40107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > This is a follow-up for SPARK-37287. We can pull out the physical project to > convert empty string partition columns to null in `FileFormatWriter` into > logical planning as well.
[jira] [Updated] (SPARK-40201) Improve v1 write test coverage
[ https://issues.apache.org/jira/browse/SPARK-40201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-40201: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Improve v1 write test coverage > -- > > Key: SPARK-40201 > URL: https://issues.apache.org/jira/browse/SPARK-40201 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Make the v1 write tests work across all SQL tests
[jira] [Updated] (SPARK-41407) Pull out v1 write to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41407: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Pull out v1 write to WriteFiles > --- > > Key: SPARK-41407 > URL: https://issues.apache.org/jira/browse/SPARK-41407 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > Add a new plan, WriteFiles, to perform the file writing for v1 writes. > This will let v1 writes support whole-stage codegen in the future.
[jira] [Updated] (SPARK-41713) Make CTAS hold a nested execution for data writing
[ https://issues.apache.org/jira/browse/SPARK-41713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41713: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Make CTAS hold a nested execution for data writing > -- > > Key: SPARK-41713 > URL: https://issues.apache.org/jira/browse/SPARK-41713 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > > Decouple the create-table command from the data-writing command.
[jira] [Updated] (SPARK-41726) Remove OptimizedCreateHiveTableAsSelectCommand
[ https://issues.apache.org/jira/browse/SPARK-41726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41726: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Remove OptimizedCreateHiveTableAsSelectCommand > -- > > Key: SPARK-41726 > URL: https://issues.apache.org/jira/browse/SPARK-41726 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > CTAS now uses a nested execution for data writing, so it is unnecessary to have > OptimizedCreateHiveTableAsSelectCommand. The inner InsertIntoHiveTable would > be converted to InsertIntoHadoopFsRelationCommand where possible.
[jira] [Updated] (SPARK-41708) Pull v1write information to write file node
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41708: -- Parent: SPARK-41763 Issue Type: Sub-task (was: Improvement) > Pull v1write information to write file node > --- > > Key: SPARK-41708 > URL: https://issues.apache.org/jira/browse/SPARK-41708 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Make WriteFiles hold v1 write information
[jira] [Created] (SPARK-41765) Move v1 write metrics to WriteFiles
XiDuo You created SPARK-41765: - Summary: Move v1 write metrics to WriteFiles Key: SPARK-41765 URL: https://issues.apache.org/jira/browse/SPARK-41765 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You Move the metrics that are in `InsertIntoHiveTable` and `InsertIntoHadoopFsRelationCommand` to `WriteFiles`.
[jira] [Commented] (SPARK-41763) Improve v1 writes
[ https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652692#comment-17652692 ] XiDuo You commented on SPARK-41763: --- cc [~allisonwang] [~cloud_fan] FYI > Improve v1 writes > - > > Key: SPARK-41763 > URL: https://issues.apache.org/jira/browse/SPARK-41763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > This umbrella tracks all tickets related to the v1 writes changes.
[jira] [Updated] (SPARK-41708) Pull v1write information to WriteFiles
[ https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41708: -- Summary: Pull v1write information to WriteFiles (was: Pull v1write information to write file node) > Pull v1write information to WriteFiles > -- > > Key: SPARK-41708 > URL: https://issues.apache.org/jira/browse/SPARK-41708 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > Make WriteFiles hold v1 write information
[jira] [Created] (SPARK-41867) Selective predicate should respect InMemoryRelation
XiDuo You created SPARK-41867: - Summary: Selective predicate should respect InMemoryRelation Key: SPARK-41867 URL: https://issues.apache.org/jira/browse/SPARK-41867 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You DPP and Runtime Filter require that the build side has a selective predicate. This should also work when the filter is inside the cached relation.
[jira] [Updated] (SPARK-41867) Selective predicate should respect InMemoryRelation
[ https://issues.apache.org/jira/browse/SPARK-41867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-41867: -- Description: DPP and Runtime Filter require that the build side has a selective predicate. This should also work when the filter is inside the cached relation. (was: DDP and Runtime Filter require the build side has a selective predicate. It should also work if the filter is inside the cached relation.) > Selective predicate should respect InMemoryRelation > --- > > Key: SPARK-41867 > URL: https://issues.apache.org/jira/browse/SPARK-41867 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > DPP and Runtime Filter require that the build side has a selective predicate. This > should also work when the filter is inside the cached relation.
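The gist of SPARK-41867 can be sketched as a plan-tree check that, when deciding whether the build side has a selective predicate (for DPP or a runtime filter), also looks through a cached relation at the plan it caches. This is a toy model with made-up class names, not Spark's Catalyst code.

```python
# Toy check: a selective filter hidden inside a cached relation counts.
from dataclasses import dataclass, field

@dataclass
class Plan:
    children: list = field(default_factory=list)

@dataclass
class Filter(Plan):
    selective: bool = True

@dataclass
class InMemoryRelation(Plan):
    cached_plan: Plan = None  # the plan whose result is cached

def has_selective_predicate(plan: Plan) -> bool:
    if isinstance(plan, Filter) and plan.selective:
        return True
    if isinstance(plan, InMemoryRelation) and plan.cached_plan is not None:
        # The fix: recurse into the cached plan instead of stopping here.
        return has_selective_predicate(plan.cached_plan)
    return any(has_selective_predicate(c) for c in plan.children)

assert has_selective_predicate(InMemoryRelation(cached_plan=Filter()))
assert not has_selective_predicate(Plan())
```

Without the recursion into `cached_plan`, a cached build side would look filter-free and the optimization would be skipped even though the cached data is already selectively filtered.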