[jira] [Created] (SPARK-41048) Improve output partitioning and ordering with AQE cache

2022-11-08 Thread XiDuo You (Jira)
XiDuo You created SPARK-41048:
-

 Summary: Improve output partitioning and ordering with AQE cache 
 Key: SPARK-41048
 URL: https://issues.apache.org/jira/browse/SPARK-41048
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


The cached plan in InMemoryRelation can be an AdaptiveSparkPlanExec; however, 
AdaptiveSparkPlanExec does not specify its output partitioning and ordering. This 
causes an unnecessary shuffle and local sort for downstream operations.
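
As an illustrative sketch (table and column names are hypothetical), the extra 
exchange can show up like this in a spark-shell session:
{code:java}
// The cached plan below is wrapped in AdaptiveSparkPlanExec when AQE applies
// to cached plans. Because AdaptiveSparkPlanExec reports UnknownPartitioning,
// the downstream aggregation may insert another Exchange on k even though the
// cached data is already partitioned by k.
val df = spark.range(100).selectExpr("id % 10 AS k", "id AS v")
  .repartition($"k")
  .cache()

df.groupBy($"k").count().explain()
{code}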

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41112) RuntimeFilter should apply ColumnPruning eagerly with in-subquery filter

2022-11-10 Thread XiDuo You (Jira)
XiDuo You created SPARK-41112:
-

 Summary: RuntimeFilter should apply ColumnPruning eagerly with 
in-subquery filter
 Key: SPARK-41112
 URL: https://issues.apache.org/jira/browse/SPARK-41112
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


The inferred in-subquery filter should apply ColumnPruning before getting plan 
statistics and checking whether it can be broadcast. Otherwise, the final physical 
plan will differ from what is expected.






[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-11 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632413#comment-17632413
 ] 

XiDuo You commented on SPARK-40798:
---

[~rangareddy.av...@gmail.com] sure, just go ahead

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}






[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633429#comment-17633429
 ] 

XiDuo You commented on SPARK-40798:
---

I think the name is fine without change. You are actually altering a partition 
when you insert into a partitioned table, right?

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}






[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633486#comment-17633486
 ] 

XiDuo You commented on SPARK-40798:
---

I think it's fine; the semantics of insert are `insert data` + `add partition`.

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}






[jira] [Commented] (SPARK-40798) Alter partition should verify value

2022-11-13 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633489#comment-17633489
 ] 

XiDuo You commented on SPARK-40798:
---

thank you [~rangareddy.av...@gmail.com] 

> Alter partition should verify value
> ---
>
> Key: SPARK-40798
> URL: https://issues.apache.org/jira/browse/SPARK-40798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> {code:java}
> CREATE TABLE t (c int) USING PARQUET PARTITIONED BY(p int);
> -- This DDL should fail but worked:
> ALTER TABLE t ADD PARTITION(p='aaa'); {code}






[jira] [Created] (SPARK-41144) UnresolvedHint should not cause query failure

2022-11-14 Thread XiDuo You (Jira)
XiDuo You created SPARK-41144:
-

 Summary: UnresolvedHint should not cause query failure
 Key: SPARK-41144
 URL: https://issues.apache.org/jira/browse/SPARK-41144
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


 
{code:java}
CREATE TABLE t1(c1 bigint) USING PARQUET;
CREATE TABLE t2(c2 bigint) USING PARQUET;
SELECT /*+ hash(t2) */ * FROM t1 join t2 on c1 = c2;{code}
 

 

This fails with:
{code:java}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
exprId on unresolved object
  at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:147)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4(Analyzer.scala:1005)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4$adapted(Analyzer.scala:1005)
  at scala.collection.Iterator.exists(Iterator.scala:969)
  at scala.collection.Iterator.exists$(Iterator.scala:967)
  at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
  at scala.collection.IterableLike.exists(IterableLike.scala:79)
  at scala.collection.IterableLike.exists$(IterableLike.scala:78)
  at scala.collection.AbstractIterable.exists(Iterable.scala:56)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:1005)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:1005)
 {code}






[jira] [Commented] (SPARK-41207) Regression in IntegralDivide

2022-11-20 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636385#comment-17636385
 ] 

XiDuo You commented on SPARK-41207:
---

Hi [~razajafri], I cannot reproduce it with versions 3.4/3.3/3.2:
{code:java}
scala> val simpleSchema = StructType(Array(
     |   StructField("a", DecimalType(6,5), true),
     |   StructField("b", DecimalType(36,-5), true)))
org.apache.spark.sql.AnalysisException: Negative scale is not allowed: -5. You 
can use spark.sql.legacy.allowNegativeScaleOfDecimal=true to enable legacy mode 
to allow it.
  at 
org.apache.spark.sql.errors.QueryCompilationErrors$.negativeScaleNotAllowedError(QueryCompilationErrors.scala:1679)
  at 
org.apache.spark.sql.types.DecimalType$.checkNegativeScale(DecimalType.scala:160)
  at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:45)
  ... 47 elided {code}
 
Spark has not supported creating a decimal type with a negative scale for a long 
time; see SPARK-30252.
BTW, the example in the description works if you set 
spark.sql.legacy.allowNegativeScaleOfDecimal=true.
 

> Regression in IntegralDivide
> 
>
> Key: SPARK-41207
> URL: https://issues.apache.org/jira/browse/SPARK-41207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Minor
>
> There has been a regression in Integral Divide after the removal of 
> PromotePrecision from Spark 3.4.0.
> {code:java}
> scala> val data = Seq(Row(BigDecimal("-7.70892"), 
> BigDecimal("4.27138661282262736522411173299611831E+40")))
> scala> val simpleSchema = StructType(Array(      
> | StructField("a", DecimalType(6,5),true),      
> | StructField("b", DecimalType(36,-5), true)))
> scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), 
> simpleSchema)
> {code}
>  The above statements result in an AnalysisException thrown
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: Decimal scale (0) cannot be greater 
> than precision (-4).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:2237)
>   at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:49)
>   at org.apache.spark.sql.types.DecimalType$.bounded(DecimalType.scala:164)
>   at 
> org.apache.spark.sql.catalyst.expressions.IntegralDivide.resultDecimalType(arithmetic.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.dataType(arithmetic.scala:238)
>   at 
> org.apache.spark.sql.catalyst.expressions.IntegralDivide.org$apache$spark$sql$catalyst$expressions$DivModLike$$super$dataType(arithmetic.scala:842)
> {code}
> I believe this is happening because we aren't promoting the precision like we 
> were before this 
> [PR|https://github.com/apache/spark/commit/301a13963808d1ad44be5cacf0a20f65b853d5a2]
>  went in. Without promoting precision the resultDecimalType in the example 
> above tries to return a Decimal with precision of -4 and scale of 0 which is 
> invalid






[jira] [Commented] (SPARK-41207) Regression in IntegralDivide

2022-11-20 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636406#comment-17636406
 ] 

XiDuo You commented on SPARK-41207:
---

I see it. If we then run `df.selectExpr("a div b").collect`, the failure happens 
(3.3 works).

I also tried `df.selectExpr("a / b").collect`; it threw a different exception 
(3.3/3.2 do not work either).

It seems negative-scale decimal types cause some problems in DivModLike.
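
As a back-of-the-envelope check (this assumes the result precision is roughly 
`p1 - s1 + s2`, inferred from the stack trace rather than quoted from the Spark 
source):
{code:java}
// For decimal(6,5) div decimal(36,-5):
val p1 = 6; val s1 = 5; val s2 = -5
val resultPrecision = p1 - s1 + s2  // 6 - 5 + (-5) = -4
// consistent with the reported "Decimal scale (0) cannot be greater than
// precision (-4)" once the negative scale is allowed.
{code}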

> Regression in IntegralDivide
> 
>
> Key: SPARK-41207
> URL: https://issues.apache.org/jira/browse/SPARK-41207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Minor
>
> There has been a regression in Integral Divide after the removal of 
> PromotePrecision from Spark 3.4.0.
> {code:java}
> scala> val data = Seq(Row(BigDecimal("-7.70892"), 
> BigDecimal("4.27138661282262736522411173299611831E+40")))
> scala> val simpleSchema = StructType(Array(      
> | StructField("a", DecimalType(6,5),true),      
> | StructField("b", DecimalType(36,-5), true)))
> scala> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), 
> simpleSchema)
> {code}
>  The above statements result in an AnalysisException thrown
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: Decimal scale (0) cannot be greater 
> than precision (-4).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:2237)
>   at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:49)
>   at org.apache.spark.sql.types.DecimalType$.bounded(DecimalType.scala:164)
>   at 
> org.apache.spark.sql.catalyst.expressions.IntegralDivide.resultDecimalType(arithmetic.scala:868)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.dataType(arithmetic.scala:238)
>   at 
> org.apache.spark.sql.catalyst.expressions.IntegralDivide.org$apache$spark$sql$catalyst$expressions$DivModLike$$super$dataType(arithmetic.scala:842)
> {code}
> I believe this is happening because we aren't promoting the precision like we 
> were before this 
> [PR|https://github.com/apache/spark/commit/301a13963808d1ad44be5cacf0a20f65b853d5a2]
>  went in. Without promoting precision the resultDecimalType in the example 
> above tries to return a Decimal with precision of -4 and scale of 0 which is 
> invalid






[jira] [Created] (SPARK-41220) Range partitioner sample supports column pruning

2022-11-22 Thread XiDuo You (Jira)
XiDuo You created SPARK-41220:
-

 Summary: Range partitioner sample supports column pruning
 Key: SPARK-41220
 URL: https://issues.apache.org/jira/browse/SPARK-41220
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


When doing a global sort, we first sample the input to compute the range bounds, 
then use the range partitioner to do the shuffle exchange.
The issue is that the sample plan is coupled with the shuffle plan, which prevents 
us from optimizing the sample plan. The sample plan only needs the sort-order 
columns, but the shuffle plan contains all data columns. So, at a minimum, we can 
apply column pruning to the sample plan so it fetches only the ordering columns.

A common example is: `OPTIMIZE table ZORDER BY columns`
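
A rough sketch of the pattern (path and names are illustrative):
{code:java}
// Sorting globally plans a RangePartitioning exchange, which first samples the
// input to compute range bounds. Only the ordering column k is needed for that
// sample, but without column pruning the sample reads the wide payload too.
val wide = spark.range(1000).selectExpr("id AS k", "repeat('x', 100) AS payload")
wide.orderBy($"k").write.mode("overwrite").parquet("/tmp/sorted")
{code}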








[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637201#comment-17637201
 ] 

XiDuo You commented on SPARK-41219:
---

I'm looking at this

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4 Integral Divide
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                             null|
> |                             null|
> +-+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                                0|
> |                                0|
> +-+
> {code}
>  






[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637216#comment-17637216
 ] 

XiDuo You commented on SPARK-41219:
---

It seems the root cause is that decimal.toPrecision breaks when converting to 
decimal(0, 0):

{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4 Integral Divide
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                             null|
> |                             null|
> +-+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                                0|
> |                                0|
> +-+
> {code}
>  






[jira] [Comment Edited] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-22 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637216#comment-17637216
 ] 

XiDuo You edited comment on SPARK-41219 at 11/22/22 11:59 AM:
--

It seems the root cause is that decimal.toPrecision breaks when converting to 
decimal(0, 0):
{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return 0
df.select(lit(BigDecimal(0)) as "c").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}


was (Author: ulysses):
it seems the root reason is decimal.toPrecision will break when change to 
decimal(0, 0)

{code:java}
val df = Seq(0).toDF("a")
// return 0
df.selectExpr("cast(0 as decimal(0,0))").show
// return null
df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show
{code}

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4 Integral Divide
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                             null|
> |                             null|
> +-+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                                0|
> |                                0|
> +-+
> {code}
>  






[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0

2022-11-23 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637705#comment-17637705
 ] 

XiDuo You commented on SPARK-41219:
---

[~razajafri] it would use `resultDecimalType`. Before being converted to Long, the 
result is a Decimal. See the code in `DivModLike`.
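
Rough arithmetic for this case (hedged: the exact rule lives in 
`IntegralDivide.resultDecimalType`; this just assumes a result precision of about 
`p1 - s1 + s2`):
{code:java}
// decimal(7,7) div 100 (an integral divisor with scale 0):
val p1 = 7; val s1 = 7; val s2 = 0
val resultPrecision = p1 - s1 + s2  // 0, i.e. decimal(0,0)
// the type that decimal.toPrecision fails to convert to, yielding null.
{code}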

> Regression in IntegralDivide returning null instead of 0
> 
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Raza Jafri
>Priority: Major
>
> There seems to be a regression in Spark 3.4 Integral Divide
>  
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                             null|
> |                             null|
> +-+
> {code}
>  
> While in Spark 3.3.0
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +-+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +-+
> |                                0|
> |                                0|
> +-+
> {code}
>  






[jira] [Created] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default

2022-11-24 Thread XiDuo You (Jira)
XiDuo You created SPARK-41262:
-

 Summary: Enable canChangeCachedPlanOutputPartitioning by default
 Key: SPARK-41262
 URL: https://issues.apache.org/jira/browse/SPARK-41262
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Change spark.sql.optimizer.canChangeCachedPlanOutputPartitioning from false to 
true by default to make AQE work with cached plans.
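
Until the default changes, the behavior can be enabled per session (config name 
taken from this ticket):
{code:java}
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
{code}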






[jira] [Updated] (SPARK-41262) Enable canChangeCachedPlanOutputPartitioning by default

2022-11-25 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41262:
--
Description: Remove the `internal` tag of 
`spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from 
false to true by default to make AQE work with cached plan.  (was: Tune 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning from false to true by 
default to make AQE work with cached plan.)

> Enable canChangeCachedPlanOutputPartitioning by default
> ---
>
> Key: SPARK-41262
> URL: https://issues.apache.org/jira/browse/SPARK-41262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Remove the `internal` tag of 
> `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`, and tune it from 
> false to true by default to make AQE work with cached plan.






[jira] [Created] (SPARK-41407) Pull out v1 write to WriteFiles

2022-12-06 Thread XiDuo You (Jira)
XiDuo You created SPARK-41407:
-

 Summary: Pull out v1 write to WriteFiles
 Key: SPARK-41407
 URL: https://issues.apache.org/jira/browse/SPARK-41407
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Add a new plan node, WriteFiles, to perform the file writing for v1 writes.

This will allow v1 writes to support whole-stage codegen in the future.






[jira] [Created] (SPARK-41511) LongToUnsafeRowMap support ignoresDuplicatedKey

2022-12-13 Thread XiDuo You (Jira)
XiDuo You created SPARK-41511:
-

 Summary: LongToUnsafeRowMap support ignoresDuplicatedKey
 Key: SPARK-41511
 URL: https://issues.apache.org/jira/browse/SPARK-41511
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


For left semi and left anti hash joins, duplicated keys on the build side have 
no meaning.

Previously, we supported ignoring duplicated keys for UnsafeHashedRelation. We can 
apply the same optimization to LongHashedRelation.
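
For illustration (hypothetical tables): in a left semi join, each probe-side row 
is emitted at most once regardless of how many times its key occurs on the build 
side, so the hashed relation can keep a single row per key:
{code:java}
-- l is the probe side, r the build side; duplicates of k in r do not change
-- the result, only whether a match exists.
SELECT l.* FROM l LEFT SEMI JOIN r ON l.k = r.k;
{code}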






[jira] [Created] (SPARK-41708) Pull v1write information to write file node

2022-12-25 Thread XiDuo You (Jira)
XiDuo You created SPARK-41708:
-

 Summary: Pull v1write information to write file node
 Key: SPARK-41708
 URL: https://issues.apache.org/jira/browse/SPARK-41708
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Make WriteFiles hold the v1 write information.






[jira] [Created] (SPARK-41713) Make CTAS hold a nested execution for data writing

2022-12-25 Thread XiDuo You (Jira)
XiDuo You created SPARK-41713:
-

 Summary: Make CTAS hold a nested execution for data writing
 Key: SPARK-41713
 URL: https://issues.apache.org/jira/browse/SPARK-41713
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Decouple the create-table command from the data-writing command.
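
Conceptually (illustrative table names), a CTAS such as:
{code:java}
CREATE TABLE t2 USING PARQUET AS SELECT * FROM t1;
{code}
would then execute as a create-table step plus a nested execution that performs 
the data writing, roughly equivalent to a CREATE TABLE followed by an INSERT INTO.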






[jira] [Created] (SPARK-41726) Remove OptimizedCreateHiveTableAsSelectCommand

2022-12-26 Thread XiDuo You (Jira)
XiDuo You created SPARK-41726:
-

 Summary: Remove OptimizedCreateHiveTableAsSelectCommand
 Key: SPARK-41726
 URL: https://issues.apache.org/jira/browse/SPARK-41726
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


CTAS now uses a nested execution to do the data writing, so 
OptimizedCreateHiveTableAsSelectCommand is unnecessary. The inner InsertIntoHiveTable 
would be converted to InsertIntoHadoopFsRelationCommand when possible.






[jira] [Created] (SPARK-41763) Improve v1 writes

2022-12-28 Thread XiDuo You (Jira)
XiDuo You created SPARK-41763:
-

 Summary: Improve v1 writes
 Key: SPARK-41763
 URL: https://issues.apache.org/jira/browse/SPARK-41763
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


This umbrella tracks all tickets related to the v1 writes changes.






[jira] [Updated] (SPARK-37287) Pull out dynamic partition and bucket sort from FileFormatWriter

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37287:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Pull out dynamic partition and bucket sort from FileFormatWriter
> 
>
> Key: SPARK-37287
> URL: https://issues.apache.org/jira/browse/SPARK-37287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> `FileFormatWriter.write` now is used by all V1 write which includes 
> datasource and hive table. However it contains a sort which is based on 
> dynamic partition and bucket columns that can not be seen in plan directly.
> V2 write has a better approach that it satisfies the order or even 
> distribution by using rule `V2Writes`.
> V1 write should do the similar thing with V2 write.
>  






[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37194:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> `FileFormatWriter.write` will sort the partition and bucket column before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic but actually it's not. It now is used by three code path:
>  - `FileStreamSink`; it should be always dynamic partition
>  - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has 
> removed the static partition and `InsertIntoHiveDirCommand` has no partition
>  - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into 
> `FileFormatWriter.write` without removing static partition because we need it 
> to generate the partition path in `DynamicPartitionDataWriter`
> It shows that the unnecessary sort only affected the 
> `InsertIntoHadoopFsRelationCommand` if we write data with static partition.
>  






[jira] [Updated] (SPARK-40107) Pull out empty2null conversion from FileFormatWriter

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-40107:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Pull out empty2null conversion from FileFormatWriter
> 
>
> Key: SPARK-40107
> URL: https://issues.apache.org/jira/browse/SPARK-40107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> This is a follow-up for SPARK-37287. We can pull out the physical project to 
> convert empty string partition columns to null in `FileFormatWriter` into 
> logical planning as well.






[jira] [Updated] (SPARK-40201) Improve v1 write test coverage

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-40201:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Improve v1 write test coverage
> --
>
> Key: SPARK-40201
> URL: https://issues.apache.org/jira/browse/SPARK-40201
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Make the v1 write tests work across all SQL tests






[jira] [Updated] (SPARK-41407) Pull out v1 write to WriteFiles

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41407:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Pull out v1 write to WriteFiles
> ---
>
> Key: SPARK-41407
> URL: https://issues.apache.org/jira/browse/SPARK-41407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Add a new plan, WriteFiles, to write files for v1 writes.
> We can make v1 writes support whole-stage codegen in the future.






[jira] [Updated] (SPARK-41713) Make CTAS hold a nested execution for data writing

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41713:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Make CTAS hold a nested execution for data writing
> --
>
> Key: SPARK-41713
> URL: https://issues.apache.org/jira/browse/SPARK-41713
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.0
>
>
> Decouple the create-table command from the data-writing command






[jira] [Updated] (SPARK-41726) Remove OptimizedCreateHiveTableAsSelectCommand

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41726:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Remove OptimizedCreateHiveTableAsSelectCommand
> --
>
> Key: SPARK-41726
> URL: https://issues.apache.org/jira/browse/SPARK-41726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> CTAS uses a nested execution to do the data writing, so 
> OptimizedCreateHiveTableAsSelectCommand is unnecessary. The inner 
> InsertIntoHiveTable would be converted to InsertIntoHadoopFsRelationCommand 
> if possible.






[jira] [Updated] (SPARK-41708) Pull v1write information to write file node

2022-12-28 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41708:
--
Parent: SPARK-41763
Issue Type: Sub-task  (was: Improvement)

> Pull v1write information to write file node
> ---
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Make WriteFiles hold v1 write information






[jira] [Created] (SPARK-41765) Move v1 write metrics to WriteFiles

2022-12-28 Thread XiDuo You (Jira)
XiDuo You created SPARK-41765:
-

 Summary: Move v1 write metrics to WriteFiles
 Key: SPARK-41765
 URL: https://issues.apache.org/jira/browse/SPARK-41765
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


Move the metrics that are in `InsertIntoHiveTable` and 
`InsertIntoHadoopFsRelationCommand` to `WriteFiles`






[jira] [Commented] (SPARK-41763) Improve v1 writes

2022-12-28 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652692#comment-17652692
 ] 

XiDuo You commented on SPARK-41763:
---

cc [~allisonwang] [~cloud_fan] FYI

> Improve v1 writes
> -
>
> Key: SPARK-41763
> URL: https://issues.apache.org/jira/browse/SPARK-41763
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> This umbrella is used to track all tickets about changes to v1 writes.






[jira] [Updated] (SPARK-41708) Pull v1write information to WriteFiles

2022-12-29 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41708:
--
Summary: Pull v1write information to WriteFiles  (was: Pull v1write 
information to write file node)

> Pull v1write information to WriteFiles
> --
>
> Key: SPARK-41708
> URL: https://issues.apache.org/jira/browse/SPARK-41708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> Make WriteFiles hold v1 write information






[jira] [Created] (SPARK-41867) Selective predicate should respect InMemoryRelation

2023-01-03 Thread XiDuo You (Jira)
XiDuo You created SPARK-41867:
-

 Summary: Selective predicate should respect InMemoryRelation
 Key: SPARK-41867
 URL: https://issues.apache.org/jira/browse/SPARK-41867
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: XiDuo You


DDP and Runtime Filter require the build side has a selective predicate. It 
should also work if the filter is inside the cached relation.






[jira] [Updated] (SPARK-41867) Selective predicate should respect InMemoryRelation

2023-01-03 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-41867:
--
Description: DPP and Runtime Filter require the build side has a selective 
predicate. It should also work if the filter is inside the cached relation.  
(was: DDP and Runtime Filter require the build side has a selective predicate. 
It should also work if the filter is inside the cached relation.)

> Selective predicate should respect InMemoryRelation
> ---
>
> Key: SPARK-41867
> URL: https://issues.apache.org/jira/browse/SPARK-41867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Priority: Major
>
> DPP and Runtime Filter require that the build side has a selective 
> predicate. They should also work when the filter is inside the cached relation.





