[jira] [Assigned] (SPARK-37592) Improve performance of JoinSelection

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37592:


Assignee: Apache Spark

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> While reading the AQE implementation, I found that the logic for selecting a 
> join strategy with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37592) Improve performance of JoinSelection

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37592:


Assignee: (was: Apache Spark)

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> While reading the AQE implementation, I found that the logic for selecting a 
> join strategy with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37592) Improve performance of JoinSelection

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456218#comment-17456218
 ] 

Apache Spark commented on SPARK-37592:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34844

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> While reading the AQE implementation, I found that the logic for selecting a 
> join strategy with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37592) Improve performance of JoinSelection

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456219#comment-17456219
 ] 

Apache Spark commented on SPARK-37592:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34844

> Improve performance of JoinSelection
> 
>
> Key: SPARK-37592
> URL: https://issues.apache.org/jira/browse/SPARK-37592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> While reading the AQE implementation, I found that the logic for selecting a 
> join strategy with hints contains a lot of cumbersome code.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456243#comment-17456243
 ] 

Apache Spark commented on SPARK-37577:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34845

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 but fails in 3.2.0; see the stack trace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.una

[jira] [Assigned] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37577:


Assignee: Apache Spark

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Assignee: Apache Spark
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 but fails in 3.2.0; see the stack trace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.optimiz

[jira] [Assigned] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37577:


Assignee: (was: Apache Spark)

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 but fails in 3.2.0; see the stack trace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun

[jira] [Updated] (SPARK-37577) ClassCastException: ArrayType cannot be cast to StructType

2021-12-09 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-37577:

Component/s: SQL
 (was: PySpark)

> ClassCastException: ArrayType cannot be cast to StructType
> --
>
> Key: SPARK-37577
> URL: https://issues.apache.org/jira/browse/SPARK-37577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: Py: 3.9
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Reproduction:
> {code:python}
> import pyspark.sql.functions as F
> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
> t = StructType([StructField('o', ArrayType(StructType([StructField('s',
>StringType(), False), StructField('b',
>ArrayType(StructType([StructField('e', StringType(),
>False)]), True), False)]), True), False)])
> (
> spark.createDataFrame([], schema=t)
> .select(F.explode("o").alias("eo"))
> .select("eo.*")
> .select(F.explode("b"))
> .count()
> )
> {code}
> The code above works fine in 3.1.2 but fails in 3.2.0; see the stack trace 
> below. Note that if you remove field {{s}}, the code works fine, which is a 
> bit unexpected and likely a clue.
> {noformat}
> Py4JJavaError: An error occurred while calling o156.count.
> : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType 
> cannot be cast to class org.apache.spark.sql.types.StructType 
> (org.apache.spark.sql.types.ArrayType and 
> org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.$anonfun$extractFieldName$1(complexTypeExtractors.scala:117)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.extractFieldName(complexTypeExtractors.scala:117)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:372)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1$$anonfun$2.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:539)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:508)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:368)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$$anonfun$1.applyOrElse(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDownWithPruning$1(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDownWithPruning(QueryPlan.scala:152)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsWithPruning(QueryPlan.scala:123)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:101)
>   at 
> org.apache.spark.sql.catalyst.optimizer.GeneratorNestedColumnAliasing$.unapply(NestedColumnAliasing.scala:366)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$a

[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-37593:
---
Summary: Optimize HeapMemoryAllocator to avoid memory waste in humongous 
allocation when using G1GC  (was: Optimize HeapMmeoryAllocator to avoid memory 
waste in humongous allocation when using G1GC)

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we know, G1GC treats any allocation larger than 50% of the region size as 
> a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.
> A Java long array also needs an object header (usually 16 bytes on a 64-bit 
> system), so the actual allocation is pageSize + 16 bytes.
> Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every 
> allocation then needs 4M + 16 bytes, so two regions are used and the second 
> region holds only 16 bytes, wasting roughly 50% of the memory.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37593) Optimize HeapMmeoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread EdisonWang (Jira)
EdisonWang created SPARK-37593:
--

 Summary: Optimize HeapMmeoryAllocator to avoid memory waste in 
humongous allocation when using G1GC
 Key: SPARK-37593
 URL: https://issues.apache.org/jira/browse/SPARK-37593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: EdisonWang
 Fix For: 3.3.0


As we know, G1GC treats any allocation larger than 50% of the region size as a 
humongous allocation.

Spark's Tungsten memory model usually allocates memory one `page` at a time, 
backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.

A Java long array also needs an object header (usually 16 bytes on a 64-bit 
system), so the actual allocation is pageSize + 16 bytes.

Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every allocation 
then needs 4M + 16 bytes, so two regions are used and the second region holds 
only 16 bytes, wasting roughly 50% of the memory.
This can happen under various combinations of G1HeapRegionSize (from 1M to 32M) 
and pageSizeBytes (from 1M to 64M).
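
Below is a minimal, self-contained sketch of the waste arithmetic described 
above. It is illustration only, not Spark code; the object name and the fixed 
16-byte array header are assumptions.

{code:java}
// Estimate how much heap G1 wastes when a Tungsten page is backed by a long[]
// whose size is close to a multiple of the G1 region size (humongous allocation).
object HumongousWasteEstimate {
  def wastedFraction(regionSizeBytes: Long, pageSizeBytes: Long): Double = {
    val arrayHeaderBytes = 16L                        // typical long[] header on a 64-bit JVM
    val requested = pageSizeBytes + arrayHeaderBytes  // what long[pageSizeBytes / 8] really occupies
    // G1 places humongous objects (> 50% of a region) in whole regions.
    val regionsUsed = math.ceil(requested.toDouble / regionSizeBytes).toLong
    val allocated = regionsUsed * regionSizeBytes
    (allocated - requested).toDouble / allocated
  }

  def main(args: Array[String]): Unit = {
    val mb = 1L << 20
    // 4M regions with a 4M page: two regions are used, so roughly 50% is wasted.
    println(f"waste = ${wastedFraction(4 * mb, 4 * mb) * 100}%.1f%%")
  }
}
{code}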



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-09 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456259#comment-17456259
 ] 

Wenchen Fan commented on SPARK-37578:
-

Maybe we can define some built-in metric names like `bytesWritten`, so that 
Spark can recognize them and update the task metrics accordingly.

 

cc [~viirya] 
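
As a rough illustration of that idea, a DSv2 connector could report such a 
metric through the connector metric API in Spark 3.2+ 
(org.apache.spark.sql.connector.metric). This is a hedged sketch only; treating 
"bytesWritten" as a name Spark recognizes and maps to task output metrics is an 
assumption, not an existing contract.

{code:java}
import org.apache.spark.sql.connector.metric.{CustomMetric, CustomTaskMetric}

// Driver-side metric definition; Spark aggregates the per-task values for the SQL UI.
class BytesWrittenMetric extends CustomMetric {
  override def name(): String = "bytesWritten"   // hypothetical built-in name
  override def description(): String = "number of output bytes written"
  override def aggregateTaskMetrics(taskMetrics: Array[Long]): String =
    taskMetrics.sum.toString
}

// Task-side value that a data writer would report as it writes.
class BytesWrittenTaskMetric(bytes: Long) extends CustomTaskMetric {
  override def name(): String = "bytesWritten"
  override def value(): Long = bytes
}
{code}
If the name were recognized as built-in, Spark could also copy the value into 
taskEnd.taskMetrics.outputMetrics, which is what the listener in the repro 
below checks.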

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
> val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
> assert(bytesWritten.sum > 0)
> assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37594) Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable

2021-12-09 Thread angerszhu (Jira)
angerszhu created SPARK-37594:
-

 Summary: Make UT test("SPARK-34399: Add job commit duration 
metrics for DataWritingCommand") more stable
 Key: SPARK-37594
 URL: https://issues.apache.org/jira/browse/SPARK-37594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


The test sometimes fails, e.g.:
```
https://github.com/apache/spark/runs/4378978842
```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456268#comment-17456268
 ] 

Apache Spark commented on SPARK-37593:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34846

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we know, G1GC treats any allocation larger than 50% of the region size as 
> a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.
> A Java long array also needs an object header (usually 16 bytes on a 64-bit 
> system), so the actual allocation is pageSize + 16 bytes.
> Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every 
> allocation then needs 4M + 16 bytes, so two regions are used and the second 
> region holds only 16 bytes, wasting roughly 50% of the memory.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37593:


Assignee: (was: Apache Spark)

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we know, G1GC treats any allocation larger than 50% of the region size as 
> a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.
> A Java long array also needs an object header (usually 16 bytes on a 64-bit 
> system), so the actual allocation is pageSize + 16 bytes.
> Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every 
> allocation then needs 4M + 16 bytes, so two regions are used and the second 
> region holds only 16 bytes, wasting roughly 50% of the memory.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37593:


Assignee: Apache Spark

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we know, G1GC treats any allocation larger than 50% of the region size as 
> a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.
> A Java long array also needs an object header (usually 16 bytes on a 64-bit 
> system), so the actual allocation is pageSize + 16 bytes.
> Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every 
> allocation then needs 4M + 16 bytes, so two regions are used and the second 
> region holds only 16 bytes, wasting roughly 50% of the memory.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456269#comment-17456269
 ] 

Apache Spark commented on SPARK-37593:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34846

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we know, G1GC treats any allocation larger than 50% of the region size as 
> a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate.
> A Java long array also needs an object header (usually 16 bytes on a 64-bit 
> system), so the actual allocation is pageSize + 16 bytes.
> Assume G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Every 
> allocation then needs 4M + 16 bytes, so two regions are used and the second 
> region holds only 16 bytes, wasting roughly 50% of the memory.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37594) Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37594:


Assignee: Apache Spark

> Make UT test("SPARK-34399: Add job commit duration metrics for 
> DataWritingCommand") more stable
> ---
>
> Key: SPARK-37594
> URL: https://issues.apache.org/jira/browse/SPARK-37594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> The test sometimes fails, e.g.:
> ```
> https://github.com/apache/spark/runs/4378978842
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37594) Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37594:


Assignee: (was: Apache Spark)

> Make UT test("SPARK-34399: Add job commit duration metrics for 
> DataWritingCommand") more stable
> ---
>
> Key: SPARK-37594
> URL: https://issues.apache.org/jira/browse/SPARK-37594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The test sometimes fails, e.g.:
> ```
> https://github.com/apache/spark/runs/4378978842
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37594) Make UT test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") more stable

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456271#comment-17456271
 ] 

Apache Spark commented on SPARK-37594:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34847

> Make UT test("SPARK-34399: Add job commit duration metrics for 
> DataWritingCommand") more stable
> ---
>
> Key: SPARK-37594
> URL: https://issues.apache.org/jira/browse/SPARK-37594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The test sometimes fails, e.g.:
> ```
> https://github.com/apache/spark/runs/4378978842
> ```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34399) Add file commit time to metrics and shown in SQL Tab UI

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456272#comment-17456272
 ] 

Apache Spark commented on SPARK-34399:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34847

> Add file commit time to metrics and shown in SQL Tab UI
> ---
>
> Key: SPARK-34399
> URL: https://issues.apache.org/jira/browse/SPARK-34399
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Add file commit time to metrics and shown in SQL Tab UI



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

2021-12-09 Thread wang-zhun (Jira)
wang-zhun created SPARK-37595:
-

 Summary: DatasourceV2 `exists ... select *` column push down
 Key: SPARK-37595
 URL: https://issues.apache.org/jira/browse/SPARK-37595
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 3.2.0, 3.1.2
Reporter: wang-zhun


The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source.

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
```
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []

Expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []`
```

Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

2021-12-09 Thread wang-zhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang-zhun updated SPARK-37595:
--
Description: 
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
{code:java}
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []

The expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []`
{code}
Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`

  was:
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source.

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
```
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []

Expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []`
```

Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`.


> DatasourceV2 `exists ... select *` column push down
> ---
>
> Key: SPARK-37595
> URL: https://issues.apache.org/jira/browse/SPARK-37595
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wang-zhun
>Priority: Major
>
> The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
> select *` does not push the pruned columns down to the data source
> Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
> {code:java}
> test("datasourcev2 exists") {
>     val t1 = s"${catalogAndNamespace}t1"
>     withTable(t1) {
>       sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
>       val t2 = s"${catalogAndNamespace}t2"
>       withTable(t2) {
>         sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
>         val query = sql(s"select * from $t1 where not exists" +
>             s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
>         // scalastyle:off println
>         println(query.executedPlan)
>

[jira] [Updated] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

2021-12-09 Thread wang-zhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang-zhun updated SPARK-37595:
--
Description: 
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source.

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
{code:java}
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []


Expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []` {code}
Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`.

  was:
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
{code:java}
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []

The expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []`
{code}
Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`


> DatasourceV2 `exists ... select *` column push down
> ---
>
> Key: SPARK-37595
> URL: https://issues.apache.org/jira/browse/SPARK-37595
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wang-zhun
>Priority: Major
>
> The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
> select *` does not push the pruned columns down to the data source.
>  
> Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
> {code:java}
> test("datasourcev2 exists") {
>     val t1 = s"${catalogAndNamespace}t1"
>     withTable(t1) {
>       sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
>       val t2 = s"${catalogAndNamespace}t2"
>       withTable(t2) {
>         sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
>         val query = sql(s"select * from $t1 where not exists" +
>             s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
>         // scalastyle:off println
>         println(query.executedPlan)
>       }
>     }
>   }AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, fal

[jira] [Updated] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

2021-12-09 Thread wang-zhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang-zhun updated SPARK-37595:
--
Description: 
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source.

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
{code:java}
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }


AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []


Expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []` {code}
Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`.

  was:
The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
select *` does not push the pruned columns down to the data source.

Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
{code:java}
test("datasourcev2 exists") {
    val t1 = s"${catalogAndNamespace}t1"
    withTable(t1) {
      sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
      val t2 = s"${catalogAndNamespace}t2"
      withTable(t2) {
        sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
        val query = sql(s"select * from $t1 where not exists" +
            s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
        // scalastyle:off println
        println(query.executedPlan)
      }
    }
  }AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []


Expectation is `BatchScan[col1#19] class 
org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
RuntimeFilters: []` {code}
Reason: `Batch("Early Filter and Projection Push-Down")` (V2ScanRelationPushDown) 
is executed before `Batch("RewriteSubquery")`, and datasourceV2 does not go 
through `FileSourceStrategy`.


> DatasourceV2 `exists ... select *` column push down
> ---
>
> Key: SPARK-37595
> URL: https://issues.apache.org/jira/browse/SPARK-37595
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wang-zhun
>Priority: Major
>
> The datasourcev2 table is very slow when executing TPCDS, because `exists ... 
> select *` does not push the pruned columns down to the data source.
>  
> Add a test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
> {code:java}
> test("datasourcev2 exists") {
>     val t1 = s"${catalogAndNamespace}t1"
>     withTable(t1) {
>       sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
>       val t2 = s"${catalogAndNamespace}t2"
>       withTable(t2) {
>         sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
>         val query = sql(s"select * from $t1 where not exists" +
>             s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
>         // scalastyle:off println
>         println(query.executedPlan)
>       }
>     }
>   }
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
>    :- Project [col1#17, col2#18]
>    :  +- BatchScan[col1#17, col2#18] class 
> org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
> RuntimeFilters: []
>    +- BroadcastExchange HashedRelat

[jira] [Updated] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

2021-12-09 Thread wang-zhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang-zhun updated SPARK-37595:
--
Issue Type: Improvement  (was: Wish)

> DatasourceV2 `exists ... select *` column push down
> ---
>
> Key: SPARK-37595
> URL: https://issues.apache.org/jira/browse/SPARK-37595
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wang-zhun
>Priority: Major
>
> DataSourceV2 tables are very slow when running TPC-DS, because `exists ... 
> select *` does not push the pruned columns down to the data source.
>  
> Add test in `org.apache.spark.sql.connector.DataSourceV2SQLSuite`
> {code:java}
> test("datasourcev2 exists") {
>     val t1 = s"${catalogAndNamespace}t1"
>     withTable(t1) {
>       sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
>       val t2 = s"${catalogAndNamespace}t2"
>       withTable(t2) {
>         sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
>         val query = sql(s"select * from $t1 where not exists" +
>             s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
>         // scalastyle:off println
>         println(query.executedPlan)
>       }
>     }
>   }
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
>    :- Project [col1#17, col2#18]
>    :  +- BatchScan[col1#17, col2#18] class 
> org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
> RuntimeFilters: []
>    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true]),false), [id=#28]
>       +- Project [col1#19]
>          +- BatchScan[col1#19, col2#20] class 
> org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
> RuntimeFilters: []
> Expectation is `BatchScan[col1#19] class 
> org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan 
> RuntimeFilters: []` {code}
> Reason `Batch("Early Filter and Projection Push-Down" V2ScanRelationPushDown` 
> is executed before `Batch("RewriteSubquery"`, parallel datasourceV2 does not 
> support `FileSourceStrategy`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37582) Support the binary type by contains()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456339#comment-17456339
 ] 

Apache Spark commented on SPARK-37582:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34848

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 
> accepts the string type only. The ticket aims to support the binary type as 
> well.
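>
> A sketch of the intended usage once the binary type is supported (hypothetical
> behaviour; the semantics are assumed to mirror the string version, i.e.
> byte-sequence containment):
> {code:java}
> // Hypothetical once binary support lands. X'...' is Spark SQL's hex binary literal;
> // X'537061726B' is the bytes of "Spark", X'61726B' the bytes of "ark".
> spark.sql("SELECT contains(X'537061726B', X'61726B')").show()
> // expected (assumed): true -- the bytes of "ark" occur inside the bytes of "Spark"
> {code}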



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37582) Support the binary type by contains()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37582:


Assignee: (was: Apache Spark)

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 
> accepts the string type only. The ticket aims to support the binary type as 
> well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37582) Support the binary type by contains()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37582:


Assignee: Apache Spark

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 
> accepts the string type only. The ticket aims to support the binary type as 
> well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37582) Support the binary type by contains()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456340#comment-17456340
 ] 

Apache Spark commented on SPARK-37582:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34848

> Support the binary type by contains()
> -
>
> Key: SPARK-37582
> URL: https://issues.apache.org/jira/browse/SPARK-37582
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The new function exposed by https://github.com/apache/spark/pull/34761 
> accepts the string type only. The ticket aims to support the binary type as 
> well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37583:


Assignee: Apache Spark

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782 but 
> they accept the string type only. To make migration from other systems 
> easier, it would be nice to support the binary type too.
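>
> A sketch of the intended usage once the binary type is supported (hypothetical
> behaviour; the semantics are assumed to mirror the string versions):
> {code:java}
> // Hypothetical once binary support lands: prefix/suffix checks over BINARY.
> // X'...' is Spark SQL's hex binary literal; X'537061726B' is the bytes of "Spark".
> spark.sql("SELECT startswith(X'537061726B', X'5370')").show()   // expected (assumed): true
> spark.sql("SELECT endswith(X'537061726B', X'726B')").show()     // expected (assumed): true
> {code}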



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456342#comment-17456342
 ] 

Apache Spark commented on SPARK-37583:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34848

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782 but 
> they accept the string type only. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456343#comment-17456343
 ] 

Apache Spark commented on SPARK-37583:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34848

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782 but 
> they accept the string type only. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456344#comment-17456344
 ] 

Apache Spark commented on SPARK-37583:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34848

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782 but 
> they accept the string type only. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37581) sql hang at planning stage

2021-12-09 Thread ocean (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ocean updated SPARK-37581:
--
Description: 
When executing a SQL query, it hangs at the planning stage.

When DPP is disabled, the query finishes very quickly.

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

 

  was:
When executing a SQL query, it hangs at the planning stage.

When DPP is disabled, the query finishes very quickly.

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

create table if not exists test.test_c stored as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,
j_kbs,k_kbs,l_kbs,m_kbs,n_kbs,o_kbs,p_kbs,q_kbs,r_kbs,s_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t2
on calendar.dt = t2.dt

left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t4
on calendar.dt = t4.dt
left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t5
on calendar.dt = t5.dt
left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
and app_id in ('1','2')
and substr(kb_code,1,6) = '66'
and pageid_merge = 'a'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211

[jira] [Updated] (SPARK-37581) sql hang at planning stage

2021-12-09 Thread ocean (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ocean updated SPARK-37581:
--
Description: 
When executing a SQL query, it hangs at the planning stage.

When DPP is disabled, the query finishes very quickly.

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

 

 

drop table if exists test.test_c;create table if not exists test.test_c stored 
as ORCFILE as
select calendar.day,calendar.week,calendar.weekday, a_kbs,
b_kbs, c_kbs,d_kbs,e_kbs,f_kbs,g_kbs,h_kbs,i_kbs,j_kbs,k_kbs
from (select * from test.test_a where dt = '20211126') calendar
left join
(select dt,count(distinct kb_code) as a_kbs
from test.test_b
where dt = '20211126'
group by dt) t1
on calendar.dt = t1.dt

left join
(select dt,count(distinct kb_code) as b_kbs
from test.test_b
where dt = '20211126'
group by dt) t2
on calendar.dt = t2.dt

left join
(select dt,count(distinct kb_code) as c_kbs
from test.test_b
where dt = '20211126'
group by dt) t3
on calendar.dt = t3.dt

left join
(select dt,count(distinct kb_code) as d_kbs
from test.test_b
where dt = '20211126'
group by dt) t4
on calendar.dt = t4.dt

left join
(select dt,count(distinct kb_code) as e_kbs
from test.test_b
where dt = '20211126'
group by dt) t5
on calendar.dt = t5.dt

left join
(select dt,count(distinct kb_code) as f_kbs
from test.test_b
where dt = '20211126'
group by dt) t6
on calendar.dt = t6.dt

left join
(select dt,count(distinct kb_code) as g_kbs
from test.test_b
where dt = '20211126'
group by dt) t7
on calendar.dt = t7.dt

left join
(select dt,count(distinct kb_code) as h_kbs
from test.test_b
where dt = '20211126'
group by dt) t8
on calendar.dt = t8.dt

left join
(select dt,count(distinct kb_code) as i_kbs
from test.test_b
where dt = '20211126'
group by dt) t9
on calendar.dt = t9.dt

left join
(select dt,count(distinct kb_code) as j_kbs
from test.test_b
where dt = '20211126'
group by dt) t10
on calendar.dt = t10.dt

left join
(select dt,count(distinct kb_code) as k_kbs
from test.test_b
where dt = '20211126'
group by dt) t11
on calendar.dt = t11.dt
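
As a stop-gap, a sketch of the workaround implied above (assuming the planning time
really is spent in dynamic partition pruning; `ctas` below is a placeholder for the
full "create table if not exists test.test_c ..." statement):
{code:java}
// Disable dynamic partition pruning for this job so the CTAS can be planned,
// then restore the default afterwards.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
spark.sql(ctas)   // ctas = the "create table if not exists test.test_c ..." text above
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}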

 

  was:
When executing a SQL query, it hangs at the planning stage.

When DPP is disabled, the query finishes very quickly.

We can reproduce this problem with the example below:

create table test.test_a (
day string,
week int,
weekday int)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_a partition (dt=20211126) values('1',1,2);

create table test.test_b (
session_id string,
device_id string,
brand string,
model string,
wx_version string,
os string,
net_work_type string,
app_id string,
app_name string,
col_z string,
page_url string,
page_title string,
olabel string,
otitle string,
source string,
send_dt string,
recv_dt string,
request_time string,
write_time string,
client_ip string,
col_a string,
dt_hour varchar(12),
product string,
channelfrom string,
customer_um string,
kb_code string,
col_b string,
rectype string,
errcode string,
col_c string,
pageid_merge string)
partitioned by (
dt varchar(8))
stored as orc;

insert into test.test_b partition(dt=20211126)
values('2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2','2');

 


> sql hang at planning stage
> --
>
> Key: SPARK-37581
> URL: https://issues.apache.org/jira/browse/SPARK-37581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: ocean
>Priority: Major
>
> When executing a SQL query, it hangs at the planning stage.
> When DPP is disabled, the query finishes very quickly.
> We can reproduce this problem with the example below:
> create table test.test_a (
> day string,
> week int,
> weekday int)
> partitioned by (
> dt varchar(8))
> stored as orc;
> insert into test.test_a partition (dt=20211126) values('1',1,2);
> create table test.test_b (
> session_id string,
> device_id string,
> brand string,
> model string,
> wx_version string,

[jira] [Assigned] (SPARK-37583) Support the binary type by startswith() and endswith()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37583:


Assignee: (was: Apache Spark)

> Support the binary type by startswith() and endswith()
> --
>
> Key: SPARK-37583
> URL: https://issues.apache.org/jira/browse/SPARK-37583
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> New functions were exposed by https://github.com/apache/spark/pull/34782 but 
> they accept the string type only. To make migration from other systems 
> easier, it would be nice to support the binary type too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37596) Add the support for struct type column in the DropDuplicate in spark

2021-12-09 Thread Saurabh Chawla (Jira)
Saurabh Chawla created SPARK-37596:
--

 Summary: Add the support for struct type column in the 
DropDuplicate in spark
 Key: SPARK-37596
 URL: https://issues.apache.org/jira/browse/SPARK-37596
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Saurabh Chawla


Add support for struct type columns in dropDuplicates in Spark.


Currently, using a struct column field in dropDuplicates results in the exception 
below:

case class StructDropDup(c1: Int, c2: Int)

val df = Seq(("d1", StructDropDup(1, 2)),
      ("d1", StructDropDup(1, 2))).toDF("a", "b")

df.dropDuplicates("b.c1")
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "b.c1" among 
(a, b)
  at org.apache.spark.sql.Dataset.$anonfun$dropDuplicates$1(Dataset.scala:2576)
  at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429){code}
As a workaround, to drop the duplicates using the struct column field:

df.withColumn("b.c1", col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
{code:java}
df.withColumn("b.c1", col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
res25: Array[org.apache.spark.sql.Row] = Array([d1,[1,2]]){code}
There is a need to add this support to dropDuplicates.
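
For completeness, a self-contained, spark-shell-style version of the workaround
above (a sketch; the temporary column name `__b_c1` is arbitrary, whereas the
original workaround reuses the dotted name directly):
{code:java}
import org.apache.spark.sql.functions.col

case class StructDropDup(c1: Int, c2: Int)
val df = Seq(("d1", StructDropDup(1, 2)), ("d1", StructDropDup(1, 2))).toDF("a", "b")

// Materialize the nested field under a temporary top-level name, dedupe on it,
// and drop the helper column again.
val deduped = df
  .withColumn("__b_c1", col("b.c1"))
  .dropDuplicates("__b_c1")
  .drop("__b_c1")
deduped.show()   // expected: a single row [d1, [1, 2]]
{code}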



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37596) Add the support for struct type column in the DropDuplicate in spark

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37596:


Assignee: Apache Spark

> Add the support for struct type column in the DropDuplicate in spark
> 
>
> Key: SPARK-37596
> URL: https://issues.apache.org/jira/browse/SPARK-37596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Saurabh Chawla
>Assignee: Apache Spark
>Priority: Major
>
> Add support for struct type columns in dropDuplicates in Spark.
> Currently, using a struct column field in dropDuplicates results in the exception 
> below:
> case class StructDropDup(c1: Int, c2: Int)
> val df = Seq(("d1", StructDropDup(1, 2)),
>       ("d1", StructDropDup(1, 2))).toDF("a", "b")
> df.dropDuplicates("b.c1")
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "b.c1" 
> among (a, b)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$dropDuplicates$1(Dataset.scala:2576)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429){code}
> As a workaround, to drop the duplicates using the struct column field:
> df.withColumn("b.c1", 
> col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> {code:java}
> df.withColumn("b.c1", col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> res25: Array[org.apache.spark.sql.Row] = Array([d1,[1,2]]){code}
> There is a need to add this support to dropDuplicates.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37596) Add the support for struct type column in the DropDuplicate in spark

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456351#comment-17456351
 ] 

Apache Spark commented on SPARK-37596:
--

User 'SaurabhChawla100' has created a pull request for this issue:
https://github.com/apache/spark/pull/34849

> Add the support for struct type column in the DropDuplicate in spark
> 
>
> Key: SPARK-37596
> URL: https://issues.apache.org/jira/browse/SPARK-37596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Saurabh Chawla
>Priority: Major
>
> Add support for struct type columns in dropDuplicates in Spark.
> Currently, using a struct column field in dropDuplicates results in the exception 
> below:
> case class StructDropDup(c1: Int, c2: Int)
> val df = Seq(("d1", StructDropDup(1, 2)),
>       ("d1", StructDropDup(1, 2))).toDF("a", "b")
> df.dropDuplicates("b.c1")
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "b.c1" 
> among (a, b)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$dropDuplicates$1(Dataset.scala:2576)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429){code}
> As a workaround, to drop the duplicates using the struct column field:
> df.withColumn("b.c1", 
> col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> {code:java}
> df.withColumn("b.c1", col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> res25: Array[org.apache.spark.sql.Row] = Array([d1,[1,2]]){code}
> There is a need to add this support to dropDuplicates.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37596) Add the support for struct type column in the DropDuplicate in spark

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37596:


Assignee: (was: Apache Spark)

> Add the support for struct type column in the DropDuplicate in spark
> 
>
> Key: SPARK-37596
> URL: https://issues.apache.org/jira/browse/SPARK-37596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Saurabh Chawla
>Priority: Major
>
> Add support for struct type columns in dropDuplicates in Spark.
> Currently, using a struct column field in dropDuplicates results in the exception 
> below:
> case class StructDropDup(c1: Int, c2: Int)
> val df = Seq(("d1", StructDropDup(1, 2)),
>       ("d1", StructDropDup(1, 2))).toDF("a", "b")
> df.dropDuplicates("b.c1")
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "b.c1" 
> among (a, b)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$dropDuplicates$1(Dataset.scala:2576)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429){code}
> As a workaround, to drop the duplicates using the struct column field:
> df.withColumn("b.c1", 
> col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> {code:java}
> df.withColumn("b.c1", col("b.c1")).dropDuplicates("b.c1").drop("b.c1").collect
> res25: Array[org.apache.spark.sql.Row] = Array([d1,[1,2]]){code}
> There is a need to add this support to dropDuplicates.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37597) Deduplicate the right side of left-semi join and left-anti join

2021-12-09 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-37597:


 Summary: Deduplicate the right side of left-semi join and 
left-anti join
 Key: SPARK-37597
 URL: https://issues.apache.org/jira/browse/SPARK-37597
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37597) Deduplicate the right side of left-semi join and left-anti join

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37597:


Assignee: (was: Apache Spark)

> Deduplicate the right side of left-semi join and left-anti join
> ---
>
> Key: SPARK-37597
> URL: https://issues.apache.org/jira/browse/SPARK-37597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37597) Deduplicate the right side of left-semi join and left-anti join

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37597:


Assignee: Apache Spark

> Deduplicate the right side of left-semi join and left-anti join
> ---
>
> Key: SPARK-37597
> URL: https://issues.apache.org/jira/browse/SPARK-37597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37597) Deduplicate the right side of left-semi join and left-anti join

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456423#comment-17456423
 ] 

Apache Spark commented on SPARK-37597:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/34850

> Deduplicate the right side of left-semi join and left-anti join
> ---
>
> Key: SPARK-37597
> URL: https://issues.apache.org/jira/browse/SPARK-37597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456431#comment-17456431
 ] 

Apache Spark commented on SPARK-37451:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34851

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Priority: Blocker
>  Labels: correctness
>
> A performance improvement to how Spark casts Strings to Decimal in 
> [SPARK-32706|https://issues.apache.org/jira/browse/SPARK-32706] has introduced a 
> regression:
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-09 Thread ramakrishna chilaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramakrishna chilaka updated SPARK-37588:

Attachment: screenshot-1.png

> lot of strings get accumulated in the heap dump of spark thrift server
> --
>
> Key: SPARK-37588
> URL: https://issues.apache.org/jira/browse/SPARK-37588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
> Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12
> OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8
>Reporter: ramakrishna chilaka
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I am starting spark thrift server using the following options
> ```
> /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
> "spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
> "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
> spark.sql.adaptive.coalescePartitions.enabled=true --conf 
> spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
> --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
> --conf "spark.driver.maxResultSize=4G" --conf 
> "spark.max.fetch.failures.per.stage=10" --conf 
> "spark.sql.thriftServer.incrementalCollect=false" --conf 
> "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" 
> --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
> spark.sql.thriftServer.interruptOnCancel=true --conf 
> spark.sql.thriftServer.queryTimeout=0 --hiveconf 
> hive.server2.transport.mode=http --hiveconf 
> hive.server2.thrift.http.path=spark_sql --hiveconf 
> hive.server2.thrift.min.worker.threads=500 --hiveconf 
> hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
> hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
> hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
> hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
> --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
> hive.server2.thrift.bind.host=0.0.0.0 --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
> "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
> "spark.worker.cleanup.enabled=true" --conf 
> "spark.worker.cleanup.appDataTtl=3600" --hiveconf 
> hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
> hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
> hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
> --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
> --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
> -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
> -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
> -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
> "hive.server2.session.check.interval=6" --hiveconf 
> "hive.server2.idle.session.timeout=90" --hiveconf 
> "hive.server2.idle.session.check.operation=true" --conf 
> "spark.eventLog.enabled=false" --conf 
> "spark.cleaner.periodicGC.interval=5min" --conf 
> "spark.appStateStore.asyncTracking.enable=false" --conf 
> "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
> "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" 
> --conf "spark.ui.retainedDeadExecutors=10" --conf 
> "spark.worker.ui.retainedExecutors=10" --conf 
> "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
> spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
> --conf "spark.io.compression.codec=snappy" --conf 
> "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true 
> --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" 
> --conf "spark.memory.storageFraction=0.75"
> ```
> the java heap dump after heavy usage is as follows
> ```
>  1:  50465861 9745837152  [C
>2:  23337896 1924089944  [Ljava.lang.Object;
>3:  72524905 1740597720  java.lang.Long
>4:  50463694 1614838208  java.lang.String
>5:  22718029  726976928  
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>6:   2259416  343483328  [Lscala.collection.mutable.HashEntry;
>7:16  141744616  [Lorg.apache.spark.sql.Row;
>8:532529  123546728  
> org.apache.spark.sql.catalyst

[jira] [Updated] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-09 Thread ramakrishna chilaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramakrishna chilaka updated SPARK-37588:

Description: 
I am starting spark thrift server using the following options

```

/data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
"spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
"spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
spark.sql.adaptive.coalescePartitions.enabled=true --conf 
spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
--conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
--conf "spark.driver.maxResultSize=4G" --conf 
"spark.max.fetch.failures.per.stage=10" --conf 
"spark.sql.thriftServer.incrementalCollect=false" --conf 
"spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" --conf 
"spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
spark.sql.thriftServer.interruptOnCancel=true --conf 
spark.sql.thriftServer.queryTimeout=0 --hiveconf 
hive.server2.transport.mode=http --hiveconf 
hive.server2.thrift.http.path=spark_sql --hiveconf 
hive.server2.thrift.min.worker.threads=500 --hiveconf 
hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
--hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
hive.server2.thrift.bind.host=0.0.0.0 --conf 
"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
"spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
"spark.worker.cleanup.enabled=true" --conf 
"spark.worker.cleanup.appDataTtl=3600" --hiveconf 
hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
--hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
--conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
-Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
-XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
-XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
"hive.server2.session.check.interval=6" --hiveconf 
"hive.server2.idle.session.timeout=90" --hiveconf 
"hive.server2.idle.session.check.operation=true" --conf 
"spark.eventLog.enabled=false" --conf "spark.cleaner.periodicGC.interval=5min" 
--conf "spark.appStateStore.asyncTracking.enable=false" --conf 
"spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
"spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" --conf 
"spark.ui.retainedDeadExecutors=10" --conf 
"spark.worker.ui.retainedExecutors=10" --conf 
"spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
--conf "spark.io.compression.codec=snappy" --conf 
"spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true --conf 
"spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" --conf 
"spark.memory.storageFraction=0.75"

```

the java heap dump after heavy usage is as follows
```
 1:  50465861 9745837152  [C
   2:  23337896 1924089944  [Ljava.lang.Object;
   3:  72524905 1740597720  java.lang.Long
   4:  50463694 1614838208  java.lang.String
   5:  22718029  726976928  
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
   6:   2259416  343483328  [Lscala.collection.mutable.HashEntry;
   7:16  141744616  [Lorg.apache.spark.sql.Row;
   8:532529  123546728  
org.apache.spark.sql.catalyst.expressions.Cast
   9:535418   72816848  
org.apache.spark.sql.catalyst.expressions.Literal
  10:   1105284   70738176  scala.collection.mutable.LinkedHashSet
  11:   1725833   70655016  [J
  12:   1154128   55398144  scala.collection.mutable.HashMap
  13:   1720740   55063680  org.apache.spark.util.collection.BitSet
  14:57   50355536  scala.collection.immutable.Vector
  15:   1602297   38455128  scala.Some
  16:   1154303   36937696  scala.collection.immutable.$colon$colon
  17:   1105284   26526816  
org.apache.spark.sql.catalyst.expressions.AttributeSet
  18:   1066442   25594608  java.lang.Integer
  19:735502   23536064  scala.collection.immutable.HashSet$H

[jira] [Resolved] (SPARK-37584) New SQL function: map_contains_key

2021-12-09 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37584.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34836
[https://github.com/apache/spark/pull/34836]

> New SQL function: map_contains_key
> --
>
> Key: SPARK-37584
> URL: https://issues.apache.org/jira/browse/SPARK-37584
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Add a new function map_contains_key, which returns true if the map contains 
> the key
> Examples:
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
> true
> > SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
> false
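>
> The same checks from Scala, as a sketch (assuming Spark 3.3.0 or later, where the
> function is available):
> {code:java}
> spark.sql("SELECT map_contains_key(map(1, 'a', 2, 'b'), 1)").show()   // true
> spark.sql("SELECT map_contains_key(map(1, 'a', 2, 'b'), 3)").show()   // false
> {code}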



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28955) Support for LocalDateTime semantics

2021-12-09 Thread Bill Schneider (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456539#comment-17456539
 ] 

Bill Schneider commented on SPARK-28955:


I see a bunch of related tickets with fix version 3.3.0.

Is this effectively going to be addressed by SPARK-35720?

The goal is that if you see "2022-01-01 00:00" in a text file, it gets parsed as 
"2022-01-01 00:00" regardless of the system timezone and doesn't accidentally end 
up as "2021-12-31 17:00" for downstream consumers (e.g., Athena) if your Spark job 
happened to run in EDT.
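
A sketch of what that would look like with the TIMESTAMP_NTZ work referenced above
(assuming the type and literal syntax ship in 3.3.0 as proposed; names may still
change, and the output path is hypothetical):
{code:java}
// Wall-clock timestamp with no timezone adjustment (sketch).
val df = spark.sql("SELECT TIMESTAMP_NTZ '2022-01-01 00:00:00' AS ts")
df.printSchema()               // ts: timestamp_ntz
df.write.parquet("/tmp/ntz")   // expected (assumed): stored with isAdjustedToUTC=false
{code}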

> Support for LocalDateTime semantics
> ---
>
> Key: SPARK-28955
> URL: https://issues.apache.org/jira/browse/SPARK-28955
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bill Schneider
>Priority: Major
>
> It would be great if Spark supported local times in DataFrames, rather than 
> only instants. 
> The specific use case I have in mind is something like
>  * parse "2019-01-01 17:00" (no timezone) from CSV -> LocalDateTime in 
> dataframe
>  * save to Parquet: LocalDateTime is stored with same integer value as 
> 2019-01-01 17:00 UTC, but with isAdjustedToUTC=false.  (Currently Spark saves 
> either INT96 or TIME_MILLIS/TIME_MICROS which has isAdjustedToUTC=true)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-09 Thread Guo Wei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456167#comment-17456167
 ] 

Guo Wei edited comment on SPARK-37575 at 12/9/21, 4:37 PM:
---

I also found that if emptyValueInRead is set to "\"\"", when reading the CSV data 
shown below:
{noformat}
name,brand,comment
tesla,,""{noformat}
The final result shows as follows:
||name||brand||comment||
|tesla|null|""|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |
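
For reference, a sketch of how the read side can be reproduced (the public CSV
option `emptyValue` maps to emptyValueInRead; the file path is hypothetical and its
contents are the two lines above):
{code:java}
// Reproduction sketch: /tmp/cars.csv contains
//   name,brand,comment
//   tesla,,""
val df = spark.read
  .option("header", "true")
  .option("emptyValue", "\"\"")
  .csv("/tmp/cars.csv")
df.show()   // observed: comment comes back as the literal string "" rather than an empty string
{code}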


was (Author: wayne guo):
I also found that if emptyValueInRead is set to "\"\"", when reading the CSV data 
shown below:
{noformat}
name,brand,comment
tesla,,""{noformat}
The final result shows as follows:
||name||brand||comment||
|tesla|null|null|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37451) Performance improvement regressed String to Decimal cast

2021-12-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37451.
---
Fix Version/s: 3.1.3
   3.2.1
 Assignee: Yuming Wang
   Resolution: Fixed

> Performance improvement regressed String to Decimal cast
> 
>
> Key: SPARK-37451
> URL: https://issues.apache.org/jira/browse/SPARK-37451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1
>Reporter: Raza Jafri
>Assignee: Yuming Wang
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.1.3, 3.2.1
>
>
> A performance improvement to how Spark casts strings to Decimal 
> ([SPARK-32706|https://issues.apache.org/jira/browse/SPARK-32706]) has introduced 
> a regression:
> {noformat}
> scala> :paste 
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |                   a|
> ++
> |7.836725755512218...|
> ++
> scala> spark.version
> res2: String = 3.0.1
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)
> spark.conf.set("spark.rapids.sql.castStringToDecimal.enabled", true)
> spark.conf.set("spark.rapids.sql.castDecimalToString.enabled", true)
> val data = Seq(Row("7.836725755512218E38"))
> val schema=StructType(Array(StructField("a", StringType, false)))
> val df =spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.select(col("a").cast(DecimalType(37,-17))).show
> // Exiting paste mode, now interpreting.
> ++
> |   a|
> ++
> |null|
> ++
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> data: Seq[org.apache.spark.sql.Row] = List([7.836725755512218E38])
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(a,StringType,false))
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> spark.version
> res1: String = 3.1.1
> {noformat}
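
A trimmed-down version of the same check (the spark.rapids.* settings in the 
report come from a third-party plugin and are presumably not required; treat 
this sketch as an assumption to verify, not a confirmed minimal repro):

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true)

// On 3.0.1 this shows the decimal value; on affected versions it shows null.
Seq("7.836725755512218E38").toDF("a")
  .select(col("a").cast(DecimalType(37, -17)))
  .show()
{code}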



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37591:


Assignee: Max Gekk  (was: Apache Spark)

> Support the GCM mode by aes_encrypt()/aes_decrypt()
> ---
>
> Key: SPARK-37591
> URL: https://issues.apache.org/jira/browse/SPARK-37591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the GCM mode in AES - aes_encrypt() and aes_decrypt()
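
For reference, a sketch of how the functions could be exercised end to end once 
a mode argument is accepted (the 'GCM' literal and the 16-character key below 
are assumptions for illustration based on this ticket, not documented behavior 
at the time of writing):

{code:java}
// Round-trip a string through AES-GCM via SQL; aes_decrypt returns binary, so cast it back.
spark.sql(
  """SELECT CAST(
    |         aes_decrypt(aes_encrypt('Spark SQL', '1234567890abcdef', 'GCM'),
    |                     '1234567890abcdef',
    |                     'GCM')
    |       AS STRING) AS roundtrip""".stripMargin
).show(truncate = false)
{code}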



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456587#comment-17456587
 ] 

Apache Spark commented on SPARK-37591:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34852

> Support the GCM mode by aes_encrypt()/aes_decrypt()
> ---
>
> Key: SPARK-37591
> URL: https://issues.apache.org/jira/browse/SPARK-37591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the GCM mode in AES - aes_encrypt() and aes_decrypt()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37591) Support the GCM mode by aes_encrypt()/aes_decrypt()

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37591:


Assignee: Apache Spark  (was: Max Gekk)

> Support the GCM mode by aes_encrypt()/aes_decrypt()
> ---
>
> Key: SPARK-37591
> URL: https://issues.apache.org/jira/browse/SPARK-37591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Implement the GCM mode in AES - aes_encrypt() and aes_decrypt()



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37578) DSV2 is not updating Output Metrics

2021-12-09 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456589#comment-17456589
 ] 

L. C. Hsieh commented on SPARK-37578:
-

Thanks [~cloud_fan]. I will take a look.

> DSV2 is not updating Output Metrics
> ---
>
> Key: SPARK-37578
> URL: https://issues.apache.org/jira/browse/SPARK-37578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2
>Reporter: Sandeep Katta
>Priority: Major
>
> Repro code
> ./bin/spark-shell --master local  --jars 
> /Users/jars/iceberg-spark3-runtime-0.12.1.jar
>  
> {code:java}
> import scala.collection.mutable
> import org.apache.spark.scheduler._
> val bytesWritten = new mutable.ArrayBuffer[Long]()
> val recordsWritten = new mutable.ArrayBuffer[Long]()
> val bytesWrittenListener = new SparkListener() {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     bytesWritten += taskEnd.taskMetrics.outputMetrics.bytesWritten
>     recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
>   }
> }
> spark.sparkContext.addSparkListener(bytesWrittenListener)
> try {
>   val df = spark.range(1000).toDF("id")
>   df.write.format("iceberg").save("Users/data/dsv2_test")
>   
>   assert(bytesWritten.sum > 0)
>   assert(recordsWritten.sum > 0)
> } finally {
>   spark.sparkContext.removeSparkListener(bytesWrittenListener)
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37589.
---
  Assignee: (was: Apache Spark)
Resolution: Won't Do

This is closed according to the PR.

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-37589) Improve the performance of PlanParserSuite

2021-12-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-37589.
-

> Improve the performance of PlanParserSuite
> --
>
> Key: SPARK-37589
> URL: https://issues.apache.org/jira/browse/SPARK-37589
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Reduce the time spent on executing PlanParserSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables

2021-12-09 Thread Keith Massey (Jira)
Keith Massey created SPARK-37598:


 Summary: Pyspark's newAPIHadoopRDD() method fails with 
ShortWritables
 Key: SPARK-37598
 URL: https://issues.apache.org/jira/browse/SPARK-37598
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0, 3.1.2, 3.0.3, 2.4.8
Reporter: Keith Massey


If sc.newAPIHadoopRDD() is called from PySpark using an InputFormat that has a 
ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The reason 
is that ShortWritable is not explicitly handled by PythonHadoopUtil the way 
that other numeric writables (like LongWritable) are. The result is that the 
ShortWritable is not converted to an object that can be serialized by Spark, 
and a serialization error occurs. Below is an example stack trace from within 
the pyspark shell:

{code:java}
>>> rdd = 
>>> sc.newAPIHadoopRDD(inputFormatClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].EsInputFormat";,
... 
keyClass="[org.apache.hadoop.io|http://org.apache.hadoop.io/].NullWritable";,
... 
valueClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].LinkedMapWritable";,
... conf=conf)
2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage 15.0 
(TID 31) had a not serializable result: org.apache.hadoop.io.ShortWritable
Serialization stack:
- object not serializable (class: 
[org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
- writeObject data (class: java.util.HashMap)
- object (class java.util.HashMap, \{price=1})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (1,\{price=1}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1); not retrying
Traceback (most recent call last):
 File "", line 4, in 
 File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", line 
853, in newAPIHadoopRDD
  jconf, batchSize)
 File 
"/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__
 File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", 
line 111, in deco
  return f(*a, **kw)
 File 
"/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 
in stage 15.0 (TID 31) had a not serializable result: 
org.apache.hadoop.io.ShortWritable
Serialization stack:
- object not serializable (class: 
[org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
- writeObject data (class: java.util.HashMap)
- object (class java.util.HashMap, \{price=1})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (1,\{price=1}))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1)
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.
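
For illustration, the kind of mapping that is missing: a simplified sketch of a 
Writable-to-JVM-value conversion that treats ShortWritable the same way as the 
other numeric writables (the method name and set of cases here are simplified 
assumptions, not the actual PythonHadoopUtil code):

{code:java}
import org.apache.hadoop.io.{IntWritable, LongWritable, NullWritable, ShortWritable, Text, Writable}

// Sketch: unwrap common Hadoop Writables into plain JVM values that Spark can serialize.
def convertWritable(writable: Writable): Any = writable match {
  case iw: IntWritable   => iw.get()
  case lw: LongWritable  => lw.get()
  case sw: ShortWritable => sw.get()   // the case this ticket is about
  case t: Text           => t.toString
  case _: NullWritable   => null
  case other             => other      // anything unhandled stays a Writable and may not serialize
}
{code}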

[jira] [Assigned] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37598:


Assignee: (was: Apache Spark)

> Pyspark's newAPIHadoopRDD() method fails with ShortWritables
> 
>
> Key: SPARK-37598
> URL: https://issues.apache.org/jira/browse/SPARK-37598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Keith Massey
>Priority: Minor
>
> If sc.newAPIHadoopRDD() is called from PySpark using an InputFormat that has 
> a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The 
> reason is that ShortWritable is not explicitly handled by PythonHadoopUtil 
> the way that other numeric writables (like LongWritable) are. The result is 
> that the ShortWritable is not converted to an object that can be serialized 
> by Spark, and a serialization error occurs. Below is an example stack trace 
> from within the pyspark shell:
> {code:java}
> >>> rdd = 
> >>> sc.newAPIHadoopRDD(inputFormatClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].EsInputFormat";,
> ... 
> keyClass="[org.apache.hadoop.io|http://org.apache.hadoop.io/].NullWritable";,
> ... 
> valueClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].LinkedMapWritable";,
> ... conf=conf)
> 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage 
> 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1); not retrying
> Traceback (most recent call last):
>  File "", line 4, in 
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", 
> line 853, in newAPIHadoopRDD
>   jconf, batchSize)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", 
> line 111, in deco
>   return f(*a, **kw)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 
> in stage 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1)
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
> at org.apache.spark.S

[jira] [Commented] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456622#comment-17456622
 ] 

Apache Spark commented on SPARK-37598:
--

User 'masseyke' has created a pull request for this issue:
https://github.com/apache/spark/pull/34838

> Pyspark's newAPIHadoopRDD() method fails with ShortWritables
> 
>
> Key: SPARK-37598
> URL: https://issues.apache.org/jira/browse/SPARK-37598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Keith Massey
>Priority: Minor
>
> If sc.newAPIHadoopRDD() is called from PySpark using an InputFormat that has 
> a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The 
> reason is that ShortWritable is not explicitly handled by PythonHadoopUtil 
> the way that other numeric writables (like LongWritable) are. The result is 
> that the ShortWritable is not converted to an object that can be serialized 
> by Spark, and a serialization error occurs. Below is an example stack trace 
> from within the pyspark shell:
> {code:java}
> >>> rdd = 
> >>> sc.newAPIHadoopRDD(inputFormatClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].EsInputFormat";,
> ... 
> keyClass="[org.apache.hadoop.io|http://org.apache.hadoop.io/].NullWritable";,
> ... 
> valueClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].LinkedMapWritable";,
> ... conf=conf)
> 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage 
> 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1); not retrying
> Traceback (most recent call last):
>  File "", line 4, in 
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", 
> line 853, in newAPIHadoopRDD
>   jconf, batchSize)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", 
> line 111, in deco
>   return f(*a, **kw)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 
> in stage 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1)
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob

[jira] [Assigned] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37598:


Assignee: Apache Spark

> Pyspark's newAPIHadoopRDD() method fails with ShortWritables
> 
>
> Key: SPARK-37598
> URL: https://issues.apache.org/jira/browse/SPARK-37598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Keith Massey
>Assignee: Apache Spark
>Priority: Minor
>
> If sc.newAPIHadoopRDD() is called from PySpark using an InputFormat that has 
> a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The 
> reason is that ShortWritable is not explicitly handled by PythonHadoopUtil 
> the way that other numeric writables (like LongWritable) are. The result is 
> that the ShortWritable is not converted to an object that can be serialized 
> by Spark, and a serialization error occurs. Below is an example stack trace 
> from within the pyspark shell:
> {code:java}
> >>> rdd = 
> >>> sc.newAPIHadoopRDD(inputFormatClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].EsInputFormat";,
> ... 
> keyClass="[org.apache.hadoop.io|http://org.apache.hadoop.io/].NullWritable";,
> ... 
> valueClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].LinkedMapWritable";,
> ... conf=conf)
> 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage 
> 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1); not retrying
> Traceback (most recent call last):
>  File "", line 4, in 
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", 
> line 853, in newAPIHadoopRDD
>   jconf, batchSize)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", 
> line 111, in deco
>   return f(*a, **kw)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 
> in stage 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1)
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196

[jira] [Commented] (SPARK-37598) Pyspark's newAPIHadoopRDD() method fails with ShortWritables

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456623#comment-17456623
 ] 

Apache Spark commented on SPARK-37598:
--

User 'masseyke' has created a pull request for this issue:
https://github.com/apache/spark/pull/34838

> Pyspark's newAPIHadoopRDD() method fails with ShortWritables
> 
>
> Key: SPARK-37598
> URL: https://issues.apache.org/jira/browse/SPARK-37598
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Keith Massey
>Priority: Minor
>
> If sc.newAPIHadoopRDD() is called from PySpark using an InputFormat that has 
> a ShortWritable as a field, then the call to newAPIHadoopRDD() fails. The 
> reason is that ShortWritable is not explicitly handled by PythonHadoopUtil 
> the way that other numeric writables (like LongWritable) are. The result is 
> that the ShortWritable is not converted to an object that can be serialized 
> by Spark, and a serialization error occurs. Below is an example stack trace 
> from within the pyspark shell:
> {code:java}
> >>> rdd = 
> >>> sc.newAPIHadoopRDD(inputFormatClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].EsInputFormat";,
> ... 
> keyClass="[org.apache.hadoop.io|http://org.apache.hadoop.io/].NullWritable";,
> ... 
> valueClass="[org.elasticsearch.hadoop.mr|http://org.elasticsearch.hadoop.mr/].LinkedMapWritable";,
> ... conf=conf)
> 2021-12-08 14:38:40,439 ERROR scheduler.TaskSetManager: task 0.0 in stage 
> 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1); not retrying
> Traceback (most recent call last):
>  File "", line 4, in 
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/context.py", 
> line 853, in newAPIHadoopRDD
>   jconf, batchSize)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1305, in __call__
>  File "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py", 
> line 111, in deco
>   return f(*a, **kw)
>  File 
> "/home/hduser/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 
> in stage 15.0 (TID 31) had a not serializable result: 
> org.apache.hadoop.io.ShortWritable
> Serialization stack:
> - object not serializable (class: 
> [org.apache.hadoop.io|http://org.apache.hadoop.io/].ShortWritable, value: 1)
> - writeObject data (class: java.util.HashMap)
> - object (class java.util.HashMap, \{price=1})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, (1,\{price=1}))
> - element of array (index: 0)
> - array (class [Lscala.Tuple2;, size 1)
> at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
> at scala.Option.foreach(Option.scala:407)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob

[jira] [Commented] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456662#comment-17456662
 ] 

Apache Spark commented on SPARK-37575:
--

User 'wayneguow' has created a pull request for this issue:
https://github.com/apache/spark/pull/34853

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37575:


Assignee: (was: Apache Spark)

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37575) Empty strings and null values are both saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37575:


Assignee: Apache Spark

> Empty strings and null values are both saved as quoted empty Strings "" 
> rather than "" (for empty strings) and nothing(for null values)
> ---
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Assignee: Apache Spark
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34332) Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests

2021-12-09 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-34332:
--
Summary: Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests  (was: Unify 
v1 and v2 ALTER TABLE .. SET LOCATION tests)

> Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER TABLE .. SET LOCATION tests to a common place to run them for 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34332) Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests

2021-12-09 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-34332:
--
Description: Extract ALTER NAMESPACE .. SET LOCATION tests to a common place 
to run them for V1 and V2 datasources. Some tests can be placed in V1- and 
V2-specific test suites.  (was: Extract ALTER TABLE .. SET LOCATION tests to 
the common place to run them for V1 and v2 datasources. Some tests can be 
places to V1 and V2 specific test suites.)

> Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER NAMESPACE .. SET LOCATION tests to a common place to run them 
> for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific 
> test suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37599) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests

2021-12-09 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-37599:
--
Description: Extract ALTER TABLE .. SET LOCATION tests to a common place to 
run them for V1 and V2 datasources. Some tests can be placed in V1- and 
V2-specific test suites.  (was: Extract ALTER NAMESPACE .. SET LOCATION tests to 
the common place to run them for V1 and v2 datasources. Some tests can be 
places to V1 and V2 specific test suites.)

> Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
> -
>
> Key: SPARK-37599
> URL: https://issues.apache.org/jira/browse/SPARK-37599
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER TABLE .. SET LOCATION tests to a common place to run them for 
> V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37599) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests

2021-12-09 Thread Terry Kim (Jira)
Terry Kim created SPARK-37599:
-

 Summary: Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
 Key: SPARK-37599
 URL: https://issues.apache.org/jira/browse/SPARK-37599
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim
Assignee: Terry Kim
 Fix For: 3.3.0


Extract ALTER NAMESPACE .. SET LOCATION tests to a common place to run them 
for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific test 
suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34332) Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456674#comment-17456674
 ] 

Apache Spark commented on SPARK-34332:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34854

> Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER NAMESPACE .. SET LOCATION tests to a common place to run them 
> for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific 
> test suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34332) Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456676#comment-17456676
 ] 

Apache Spark commented on SPARK-34332:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34854

> Unify v1 and v2 ALTER NAMESPACE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER NAMESPACE .. SET LOCATION tests to a common place to run them 
> for V1 and V2 datasources. Some tests can be placed in V1- and V2-specific 
> test suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37600) Upgrade to Hadoop 3.3.2

2021-12-09 Thread Chao Sun (Jira)
Chao Sun created SPARK-37600:


 Summary: Upgrade to Hadoop 3.3.2
 Key: SPARK-37600
 URL: https://issues.apache.org/jira/browse/SPARK-37600
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Chao Sun


Upgrade Spark to use Hadoop 3.3.2 once it's released.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37575) Null values are saved as quoted empty Strings "" rather than nothing

2021-12-09 Thread Guo Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guo Wei updated SPARK-37575:

Summary: Null values are saved as quoted empty Strings "" rather than 
nothing  (was: Empty strings and null values are both saved as quoted empty 
Strings "" rather than "" (for empty strings) and nothing(for null values))

> Null values are saved as quoted empty Strings "" rather than nothing
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37575) Null values are saved as quoted empty Strings "" rather than nothing

2021-12-09 Thread Guo Wei (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37575 ]


Guo Wei deleted comment on SPARK-37575:
-

was (Author: wayne guo):
I also found that if emptyValueInRead set to "\"\"", reading csv data as show 
below:
{noformat}
name,brand,comment
tesla,,""{noformat}
The final result shows as follows:
||name||brand||comment||
|tesla|null|""|

But, the expected result should be:
||name||brand||comment||
|tesla|null| |

> Null values are saved as quoted empty Strings "" rather than nothing
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Guo Wei
>Priority: Major
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
>  
> But actually, both empty strings and null values are saved as quoted empty 
> Strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37600) Upgrade to Hadoop 3.3.2

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37600:


Assignee: Apache Spark

> Upgrade to Hadoop 3.3.2
> ---
>
> Key: SPARK-37600
> URL: https://issues.apache.org/jira/browse/SPARK-37600
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Upgrade Spark to use Hadoop 3.3.2 once it's released.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37600) Upgrade to Hadoop 3.3.2

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456694#comment-17456694
 ] 

Apache Spark commented on SPARK-37600:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34855

> Upgrade to Hadoop 3.3.2
> ---
>
> Key: SPARK-37600
> URL: https://issues.apache.org/jira/browse/SPARK-37600
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Upgrade Spark to use Hadoop 3.3.2 once it's released.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37600) Upgrade to Hadoop 3.3.2

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37600:


Assignee: (was: Apache Spark)

> Upgrade to Hadoop 3.3.2
> ---
>
> Key: SPARK-37600
> URL: https://issues.apache.org/jira/browse/SPARK-37600
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Upgrade Spark to use Hadoop 3.3.2 once it's released.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)
Rafal Wojdyla created SPARK-37601:
-

 Summary: Could/should sql.DataFrame.transform accept function 
parameters?
 Key: SPARK-37601
 URL: https://issues.apache.org/jira/browse/SPARK-37601
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0, 3.1.2
Reporter: Rafal Wojdyla


{noformat}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

df.tranfrom(foo, p=3)

# vs

from functools import partial
df.tranform(partial(foo, p=3))
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19256) Hive bucketing write support

2021-12-09 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-19256:
-
Affects Version/s: 3.3.0

> Hive bucketing write support
> 
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Tejas Patil
>Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress for Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed tables that are compatible with 
> other compute engines (Hive and Presto).
>  
> Current status for Hive bucketed table in Spark:
> No support for reading Hive bucketed tables: a bucketed table is read as a 
> non-bucketed table.
> Wrong behavior for writing Hive ORC and Parquet bucketed tables: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default (code path: InsertIntoHiveTable); the exception can be 
> disabled by setting `hive.enforce.bucketing`=false and 
> `hive.enforce.sorting`=false, which writes the table as non-bucketed.
>  
> Current status for Hive bucketed table in Hive:
> Hive 3.0.0 and after: support writing bucketed table with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: support writing bucketed table with Hive hivehash.
> Hive on Tez: support zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-14014). More code pointers for the 
> read path: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212]
>  .
>  
> Current status for Hive bucketed table in Presto (take presto-sql here):
> Support writing bucketed table with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Support zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, 
> Presto and Hive. With this JIRA, we need to add support for writing Hive 
> bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 
> 1.x.y and 2.x.y).
>  
> Allowing Spark to efficiently read Hive bucketed tables requires a more 
> radical change, so we decided to wait until Data Source V2 supports bucketing 
> and implement the read path there. The read path will not be covered by this 
> JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]
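
The enforcement escape hatch mentioned in the status above, spelled out as a 
sketch (the exact way of passing the configs may differ by deployment; use with 
care, since the table is then written as a non-bucketed table and other engines 
must not rely on its bucketing layout):

{code:java}
// Hypothetical session-level workaround referenced in this ticket.
spark.conf.set("hive.enforce.bucketing", "false")
spark.conf.set("hive.enforce.sorting", "false")
{code}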



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37601:
--
Description: 
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.tranform(partial(foo, p=3))

# vs suggested:

df.tranfrom(foo, p=3)
{code}

  was:
{noformat}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.tranform(partial(foo, p=3))

# vs suggested:

df.tranfrom(foo, p=3)
{noformat}


> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {code:python}
> def foo(df: DataFrame, p: int) -> DataFrame
>   ...
> # current:
> from functools import partial
> df.tranform(partial(foo, p=3))
> # vs suggested:
> df.tranfrom(foo, p=3)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37601:
--
Description: 
{noformat}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.tranform(partial(foo, p=3))

# vs suggested:

df.tranfrom(foo, p=3)
{noformat}

  was:
{noformat}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

df.tranfrom(foo, p=3)

# vs

from functools import partial
df.tranform(partial(foo, p=3))
{noformat}


> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {noformat}
> def foo(df: DataFrame, p: int) -> DataFrame
>   ...
> # current:
> from functools import partial
> df.tranform(partial(foo, p=3))
> # vs suggested:
> df.tranfrom(foo, p=3)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37601:
--
Description: 
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.transform(partial(foo, p=3))

# vs suggested:

df.transform(foo, p=3)
{code}

  was:
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.tranform(partial(foo, p=3))

# vs suggested:

df.tranfrom(foo, p=3)
{code}


> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {code:python}
> def foo(df: DataFrame, p: int) -> DataFrame
>   ...
> # current:
> from functools import partial
> df.transform(partial(foo, p=3))
> # vs suggested:
> df.transform(foo, p=3)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37602) Add config property to set default Spark listeners

2021-12-09 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-37602:
---

 Summary: Add config property to set default Spark listeners
 Key: SPARK-37602
 URL: https://issues.apache.org/jira/browse/SPARK-37602
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Shardul Mahadik


{{spark.extraListeners}} allows users to register custom Spark listeners. As 
Spark platform administrators, we want to set our own set of "default" 
listeners for all jobs on our platforms; however, using 
{{spark.extraListeners}} makes it easy for our clients to unknowingly override 
our default listeners.

I would like to propose adding {{spark.defaultListeners}}, which is intended 
to be set by administrators and will be combined with {{spark.extraListeners}} 
at runtime. This is similar in spirit to {{spark.driver.defaultJavaOptions}} 
and {{spark.plugins.defaultList}}.
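
A minimal sketch of how the two properties would be combined, assuming the 
proposal above: spark.extraListeners is an existing property, 
spark.defaultListeners is only proposed in this ticket, and the listener class 
names are hypothetical placeholders.

{code:python}
# Sketch under the proposal's assumptions: spark.defaultListeners does not
# exist yet, and the com.example.* listener classes are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # would be set by the platform (proposed property):
    .config("spark.defaultListeners", "com.example.AuditListener")
    # set by the user today (existing property):
    .config("spark.extraListeners", "com.example.UserMetricsListener")
    .getOrCreate()
)
# Under the proposal, both lists of listeners are registered at startup.
{code}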



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37601:
--
Description: 
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.transform(partial(foo, p=3))

# or

df.transform(lambda df: foo(df, p=3))

# vs suggested:

df.transform(foo, p=3)
{code}

  was:
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.transform(partial(foo, p=3))

# vs suggested:

df.transform(foo, p=3)
{code}


> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {code:python}
> def foo(df: DataFrame, p: int) -> DataFrame
>   ...
> # current:
> from functools import partial
> df.transform(partial(foo, p=3))
> # or
> df.transform(lambda df: foo(df, p=3))
> # vs suggested:
> df.transform(foo, p=3)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37601) Could/should sql.DataFrame.transform accept function parameters?

2021-12-09 Thread Rafal Wojdyla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-37601:
--
Description: 
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.transform(partial(foo, p=3))

# or:

df.transform(lambda df: foo(df, p=3))

# vs suggested:

df.transform(foo, p=3)
{code}

  was:
{code:python}

def foo(df: DataFrame, p: int) -> DataFrame
  ...

# current:

from functools import partial
df.transform(partial(foo, p=3))

# or

df.transform(lambda df: foo(df, p=3))

# vs suggested:

df.transform(foo, p=3)
{code}


> Could/should sql.DataFrame.transform accept function parameters?
> 
>
> Key: SPARK-37601
> URL: https://issues.apache.org/jira/browse/SPARK-37601
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Rafal Wojdyla
>Priority: Major
>
> {code:python}
> def foo(df: DataFrame, p: int) -> DataFrame
>   ...
> # current:
> from functools import partial
> df.transform(partial(foo, p=3))
> # or:
> df.transform(lambda df: foo(df, p=3))
> # vs suggested:
> df.transform(foo, p=3)
> {code}
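
A minimal runnable sketch of the idea against the existing 
DataFrame.transform; foo and transform_with are illustrative names, not 
PySpark APIs.

{code:python}
# Sketch only: transform_with forwards extra arguments, matching the effect of
# the suggested df.transform(foo, p=3). foo/transform_with are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def foo(df: DataFrame, p: int) -> DataFrame:
    return df.withColumn("scaled", F.col("id") * p)

def transform_with(df: DataFrame, func, *args, **kwargs) -> DataFrame:
    # Equivalent to df.transform(lambda d: func(d, *args, **kwargs)).
    return df.transform(lambda d: func(d, *args, **kwargs))

spark = SparkSession.builder.getOrCreate()
transform_with(spark.range(5), foo, p=3).show()
{code}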



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37602) Add config property to set default Spark listeners

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456835#comment-17456835
 ] 

Apache Spark commented on SPARK-37602:
--

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/34856

> Add config property to set default Spark listeners
> --
>
> Key: SPARK-37602
> URL: https://issues.apache.org/jira/browse/SPARK-37602
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
>
> {{spark.extraListeners}} allows users to register custom Spark listeners. As 
> Spark platform administrators, we want to set our own set of "default" 
> listeners for all jobs on our platforms; however, using 
> {{spark.extraListeners}} makes it easy for our clients to unknowingly 
> override our default listeners.
> I would like to propose adding {{spark.defaultListeners}}, which is intended 
> to be set by administrators and will be combined with 
> {{spark.extraListeners}} at runtime. This is similar in spirit to 
> {{spark.driver.defaultJavaOptions}} and {{spark.plugins.defaultList}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37602) Add config property to set default Spark listeners

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37602:


Assignee: Apache Spark

> Add config property to set default Spark listeners
> --
>
> Key: SPARK-37602
> URL: https://issues.apache.org/jira/browse/SPARK-37602
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: Apache Spark
>Priority: Trivial
>
> {{spark.extraListeners}} allows users to register custom Spark listeners. As 
> Spark platform administrators, we want to set our own set of "default" 
> listeners for all jobs on our platforms; however, using 
> {{spark.extraListeners}} makes it easy for our clients to unknowingly 
> override our default listeners.
> I would like to propose adding {{spark.defaultListeners}}, which is intended 
> to be set by administrators and will be combined with 
> {{spark.extraListeners}} at runtime. This is similar in spirit to 
> {{spark.driver.defaultJavaOptions}} and {{spark.plugins.defaultList}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37602) Add config property to set default Spark listeners

2021-12-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37602:


Assignee: (was: Apache Spark)

> Add config property to set default Spark listeners
> --
>
> Key: SPARK-37602
> URL: https://issues.apache.org/jira/browse/SPARK-37602
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
>
> {{spark.extraListeners}} allows users to register custom Spark listeners. As 
> Spark platform administrators, we want to set our own set of "default" 
> listeners for all jobs on our platforms; however, using 
> {{spark.extraListeners}} makes it easy for our clients to unknowingly 
> override our default listeners.
> I would like to propose adding {{spark.defaultListeners}}, which is intended 
> to be set by administrators and will be combined with 
> {{spark.extraListeners}} at runtime. This is similar in spirit to 
> {{spark.driver.defaultJavaOptions}} and {{spark.plugins.defaultList}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37602) Add config property to set default Spark listeners

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456833#comment-17456833
 ] 

Apache Spark commented on SPARK-37602:
--

User 'shardulm94' has created a pull request for this issue:
https://github.com/apache/spark/pull/34856

> Add config property to set default Spark listeners
> --
>
> Key: SPARK-37602
> URL: https://issues.apache.org/jira/browse/SPARK-37602
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
>
> {{spark.extraListeners}} allows users to register custom Spark listeners. As 
> Spark platform administrators, we want to set our own set of "default" 
> listeners for all jobs on our platforms; however, using 
> {{spark.extraListeners}} makes it easy for our clients to unknowingly 
> override our default listeners.
> I would like to propose adding {{spark.defaultListeners}}, which is intended 
> to be set by administrators and will be combined with 
> {{spark.extraListeners}} at runtime. This is similar in spirit to 
> {{spark.driver.defaultJavaOptions}} and {{spark.plugins.defaultList}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456856#comment-17456856
 ] 

Apache Spark commented on SPARK-36850:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34857

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-12-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456857#comment-17456857
 ] 

Apache Spark commented on SPARK-36850:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34857

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-09 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456860#comment-17456860
 ] 

Nicholas Chammas commented on SPARK-26589:
--

That makes sense to me. I've been struggling with how to approach the 
implementation, so I've posted to the dev list asking for a little more help.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets are linked to "SPARK-6761 Approximate quantile" 
> as duplicates of it. The thing is that the approximate quantile is a 
> workaround for the lack of a median function. Thus I am filing this feature 
> request for a proper, exact median function, not an approximation. I am 
> aware of the difficulties that a distributed environment causes when 
> computing a median; nevertheless, I don't think those difficulties are a 
> good enough reason to keep a `median` function out of Spark's scope. I am 
> not asking for an efficient median, just an exact one.
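
As a point of reference, a small sketch of the workaround available today, 
not the requested median() API: the SQL aggregate percentile is exact, while 
percentile_approx is not.

{code:python}
# Sketch of today's workaround (not a median() API): `percentile` is exact but
# buffers values per group; `percentile_approx` trades accuracy for memory.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (10,)], ["x"])

df.agg(
    F.expr("percentile(x, 0.5)").alias("exact_median"),         # 2.5
    F.expr("percentile_approx(x, 0.5)").alias("approx_median"),
).show()
{code}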



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37597) Deduplicate the right side of left-semi join and left-anti join

2021-12-09 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-37597.
--
Resolution: Duplicate

> Deduplicate the right side of left-semi join and left-anti join
> ---
>
> Key: SPARK-37597
> URL: https://issues.apache.org/jira/browse/SPARK-37597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37567) reuse Exchange failed

2021-12-09 Thread junbiao chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

junbiao chen updated SPARK-37567:
-
Description: 
PR: https://github.com/chenjunbiao001/spark/pull/1

Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!

  was:
Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!


> reuse Exchange failed 
> --
>
> Key: SPARK-37567
> URL: https://issues.apache.org/jira/browse/SPARK-37567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: junbiao chen
>Priority: Major
>  Labels: performance
> Attachments: execution stage(1)-query2.png, execution 
> stage-query2.png, physical plan-query2.png
>
>
> PR: https://github.com/chenjunbiao001/spark/pull/1
> Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
> "store_sales" in the logical plan, and these subqueries satisfy the exchange 
> reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
> but when Spark executes the physical plan, exchange reuse fails: the 
> supposedly reused exchange is executed twice.
> physical plan:
> !physical plan-query2.png!
>  
> execution stages:
> !execution stage-query2.png!
>  
> !execution stage(1)-query2.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37567) reuse Exchange failed

2021-12-09 Thread junbiao chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

junbiao chen updated SPARK-37567:
-
Description: 
 

Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!

  was:
PR: https://github.com/chenjunbiao001/spark/pull/1

Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!


> reuse Exchange failed 
> --
>
> Key: SPARK-37567
> URL: https://issues.apache.org/jira/browse/SPARK-37567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: junbiao chen
>Priority: Major
>  Labels: performance
> Attachments: execution stage(1)-query2.png, execution 
> stage-query2.png, physical plan-query2.png
>
>
>  
> Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
> "store_sales" in the logical plan, and these subqueries satisfy the exchange 
> reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
> but when Spark executes the physical plan, exchange reuse fails: the 
> supposedly reused exchange is executed twice.
> physical plan:
> !physical plan-query2.png!
>  
> execution stages:
> !execution stage-query2.png!
>  
> !execution stage(1)-query2.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37567) reuse Exchange failed

2021-12-09 Thread junbiao chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

junbiao chen updated SPARK-37567:
-
Description: 
PR available: https://github.com/apache/spark/pull/34858

Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!

  was:
 

Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
"store_sales" in the logical plan, and these subqueries satisfy the exchange 
reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
but when Spark executes the physical plan, exchange reuse fails: the 
supposedly reused exchange is executed twice.

physical plan:

!physical plan-query2.png!

 

execution stages:

!execution stage-query2.png!

 

!execution stage(1)-query2.png!


> reuse Exchange failed 
> --
>
> Key: SPARK-37567
> URL: https://issues.apache.org/jira/browse/SPARK-37567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: junbiao chen
>Priority: Major
>  Labels: performance
> Attachments: execution stage(1)-query2.png, execution 
> stage-query2.png, physical plan-query2.png
>
>
> PR available: https://github.com/apache/spark/pull/34858
> Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
> "store_sales" in the logical plan, and these subqueries satisfy the exchange 
> reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
> but when Spark executes the physical plan, exchange reuse fails: the 
> supposedly reused exchange is executed twice.
> physical plan:
> !physical plan-query2.png!
>  
> execution stages:
> !execution stage-query2.png!
>  
> !execution stage(1)-query2.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37603) org.apache.spark.sql.catalyst.expressions.ListQuery; no valid constructor

2021-12-09 Thread huangweiguo (Jira)
huangweiguo created SPARK-37603:
---

 Summary: org.apache.spark.sql.catalyst.expressions.ListQuery; no 
valid constructor
 Key: SPARK-37603
 URL: https://issues.apache.org/jira/browse/SPARK-37603
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: huangweiguo


I don't use the class org.apache.spark.sql.catalyst.expressions.ListQuery 
directly, but it fails with a serialization exception (maybe during 
deserialization):

 
{code:java}
//stack
Caused by: java.io.InvalidClassException: 
org.apache.spark.sql.catalyst.expressions.ListQuery; no valid constructor
at 
java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
at 
java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2001)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java

[jira] [Updated] (SPARK-37567) reuse Exchange failed

2021-12-09 Thread junbiao chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

junbiao chen updated SPARK-37567:
-
Labels: bugfix performance pull-request-available  (was: performance)

> reuse Exchange failed 
> --
>
> Key: SPARK-37567
> URL: https://issues.apache.org/jira/browse/SPARK-37567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: junbiao chen
>Priority: Major
>  Labels: bugfix, performance, pull-request-available
> Attachments: execution stage(1)-query2.png, execution 
> stage-query2.png, physical plan-query2.png
>
>
> PR available: https://github.com/apache/spark/pull/34858
> Use case: query2 in TPC-DS. Three exchange subqueries scan the same table 
> "store_sales" in the logical plan, and these subqueries satisfy the exchange 
> reuse rule. I confirmed that the reuse rule is applied in the physical plan, 
> but when Spark executes the physical plan, exchange reuse fails: the 
> supposedly reused exchange is executed twice.
> physical plan:
> !physical plan-query2.png!
>  
> execution stages:
> !execution stage-query2.png!
>  
> !execution stage(1)-query2.png!
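
As a side note, a toy sketch (not the TPC-DS query from this ticket) for 
checking whether the reuse rule fired, by looking for a ReusedExchange node 
in the formatted plan.

{code:python}
# Sketch only: both join sides share the same aggregation subplan, so a
# ReusedExchange node is expected in the plan if exchange reuse works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000000).selectExpr("id % 100 AS k", "id AS v")
agg = df.groupBy("k").count()
joined = agg.alias("a").join(agg.alias("b"), "k")

joined.explain(mode="formatted")  # look for ReusedExchange in the output
{code}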



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


