[jira] [Updated] (SPARK-45954) Avoid generating redundant ShuffleExchangeExec node
[ https://issues.apache.org/jira/browse/SPARK-45954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45954:
--------------------------------
    Summary: Avoid generating redundant ShuffleExchangeExec node  (was: Remove redundant shuffles)

> Avoid generating redundant ShuffleExchangeExec node
> ---------------------------------------------------
>
>                 Key: SPARK-45954
>                 URL: https://issues.apache.org/jira/browse/SPARK-45954
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45954) Remove redundant shuffles
Yuming Wang created SPARK-45954:
-----------------------------------

             Summary: Remove redundant shuffles
                 Key: SPARK-45954
                 URL: https://issues.apache.org/jira/browse/SPARK-45954
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description:
We should set the view name to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

  was:
Need to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> We should set the view name to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Description:
Need to sparkSession.sparkContext.setJobDescription("xxx")

!screenshot-1.png!

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
> Need to sparkSession.sparkContext.setJobDescription("xxx")
> !screenshot-1.png!
[jira] [Created] (SPARK-45947) Set a human readable description for Dataset api
Yuming Wang created SPARK-45947:
-----------------------------------

             Summary: Set a human readable description for Dataset api
                 Key: SPARK-45947
                 URL: https://issues.apache.org/jira/browse/SPARK-45947
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
         Attachments: screenshot-1.png
[jira] [Updated] (SPARK-45947) Set a human readable description for Dataset api
[ https://issues.apache.org/jira/browse/SPARK-45947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45947:
--------------------------------
    Attachment: screenshot-1.png

> Set a human readable description for Dataset api
> ------------------------------------------------
>
>                 Key: SPARK-45947
>                 URL: https://issues.apache.org/jira/browse/SPARK-45947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>         Attachments: screenshot-1.png
>
[jira] [Updated] (SPARK-45915) Treat decimal(x, 0) the same as IntegralType in PromoteStrings
[ https://issues.apache.org/jira/browse/SPARK-45915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45915:
--------------------------------
    Summary: Treat decimal(x, 0) the same as IntegralType in PromoteStrings  (was: Unwrap cast in predicate)

> Treat decimal(x, 0) the same as IntegralType in PromoteStrings
> --------------------------------------------------------------
>
>                 Key: SPARK-45915
>                 URL: https://issues.apache.org/jira/browse/SPARK-45915
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
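For illustration only (the table name and literals below are hypothetical, not taken from the ticket), a sketch of the kind of comparison PromoteStrings handles: when a string column is compared with a numeric literal, type promotion decides which side gets cast, and treating a scale-0 decimal like an integral type would make the two predicates below promote the same way.

```scala
// Hypothetical sketch; t45915 and the literal values are made up.
spark.sql("create table t45915(id string) using parquet")

// An integral literal and a scale-0 decimal literal (Spark's BD suffix)
// denote the same value, but PromoteStrings can currently promote the
// string column `id` differently for the two comparisons:
spark.sql("select * from t45915 where id = 123").explain(true)
spark.sql("select * from t45915 where id = 123BD").explain(true)
```

Comparing the two analyzed plans shows whether the casts inserted around `id` match, which is the behavior this ticket proposes to unify.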
[jira] [Created] (SPARK-45915) Unwrap cast in predicate
Yuming Wang created SPARK-45915:
-----------------------------------

             Summary: Unwrap cast in predicate
                 Key: SPARK-45915
                 URL: https://issues.apache.org/jira/browse/SPARK-45915
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Created] (SPARK-45909) Remove the cast if it can safely up-cast in IsNotNull
Yuming Wang created SPARK-45909:
-----------------------------------

             Summary: Remove the cast if it can safely up-cast in IsNotNull
                 Key: SPARK-45909
                 URL: https://issues.apache.org/jira/browse/SPARK-45909
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
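A sketch of the pattern the summary describes (hypothetical table and column names): a cast that can safely up-cast never turns a non-null value into null, so `IsNotNull(Cast(a))` is equivalent to `IsNotNull(a)`, and dropping the cast can let the null filter be pushed down to the data source.

```scala
// Hypothetical sketch; t45909 is made up for illustration.
spark.sql("create table t45909(a int) using parquet")

// cast(a as bigint) is a safe up-cast (int -> bigint), so
//   isnotnull(cast(a as bigint))  is equivalent to  isnotnull(a)
// and the simplified predicate can be pushed to the Parquet scan.
spark.sql("select * from t45909 where cast(a as bigint) is not null").explain(true)
```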
[jira] [Updated] (SPARK-45894) hive table level setting hadoop.mapred.max.split.size
[ https://issues.apache.org/jira/browse/SPARK-45894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45894:
--------------------------------
    Target Version/s:   (was: 3.5.0)

> hive table level setting hadoop.mapred.max.split.size
> -----------------------------------------------------
>
>                 Key: SPARK-45894
>                 URL: https://issues.apache.org/jira/browse/SPARK-45894
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: guihuawen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> In a Hive table scan, configuring the hadoop.mapred.max.split.size parameter can increase the parallelism of the scan stage and thereby reduce the running time.
> However, when a large table and a small table appear in the same query, a single global hadoop.mapred.max.split.size means some stages run very many tasks while others run very few. Allowing hadoop.mapred.max.split.size to be set separately for each Hive table keeps the parallelism balanced.
[jira] [Created] (SPARK-45895) Combine multiple like to like all
Yuming Wang created SPARK-45895:
-----------------------------------

             Summary: Combine multiple like to like all
                 Key: SPARK-45895
                 URL: https://issues.apache.org/jira/browse/SPARK-45895
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

{code:scala}
spark.sql("create table t(a string, b string, c string) using parquet")
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like '%a%' and
    |substr(a, 1, 5) like '%b%'
    |""".stripMargin).explain(true)
{code}
We can optimize the query to:
{code:scala}
spark.sql(
  """
    |select * from t where
    |substr(a, 1, 5) like all('%a%', '%b%')
    |""".stripMargin).explain(true)
{code}
[jira] [Created] (SPARK-45853) Add Iceberg and Hudi to third party projects
Yuming Wang created SPARK-45853:
-----------------------------------

             Summary: Add Iceberg and Hudi to third party projects
                 Key: SPARK-45853
                 URL: https://issues.apache.org/jira/browse/SPARK-45853
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

{noformat}
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: java.util.concurrent.ExecutionException: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: iceberg. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
{noformat}
[jira] [Updated] (SPARK-45848) spark-build-info.ps1 missing the docroot property
[ https://issues.apache.org/jira/browse/SPARK-45848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45848:
--------------------------------
    Description:
https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36

  was: https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44

> spark-build-info.ps1 missing the docroot property
> -------------------------------------------------
>
>                 Key: SPARK-45848
>                 URL: https://issues.apache.org/jira/browse/SPARK-45848
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
> https://github.com/apache/spark/blob/master/build/spark-build-info#L30-L36
[jira] [Created] (SPARK-45848) spark-build-info.ps1 missing the docroot property
Yuming Wang created SPARK-45848:
-----------------------------------

             Summary: spark-build-info.ps1 missing the docroot property
                 Key: SPARK-45848
                 URL: https://issues.apache.org/jira/browse/SPARK-45848
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 4.0.0
            Reporter: Yuming Wang

https://github.com/apache/spark/blob/master/build/spark-build-info.ps1#L38-L44
[jira] [Updated] (SPARK-45755) Push down limit through Dataset.isEmpty()
[ https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45755:
--------------------------------
    Description:
Pushing down LocalLimit cannot optimize the distinct case.
{code:scala}
def isEmpty: Boolean = withAction("isEmpty", withTypedPlan {
  LocalLimit(Literal(1), select().logicalPlan)
}.queryExecution) { plan =>
  plan.executeTake(1).isEmpty
}
{code}

> Push down limit through Dataset.isEmpty()
> -----------------------------------------
>
>                 Key: SPARK-45755
>                 URL: https://issues.apache.org/jira/browse/SPARK-45755
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Pushing down LocalLimit cannot optimize the distinct case.
> {code:scala}
> def isEmpty: Boolean = withAction("isEmpty", withTypedPlan {
>   LocalLimit(Literal(1), select().logicalPlan)
> }.queryExecution) { plan =>
>   plan.executeTake(1).isEmpty
> }
> {code}
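A sketch of the limitation described above (made-up data): `distinct()` becomes an Aggregate node in the plan, and the `LocalLimit(1)` that `isEmpty` places on top of it is not pushed below the aggregation, so the full distinct still runs even though a single output row would be enough to answer the question.

```scala
// Hypothetical sketch: isEmpty plans LocalLimit(1, Aggregate(...)),
// and the limit is not pushed below the Aggregate that implements
// distinct(), so the whole aggregation executes.
val ds = spark.range(0, 1000000).selectExpr("id % 10 as k").distinct()
ds.isEmpty
```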
[jira] [Created] (SPARK-45755) Push down limit through Dataset.isEmpty()
Yuming Wang created SPARK-45755:
-----------------------------------

             Summary: Push down limit through Dataset.isEmpty()
                 Key: SPARK-45755
                 URL: https://issues.apache.org/jira/browse/SPARK-45755
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45658:
--------------------------------
    Target Version/s:   (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> ----------------------------------------------------
>
>                 Key: SPARK-45658
>                 URL: https://issues.apache.org/jira/browse/SPARK-45658
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Asif
>            Priority: Major
>
> The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken, because the buildKeys are canonicalized just by calling
> buildKeys.map(_.canonicalized)
> This results in incorrect canonicalization, because it does not normalize the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the buildKeys expressions, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and bug test for the same.
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45658:
--------------------------------
    Affects Version/s:   (was: 3.5.1)

> Canonicalization of DynamicPruningSubquery is broken
> ----------------------------------------------------
>
>                 Key: SPARK-45658
>                 URL: https://issues.apache.org/jira/browse/SPARK-45658
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Asif
>            Priority: Major
>
> The canonicalization of buildKeys: Seq[Expression] in the class DynamicPruningSubquery is broken, because the buildKeys are canonicalized just by calling
> buildKeys.map(_.canonicalized)
> This results in incorrect canonicalization, because it does not normalize the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the buildKeys expressions, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output))
> Will be filing a PR and bug test for the same.
[jira] [Commented] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1591#comment-1591 ]

Yuming Wang commented on SPARK-43851:
-------------------------------------

The resolution should be unresolved.

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Reopened] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reopened SPARK-43851:
---------------------------------
    Assignee:   (was: Jia Fan)

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>             Fix For: 3.5.0
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Updated] (SPARK-43851) Support LCA in grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-43851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43851:
--------------------------------
    Fix Version/s:   (was: 3.5.0)

> Support LCA in grouping expressions
> -----------------------------------
>
>                 Key: SPARK-43851
>                 URL: https://issues.apache.org/jira/browse/SPARK-43851
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Teradata supports it:
> {code:sql}
> create table t1(a int) using parquet;
> select a + 1 as a1, a1 + 1 as a2 from t1 group by a1, a2;
> {code}
> {noformat}
> [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_GROUP_BY] The feature is not supported: Referencing a lateral column alias via GROUP BY alias/ALL is not supported yet.
> {noformat}
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45454:
--------------------------------
        Parent:   (was: SPARK-30016)
    Issue Type: Improvement  (was: Sub-task)

> Set the table's default owner to current_user
> ---------------------------------------------
>
>                 Key: SPARK-45454
>                 URL: https://issues.apache.org/jira/browse/SPARK-45454
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Updated] (SPARK-45454) Set the table's default owner to current_user
[ https://issues.apache.org/jira/browse/SPARK-45454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45454:
--------------------------------
    Summary: Set the table's default owner to current_user  (was: Set owner of DS v2 table to CURRENT_USER if it is already set)

> Set the table's default owner to current_user
> ---------------------------------------------
>
>                 Key: SPARK-45454
>                 URL: https://issues.apache.org/jira/browse/SPARK-45454
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Created] (SPARK-45454) Set owner of DS v2 table to CURRENT_USER if it is already set
Yuming Wang created SPARK-45454:
-----------------------------------

             Summary: Set owner of DS v2 table to CURRENT_USER if it is already set
                 Key: SPARK-45454
                 URL: https://issues.apache.org/jira/browse/SPARK-45454
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Updated] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-45387:
--------------------------------
    Target Version/s:   (was: 3.1.1, 3.3.0)

> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
>                 Key: SPARK-45387
>                 URL: https://issues.apache.org/jira/browse/SPARK-45387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>            Reporter: TianyiMa
>            Priority: Critical
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt`, which is StringType, and the table metadata is managed by Hive Metastore. If we filter partitions by dt = '123', the filter can be pushed down to the data source. But if the filter value is a number, e.g. dt = 123, it cannot be pushed down, causing Spark to pull all of that table's partition metadata to the client. This performs poorly if the table has thousands of partitions, and it increases the risk of a Hive Metastore OOM.
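A sketch of the behavior the ticket describes (the table schema below is an assumption based on the description): with a STRING partition column, a string literal prunes partitions, while a numeric literal wraps the column in a cast and defeats pushdown to the metastore.

```scala
// Hypothetical sketch; table_pt mirrors the table described in the ticket.
spark.sql("create table table_pt(id bigint) partitioned by (dt string) stored as parquet")

// String literal: the partition filter is pushed to the Hive metastore
// and only matching partitions are listed.
spark.sql("select * from table_pt where dt = '123'").explain(true)

// Numeric literal: the comparison becomes cast(dt as int) = 123, the
// partition filter is not pushed down, and all partition metadata is
// fetched to the client before filtering.
spark.sql("select * from table_pt where dt = 123").explain(true)
```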
[jira] [Created] (SPARK-45369) Push down limit through generate
Yuming Wang created SPARK-45369:
-----------------------------------

             Summary: Push down limit through generate
                 Key: SPARK-45369
                 URL: https://issues.apache.org/jira/browse/SPARK-45369
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Yuming Wang
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768399#comment-17768399 ]

Yuming Wang commented on SPARK-45282:
-------------------------------------

cc [~ulysses] [~cloud_fan]

> Join loses records for cached datasets
> --------------------------------------
>
>                 Key: SPARK-45282
>                 URL: https://issues.apache.org/jira/browse/SPARK-45282
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>         Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or databricks 13.3
>            Reporter: koert kuipers
>            Priority: Major
>              Labels: CorrectnessBug, correctness
>
> We observed this issue on Spark 3.4.1, and it is also present on 3.5.0. It is not present on Spark 3.3.1.
> It only shows up in a distributed environment; I cannot replicate it in a unit test. However, I did get it to show up on a Hadoop cluster, on Kubernetes, and on Databricks 13.3.
> The issue is that records are dropped when two cached dataframes are joined. It seems that in the Spark 3.4.1 query plan some Exchanges are dropped as an optimization, while in Spark 3.3.1 these Exchanges are still present. It seems to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true.
> To reproduce on a distributed cluster, these settings are needed:
> {code:java}
> spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
> spark.sql.adaptive.coalescePartitions.parallelismFirst false
> spark.sql.adaptive.enabled true
> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true
> {code}
> Scala code to reproduce:
> {code:java}
> import java.util.UUID
> import org.apache.spark.sql.functions.col
> import spark.implicits._
>
> val data = (1 to 1000000).toDS().map(i => UUID.randomUUID().toString).persist()
> val left = data.map(k => (k, 1))
> val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!
> println("number of left " + left.count())
> println("number of right " + right.count())
> println("number of (left join right) " +
>   left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count())
>
> val left1 = left
>   .toDF("key", "value1")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of left1 " + left1.count())
>
> val right1 = right
>   .toDF("key", "value2")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of right1 " + right1.count())
>
> println("number of (left1 join right1) " + left1.join(right1, "key").count()) // this gives incorrect result
> {code}
> This produces the following output:
> {code:java}
> number of left 1000000
> number of right 1000000
> number of (left join right) 1000000
> number of left1 1000000
> number of right1 1000000
> number of (left1 join right1) 859531
> {code}
> Note that the last (incorrect) number varies depending on settings, cluster size, etc.
[jira] [Resolved] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-43406.
---------------------------------
    Resolution: Duplicate

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Target Version/s:   (was: 4.0.0)

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Fix Version/s:   (was: 3.5.0)

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Updated] (SPARK-43406) enable spark sql to drop multiple partitions in one call
[ https://issues.apache.org/jira/browse/SPARK-43406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-43406:
--------------------------------
    Target Version/s: 4.0.0

> enable spark sql to drop multiple partitions in one call
> --------------------------------------------------------
>
>                 Key: SPARK-43406
>                 URL: https://issues.apache.org/jira/browse/SPARK-43406
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1, 3.3.2, 3.4.0
>            Reporter: chenruotao
>            Priority: Major
>
> Spark SQL currently cannot drop multiple partitions in one call. With this patch we can drop multiple partitions like this:
> alter table test.table_partition drop partition(dt<='2023-04-02', dt>='2023-03-31')
[jira] [Resolved] (SPARK-45089) Remove obsolete repo of DB2 JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-45089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang resolved SPARK-45089.
---------------------------------
    Fix Version/s: 4.0.0
         Assignee: Cheng Pan
       Resolution: Fixed

Issue resolved by pull request 42820
https://github.com/apache/spark/pull/42820

> Remove obsolete repo of DB2 JDBC driver
> ---------------------------------------
>
>                 Key: SPARK-45089
>                 URL: https://issues.apache.org/jira/browse/SPARK-45089
>             Project: Spark
>          Issue Type: Test
>          Components: Build, Tests
>    Affects Versions: 4.0.0
>            Reporter: Cheng Pan
>            Assignee: Cheng Pan
>            Priority: Major
>             Fix For: 4.0.0
>
[jira] [Updated] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45071: Fix Version/s: 3.5.1 (was: 3.5.0) > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-45071. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42804 [https://github.com/apache/spark/pull/42804] > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45071) Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-45071: --- Assignee: ming95 > Optimize the processing speed of `BinaryArithmetic#dataType` when processing > multi-column data > -- > > Key: SPARK-45071 > URL: https://issues.apache.org/jira/browse/SPARK-45071 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: ming95 >Assignee: ming95 >Priority: Major > > Since `BinaryArithmetic#dataType` recursively processes the data type of > each node, the driver becomes very slow when many columns are processed. > For example, the following code: > {code:scala} > import org.apache.spark.sql.Row > import spark.implicits._ > import scala.util.Random > import org.apache.spark.sql.functions.{expr, sum} > import org.apache.spark.sql.types.{StructType, StructField, IntegerType} > val N = 30 > val M = 100 > val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) > val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) > val schema = StructType(columns.map(StructField(_, IntegerType))) > val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) > val df = spark.createDataFrame(rdd, schema) > val colExprs = columns.map(sum(_)) > // generate a new column by adding up the other 30 columns > df.withColumn("new_col_sum", expr(columns.mkString(" + "))) > {code} > > This code takes a few minutes for the driver to execute on Spark 3.4, but only a few seconds on Spark 3.2. > Related issue: SPARK-39316 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
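The slowdown described in SPARK-45071 comes from re-deriving types over the whole expression subtree on every `dataType` call. A minimal sketch, assuming a simplified stand-in for Catalyst expression trees (not Spark code), of how memoizing the derived type (Scala's `lazy val`) changes the number of type computations when the analyzer repeatedly asks for the root's type:

```python
# Sketch: uncached vs memoized recursive data-type derivation
# over a left-deep chain col1 + col2 + ... (hypothetical model).

class Expr:
    computations = 0  # global counter of type derivations

    def __init__(self, left=None, right=None):
        self.left, self.right = left, right
        self._cached_type = None

    def data_type(self, memoize):
        if memoize and self._cached_type is not None:
            return self._cached_type          # lazy-val-style reuse
        Expr.computations += 1
        if self.left is None:                 # leaf column
            t = "int"
        else:                                 # binary arithmetic node
            lt = self.left.data_type(memoize)
            rt = self.right.data_type(memoize)
            t = lt if lt == rt else "double"
        self._cached_type = t
        return t

def chain(n):
    """Build a left-deep chain of n additions over n + 1 leaf columns."""
    e = Expr()
    for _ in range(n):
        e = Expr(e, Expr())
    return e

# Simulate many analysis passes each asking for the root's type.
for memoize in (False, True):
    Expr.computations = 0
    root = chain(30)                 # 30 Adds + 31 leaves = 61 nodes
    for _ in range(100):
        root.data_type(memoize)
    print(memoize, Expr.computations)
# prints: False 6100, then True 61
```

Without memoization every call re-walks all 61 nodes, so cost scales with passes times tree size; with it, each node derives its type exactly once.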
[jira] [Updated] (SPARK-45020) org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'default' not found (state=08S01,code=0)
[ https://issues.apache.org/jira/browse/SPARK-45020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-45020: Fix Version/s: (was: 3.1.0) > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database > 'default' not found (state=08S01,code=0) > - > > Key: SPARK-45020 > URL: https://issues.apache.org/jira/browse/SPARK-45020 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Sruthi Mooriyathvariam >Priority: Minor > > There is an alert that fires up when a Spark 3.1 cluster is created using > shared metastore with Spark 2.4. The alert says DefaultDatabase does not > exist. This is misleading and thus we need to suppress this alert. > In the class SessionCatalog.scala, the method requireDbExists() is not > handling the case when the db = defaultDB. This needs to be added to suppress > this misleading alert. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44846: --- Assignee: zhuml > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) > at > org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:92) > at > org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:92) > at
[jira] [Resolved] (SPARK-44846) PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
[ https://issues.apache.org/jira/browse/SPARK-44846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44846. - Fix Version/s: 3.5.0 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 42633 [https://github.com/apache/spark/pull/42633] > PushFoldableIntoBranches in complex grouping expressions may cause > bindReference error > -- > > Key: SPARK-44846 > URL: https://issues.apache.org/jira/browse/SPARK-44846 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: zhuml >Assignee: zhuml >Priority: Major > Fix For: 3.5.0, 4.0.0, 3.4.2 > > > SQL: > {code:java} > select c*2 as d from > (select if(b > 1, 1, b) as c from > (select if(a < 0, 0 ,a) as b from t group by b) t1 > group by c) t2 {code} > ERROR: > {code:java} > Couldn't find _groupingexpression#15 in [if ((_groupingexpression#15 > 1)) 1 > else _groupingexpression#15#16] > java.lang.IllegalStateException: Couldn't find _groupingexpression#15 in [if > ((_groupingexpression#15 > 1)) 1 else _groupingexpression#15#16] > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren(TreeNode.scala:1272) > at > org.apache.spark.sql.catalyst.trees.TernaryLike.mapChildren$(TreeNode.scala:1271) > at > org.apache.spark.sql.catalyst.expressions.If.mapChildren(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1215) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1214) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:533) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:466) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:94) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:360) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:538) > at > org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce(AggregateCodegenSupport.scala:69) > at > 
org.apache.spark.sql.execution.aggregate.AggregateCodegenSupport.doProduce$(AggregateCodegenSupport.scala:65) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:49) > at > org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:97) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) >
[jira] [Updated] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44892: Fix Version/s: (was: 4.0.0) > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44892. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 54 [https://github.com/apache/spark-docker/pull/54] > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
[ https://issues.apache.org/jira/browse/SPARK-44892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44892: --- Assignee: Yuming Wang > Add official image Dockerfile for Spark 3.3.3 > - > > Key: SPARK-44892 > URL: https://issues.apache.org/jira/browse/SPARK-44892 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.3 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44892) Add official image Dockerfile for Spark 3.3.3
Yuming Wang created SPARK-44892: --- Summary: Add official image Dockerfile for Spark 3.3.3 Key: SPARK-44892 URL: https://issues.apache.org/jira/browse/SPARK-44892 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.3.3 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44813) The JIRA Python misses our assignee when it searches user again
[ https://issues.apache.org/jira/browse/SPARK-44813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44813: Fix Version/s: 3.3.4 (was: 3.3.3) > The JIRA Python misses our assignee when it searches user again > --- > > Key: SPARK-44813 > URL: https://issues.apache.org/jira/browse/SPARK-44813 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > > {code:python} > >>> assignee = asf_jira.user("yao") > >>> "SPARK-44801" > 'SPARK-44801' > >>> asf_jira.assign_issue(issue.key, assignee.name) > response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' > cannot be assigned issues."}} {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons
[ https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44857: Fix Version/s: 3.3.4 (was: 3.3.3) > Fix getBaseURI error in Spark Worker LogPage UI buttons > --- > > Key: SPARK-44857 > URL: https://issues.apache.org/jira/browse/SPARK-44857 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44880: --- Assignee: Kent Yao > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44880. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42571 [https://github.com/apache/spark/pull/42571] > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info
[ https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44880: Fix Version/s: 3.5.1 (was: 3.5.0) > Remove unnecessary curly braces at the end of the thread locks info > --- > > Key: SPARK-44880 > URL: https://issues.apache.org/jira/browse/SPARK-44880 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0, 3.5.1 > > > Remove unnecessary curly braces at the end of the thread locks info -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44792. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42474 [https://github.com/apache/spark/pull/42474] > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44792: --- Assignee: Yuming Wang > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 https://issues.apache.org/jira/browse/HADOOP-18515 was:https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 > https://issues.apache.org/jira/browse/HADOOP-18515 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44792) Upgrade curator to 5.2.0
[ https://issues.apache.org/jira/browse/SPARK-44792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44792: Description: https://issues.apache.org/jira/browse/HADOOP-17612 > Upgrade curator to 5.2.0 > > > Key: SPARK-44792 > URL: https://issues.apache.org/jira/browse/SPARK-44792 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > > https://issues.apache.org/jira/browse/HADOOP-17612 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44792) Upgrade curator to 5.2.0
Yuming Wang created SPARK-44792: --- Summary: Upgrade curator to 5.2.0 Key: SPARK-44792 URL: https://issues.apache.org/jira/browse/SPARK-44792 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Fix Version/s: 3.3.0 > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > Fix For: 3.3.0 > > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44700. - Resolution: Fixed Please upgrade Spark to the latest version to fix this issue. > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44700) Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace)
[ https://issues.apache.org/jira/browse/SPARK-44700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44700: Affects Version/s: 3.1.1 (was: 3.4.0) (was: 3.4.1) > Rule OptimizeCsvJsonExprs should not be applied to expression like > from_json(regexp_replace) > > > Key: SPARK-44700 > URL: https://issues.apache.org/jira/browse/SPARK-44700 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: jiahong.li >Priority: Minor > > _SQL_ like the following: > select tmp.* > from > (select > device_id, ads_id, > from_json(regexp_replace(device_personas, '(?<=(\\{|,))"device_', > '"user_device_'), ${device_schema}) as tmp > from input ) > ${device_schema} includes more than 100 fields. > If the rule OptimizeCsvJsonExprs is applied, the regexp_replace expression > will be invoked many times, which is very costly. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
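The cost blow-up behind SPARK-44700 can be sketched without Spark: when a rewrite turns one `from_json(expensive_input)` followed by a projection into a per-field parse, the expensive inner expression runs once per referenced field instead of once per row. A minimal sketch, assuming a simplified model (plain Python `json`/`re` standing in for Catalyst expressions, and a simpler substitution than the lookbehind regex in the issue):

```python
import json
import re

CALLS = {"regexp_replace": 0}  # count evaluations of the inner expression

def expensive_input(raw):
    """Stand-in for regexp_replace(device_personas, ...)."""
    CALLS["regexp_replace"] += 1
    return re.sub(r'"device_', '"user_device_', raw)

def parse_all_fields(raw, fields):
    """Original plan: evaluate the input once, parse once, project fields."""
    parsed = json.loads(expensive_input(raw))
    return {f: parsed[f] for f in fields}

def parse_per_field(raw, fields):
    """After the undesirable rewrite: input re-evaluated for every field."""
    return {f: json.loads(expensive_input(raw))[f] for f in fields}

row = '{"device_id": 1, "device_os": "x", "device_ver": 2}'
fields = ["user_device_id", "user_device_os", "user_device_ver"]

parse_all_fields(row, fields)
print(CALLS["regexp_replace"])   # 1 evaluation
parse_per_field(row, fields)
print(CALLS["regexp_replace"])   # 1 + 3 = 4 evaluations
```

With a schema of 100+ fields, as in the report, the rewritten form multiplies the regex cost by the number of fields per row, which is why the rule should skip such inputs.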
[jira] [Commented] (SPARK-24087) Avoid shuffle when join keys are a super-set of bucket keys
[ https://issues.apache.org/jira/browse/SPARK-24087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752239#comment-17752239 ] Yuming Wang commented on SPARK-24087: - Fixed by SPARK-35703. > Avoid shuffle when join keys are a super-set of bucket keys > --- > > Key: SPARK-24087 > URL: https://issues.apache.org/jira/browse/SPARK-24087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Labels: bulk-closed > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752023#comment-17752023 ] Yuming Wang commented on SPARK-44719: - There are two ways to fix it: 1. Upgrade the built-in hive to 2.3.10 with the following patch. 2. Revert SPARK-43225. https://github.com/apache/hive/pull/4562 https://github.com/apache/hive/pull/4563 https://github.com/apache/hive/pull/4564 > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Description: How to reproduce: {noformat} spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... {noformat} was: How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... 
``` > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44719: Attachment: HiveUDFs-1.0-SNAPSHOT.jar > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > ``` > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44719) NoClassDefFoundError when using Hive UDF
Yuming Wang created SPARK-44719: --- Summary: NoClassDefFoundError when using Hive UDF Key: SPARK-44719 URL: https://issues.apache.org/jira/browse/SPARK-44719 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 3.5.0 Reporter: Yuming Wang Attachments: HiveUDFs-1.0-SNAPSHOT.jar How to reproduce: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-42500. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42038 [https://github.com/apache/spark/pull/42038] > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-42500: --- Assignee: Tongwei > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Tongwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Target Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In the case of nested broadcast joins, if the datasource is column-vector > oriented, then what Spark gets is a ColumnarBatch. But because scans > have filters from multiple joins, these can be retrieved and applied in the > code generated at the ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastHashJoins (whose keys have been pushed) will be used for join > evaluation. > The code is already there; a PR will be opened. For a non-partitioned-table > TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a > 15% gain. > For partitioned-table TPCDS, there is an improvement in 4-5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44662: Fix Version/s: (was: 3.3.3) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.3 >Reporter: Asif >Priority: Major > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. 
> This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. 
*Single Row Filtration* > 5) In the case of nested broadcast joins, if the datasource is column-vector > oriented, then what Spark gets is a ColumnarBatch. But because scans > have filters from multiple joins, these can be retrieved and applied in the > code generated at the ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastHashJoins (whose keys have been pushed) will be used for join > evaluation. > The code is already there; a PR will be opened. For a non-partitioned-table > TPCDS run on a laptop with TPCDS data at scale factor 4, I am seeing a > 15% gain. > For partitioned-table TPCDS, there is an improvement in 4-5 queries to the tune > of 10% to 37%. > h2. *Q5. Who cares? If you are successful, what difference will it make?* > If use cases involve m
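The pruning idea in the SPIP above — broadcast-side join keys arriving as a sorted set and being checked against per-chunk min/max statistics — can be sketched in a few lines. This is a conceptual sketch under stated assumptions, not the proposed Spark implementation: integer keys, and each chunk (manifest file, Parquet row group, etc.) exposing (min, max) stats.

```python
# A chunk can be skipped when no broadcast key falls inside its
# [min, max] range; a sorted key list makes the check a binary search.
import bisect

def chunk_may_match(sorted_keys: list, lo: int, hi: int) -> bool:
    """True iff some key k satisfies lo <= k <= hi."""
    i = bisect.bisect_left(sorted_keys, lo)  # first key >= lo
    return i < len(sorted_keys) and sorted_keys[i] <= hi

broadcast_keys = sorted({15, 42, 97})        # keys from the broadcast side
chunks = [(0, 10), (11, 20), (50, 90), (95, 100)]  # (min, max) stats

kept = [c for c in chunks if chunk_may_match(broadcast_keys, *c)]
print(kept)  # [(11, 20), (95, 100)]
```

The same check can run at the driver level against manifest stats and again at the executor level against row-group stats, which is why the SPIP expects pruning benefits at both layers.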
[jira] [Assigned] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44675: --- Assignee: Yuming Wang > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44675) Increase ReservedCodeCacheSize for release build
[ https://issues.apache.org/jira/browse/SPARK-44675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44675. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42344 [https://github.com/apache/spark/pull/42344] > Increase ReservedCodeCacheSize for release build > > > Key: SPARK-44675 > URL: https://issues.apache.org/jira/browse/SPARK-44675 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44675) Increase ReservedCodeCacheSize for release build
Yuming Wang created SPARK-44675: --- Summary: Increase ReservedCodeCacheSize for release build Key: SPARK-44675 URL: https://issues.apache.org/jira/browse/SPARK-44675 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44654) In subquery cannot perform partition pruning
[ https://issues.apache.org/jira/browse/SPARK-44654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750739#comment-17750739 ] Yuming Wang commented on SPARK-44654: - Another way is to convert the join to a filter if the maximum number of rows on one side is 1: https://github.com/apache/spark/pull/42114 > In subquery cannot perform partition pruning > > > Key: SPARK-44654 > URL: https://issues.apache.org/jira/browse/SPARK-44654 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: 7mming7 >Priority: Minor > Labels: performance > Attachments: image-2023-08-03-17-22-53-981.png > > > The following SQL cannot perform partition pruning > {code:java} > SELECT * FROM parquet_part WHERE id_type in (SELECT max(id_type) from > parquet_part){code} > As can be seen from the execution plan below, partition pruning of the left side > cannot be performed after the IN subquery is converted into a join > !image-2023-08-03-17-22-53-981.png! > This issue proposes to optimize IN subqueries: the subquery should be > converted into a join only when the number of values exceeds a threshold -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
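The join-to-filter idea from the comment on SPARK-44654 can be illustrated outside Spark. This is a plain-Python sketch, not Spark's optimizer: when the subquery side is guaranteed to produce at most one row (a global max here), `x IN (subquery)` can be evaluated as an equality filter, and a filter on the partition column prunes partitions without scanning their data. The partition layout and names are assumptions for illustration.

```python
# Partition value of id_type -> row payloads. Pruning inspects only
# the partition values, never the payloads.
partitions = {
    "t1": ["row-a", "row-b"],
    "t2": ["row-c"],
    "t3": ["row-d"],
}

# Subquery side: SELECT max(id_type) FROM parquet_part -> exactly one value.
max_id_type = max(partitions)

# As a join, every partition would be scanned to match keys; as a filter
# on the partition key, only matching partitions survive.
pruned = [p for p in partitions if p == max_id_type]
print(pruned)  # ['t3']
```

The linked PR applies the same reasoning inside the optimizer: a join whose one side has `maxRows == 1` carries enough information to be rewritten into a prunable predicate.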
[jira] [Created] (SPARK-44651) Make do-release-docker.sh compatible with Mac m2
Yuming Wang created SPARK-44651: --- Summary: Make do-release-docker.sh compatible with Mac m2 Key: SPARK-44651 URL: https://issues.apache.org/jira/browse/SPARK-44651 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Yuming Wang How to test: {code:sh} dev/create-release/do-release-docker.sh -d /Users/yumwang/release-spark/output -s docs -n {code} Install python3-dev and build-essential: {code:sh} $APT_INSTALL python-is-python3 python3-pip python3-setuptools python3-dev build-essential {code} {noformat} Collecting grpcio==1.56.0 Downloading grpcio-1.56.0.tar.gz (24.3 MB) || 24.3 MB 6.7 MB/s ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qmfpon02/grpcio/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-qmfpon02/grpcio/pip-egg-info cwd: /tmp/pip-install-qmfpon02/grpcio/ Complete output (11 lines): Traceback (most recent call last): File "", line 1, in File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 263, in if check_linker_need_libatomic(): File "/tmp/pip-install-qmfpon02/grpcio/setup.py", line 210, in check_linker_need_libatomic cpp_test = subprocess.Popen(cxx + ['-x', 'c++', '-std=c++14', '-'], File "/usr/lib/python3.8/subprocess.py", line 858, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) FileNotFoundError: [Errno 2] No such file or directory: 'c++' ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output. ... Could not find . 
This could mean the following: * You're on Ubuntu and haven't run `apt-get install python3-dev`. * You're on RHEL/Fedora and haven't run `yum install python3-devel` or `dnf install python3-devel` (make sure you also have redhat-rpm-config installed) * You're on Mac OS X and the usual Python framework was somehow corrupted (check your environment variables or try re-installing?) * You're on Windows and your Python installation was somehow corrupted (check your environment variables or try re-installing?) {noformat} {noformat} #5 848.0 Successfully built grpcio future #5 848.0 Failed to build pyarrow #5 848.7 ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly {noformat} {noformat} root@c57ec74c8d32:/# $APT_INSTALL r-base r-base-dev Reading package lists... Done Building dependency tree Reading state information... Done Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies: r-base : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed Depends: r-recommended (= 4.3.1-3.2004.0) but it is not going to be installed r-base-dev : Depends: r-base-core (>= 4.3.1-3.2004.0) but it is not going to be installed E: Unable to correct problems, you have held broken packages. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38506) Push partial aggregation through join
[ https://issues.apache.org/jira/browse/SPARK-38506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38506: Description: Please see https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details. (was: Please see https://docs.teradata.com/r/Teradata-VantageTM-SQL-Request-and-Transaction-Processing/March-2019/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization for more details.) > Push partial aggregation through join > - > > Key: SPARK-38506 > URL: https://issues.apache.org/jira/browse/SPARK-38506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Please see > https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Request-and-Transaction-Processing/Join-Planning-and-Optimization/Partial-GROUP-BY-Block-Optimization > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44562: --- Assignee: Yuming Wang > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
[ https://issues.apache.org/jira/browse/SPARK-44562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44562. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42180 [https://github.com/apache/spark/pull/42180] > Add OptimizeOneRowRelationSubquery in batch of Subquery > --- > > Key: SPARK-44562 > URL: https://issues.apache.org/jira/browse/SPARK-44562 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44598. - Resolution: Not A Problem > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-44598: - > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44598) spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize is 0
[ https://issues.apache.org/jira/browse/SPARK-44598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749418#comment-17749418 ] Yuming Wang commented on SPARK-44598: - How to reproduce this issue? > spark 3.2+ can not read hive table with hbase serde when hbase StorefileSize > is 0 > -- > > Key: SPARK-44598 > URL: https://issues.apache.org/jira/browse/SPARK-44598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3 >Reporter: ming95 >Priority: Major > > We are using Spark to read a Hive table with HBase serde. We found that when the > HBase table data is relatively small (HBase StorefileSize is 0), the data > read by Spark 3.2 or 3.5 is empty, and there is no error message. > But when using Spark 2.4 or Hive to read, the data can be read normally. Other > information shows that Spark 3.1 can also read the data normally; can anyone > provide some ideas? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44454: --- Assignee: dzcxzl > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at 
scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
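The fallback described in this ticket follows a common pattern: attempt the newer Thrift call, and when an old metastore rejects it with "Invalid method name", fall back to listing all tables and filtering by type on the client side. A minimal Python sketch of that pattern, under stated assumptions (the `client` object and its `get_tables_by_type`/`get_table_objects` methods are hypothetical stand-ins for the Thrift calls in the stack trace; the actual fix lives in Spark's Scala `HiveShim`):

```python
def get_tables_by_type(client, db, pattern, table_type):
    """Fetch table names of a given type, with a fallback for old metastores.

    `client` is a hypothetical metastore client. Only two methods are
    assumed: `get_tables_by_type` (the newer call) and `get_table_objects`
    (the older call that returns all tables with their types).
    """
    try:
        return client.get_tables_by_type(db, pattern, table_type)
    except Exception as e:
        # Old metastores respond with "Invalid method name:
        # 'get_tables_by_type'"; anything else is a real error.
        if "Invalid method name" not in str(e):
            raise
        # Fallback: list all tables and filter by type client-side.
        return [t["name"] for t in client.get_table_objects(db, pattern)
                if t["type"] == table_type]
```

The trade-off is that the fallback fetches every table object for the database, so it is slower against large databases, but it keeps `SHOW VIEWS` working against old metastores.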
[jira] [Resolved] (SPARK-44454) HiveShim getTablesByType support fallback
[ https://issues.apache.org/jira/browse/SPARK-44454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44454. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42033 [https://github.com/apache/spark/pull/42033] > HiveShim getTablesByType support fallback > - > > Key: SPARK-44454 > URL: https://issues.apache.org/jira/browse/SPARK-44454 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Minor > Fix For: 4.0.0 > > > When we use a high version of Hive Client to communicate with a low version > of Hive meta store, we may encounter Invalid method name: > 'get_tables_by_type'. > > {code:java} > 23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views > 23/07/17 12:45:24,489 [main] ERROR log: Got exception: > org.apache.thrift.TApplicationException Invalid method name: > 'get_tables_by_type' > org.apache.thrift.TApplicationException: Invalid method name: > 'get_tables_by_type' > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344) > at com.sun.proxy.$Proxy23.getTables(Unknown Source) > at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274) > at > org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040) > at > 
org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44513) Upgrade snappy-java to 1.1.10.3
[ https://issues.apache.org/jira/browse/SPARK-44513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44513: Fix Version/s: 3.4.2 > Upgrade snappy-java to 1.1.10.3 > --- > > Key: SPARK-44513 > URL: https://issues.apache.org/jira/browse/SPARK-44513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.1 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Trivial > Fix For: 3.4.2, 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44562) Add OptimizeOneRowRelationSubquery in batch of Subquery
Yuming Wang created SPARK-44562: --- Summary: Add OptimizeOneRowRelationSubquery in batch of Subquery Key: SPARK-44562 URL: https://issues.apache.org/jira/browse/SPARK-44562 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44466. - Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42049 [https://github.com/apache/spark/pull/42049] > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0, 4.0.0 > > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44466: --- Assignee: Yuming Wang > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
[ https://issues.apache.org/jira/browse/SPARK-44527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746488#comment-17746488 ] Yuming Wang commented on SPARK-44527: - https://github.com/apache/spark/pull/42129 > Simplify BinaryComparison if its children contain ScalarSubquery with empty > output > -- > > Key: SPARK-44527 > URL: https://issues.apache.org/jira/browse/SPARK-44527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44527) Simplify BinaryComparison if its children contain ScalarSubquery with empty output
Yuming Wang created SPARK-44527: --- Summary: Simplify BinaryComparison if its children contain ScalarSubquery with empty output Key: SPARK-44527 URL: https://issues.apache.org/jira/browse/SPARK-44527 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44523) Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44523: Summary: Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral (was: Filter's maxRows should be 0 if condition is FalseLiteral) > Filter's maxRows/maxRowsPerPartition is 0 if condition is FalseLiteral > -- > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
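The idea behind this change is simple to model: a `Filter` whose condition is the literal `false` can never emit a row, so both its `maxRows` and `maxRowsPerPartition` bounds can be tightened to 0 instead of inherited from the child. A toy Python model of that propagation (the class names and fields here are illustrative, not Spark's Catalyst API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    max_rows: Optional[int]  # None means the bound is unknown


@dataclass
class Filter:
    condition: object  # the Python literal False stands in for FalseLiteral
    child: Relation

    @property
    def max_rows(self) -> Optional[int]:
        # A filter with a literally-false condition emits no rows, so the
        # bound is exactly 0; otherwise the child's bound is the best we know.
        if self.condition is False:
            return 0
        return self.child.max_rows
```

Tightening the bound matters because later rules (such as the single-row join rewrite in SPARK-44514) key off these `maxRows` estimates.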
[jira] [Commented] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
[ https://issues.apache.org/jira/browse/SPARK-44523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746290#comment-17746290 ] Yuming Wang commented on SPARK-44523: - https://github.com/apache/spark/pull/42126 > Filter's maxRows should be 0 if condition is FalseLiteral > - > > Key: SPARK-44523 > URL: https://issues.apache.org/jira/browse/SPARK-44523 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44523) Filter's maxRows should be 0 if condition is FalseLiteral
Yuming Wang created SPARK-44523: --- Summary: Filter's maxRows should be 0 if condition is FalseLiteral Key: SPARK-44523 URL: https://issues.apache.org/jira/browse/SPARK-44523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44514) Optimize join if maximum number of rows on one side is 1
[ https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44514: Summary: Optimize join if maximum number of rows on one side is 1 (was: Rewrite the join to filter if one side maximum number of rows is 1) > Optimize join if maximum number of rows on one side is 1 > > > Key: SPARK-44514 > URL: https://issues.apache.org/jira/browse/SPARK-44514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
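The optimization this ticket describes exploits a cardinality bound: when one side of a join is known to produce at most one row, the join degenerates into a filter (plus appending that row's values) over the other side, with no pairwise matching needed. A conceptual Python sketch over plain lists, assuming the single-row side is the right side (function and parameter names are illustrative, not Spark's optimizer API):

```python
def join_single_row(left_rows, right_rows, predicate):
    """Inner join where the right side has at most one row.

    Instead of matching every left row against every right row, fetch the
    single right row once and filter the left side with it. `predicate`
    takes (left_row, right_row) and returns a bool; rows are tuples.
    """
    assert len(right_rows) <= 1, "rewrite only valid when maxRows <= 1"
    if not right_rows:
        return []  # inner join with an empty side is empty
    r = right_rows[0]
    # A single pass over the left side: filter, then append r's columns.
    return [l + r for l in left_rows if predicate(l, r)]
```

In a distributed engine the payoff is that the single-row side can be evaluated once and broadcast as literals, so no shuffle of the large side is needed.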
[jira] [Commented] (SPARK-44514) Rewrite the join to filter if one side maximum number of rows is 1
[ https://issues.apache.org/jira/browse/SPARK-44514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746143#comment-17746143 ] Yuming Wang commented on SPARK-44514: - https://github.com/apache/spark/pull/42114 > Rewrite the join to filter if one side maximum number of rows is 1 > -- > > Key: SPARK-44514 > URL: https://issues.apache.org/jira/browse/SPARK-44514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44514) Rewrite the join to filter if one side maximum number of rows is 1
Yuming Wang created SPARK-44514: --- Summary: Rewrite the join to filter if one side maximum number of rows is 1 Key: SPARK-44514 URL: https://issues.apache.org/jira/browse/SPARK-44514 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Summary: Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX from modifiedConfigs (was: Update initialSessionOptions to the value after supplementation) > Exclude configs starting with SPARK_DRIVER_PREFIX and SPARK_EXECUTOR_PREFIX > from modifiedConfigs > > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Attachment: before.png > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44493) Extract pushable predicates from disjunctive predicates
Yuming Wang created SPARK-44493: --- Summary: Extract pushable predicates from disjunctive predicates Key: SPARK-44493 URL: https://issues.apache.org/jira/browse/SPARK-44493 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Yuming Wang Attachments: after.png, before.png Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. Before: After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
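The extraction described above relies on a boolean identity: if every branch of an OR constrains the same column to some set of values, then the union of those sets is implied by the whole disjunction and can be pushed to the datasource as an extra IN filter. A simplified Python sketch, where each OR branch is modeled as a dict from column name to its allowed value set (this representation is an assumption for illustration, not Spark's predicate tree):

```python
def extract_pushable_in(branches, column):
    """Extract a pushable IN-list for `column` from an OR of AND-branches.

    `branches`: list of dicts mapping column -> set of allowed values,
    one dict per OR branch. Returns the union of allowed values if every
    branch constrains `column`, else None (no implied predicate exists,
    because one branch would leave the column unconstrained).
    """
    values = set()
    for conjuncts in branches:
        if column not in conjuncts:
            return None
        values |= conjuncts[column]
    return values
```

On the query above, every OR branch constrains `page_id`, so the union IN-list is pushable; `rdt = 0` is not, because the `page_id = 2403006` branch omits it.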
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Description: Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. Before: !before.png! After: !after.png! was: Example: {code:sql} select count(*) from db.very_large_table where session_start_dt between date_sub('2023-07-15', 1) and date_add('2023-07-16', 1) and type = 'event' and date(event_timestamp) between '2023-07-15' and '2023-07-16' and ( ( page_id in (2627, 2835, 2402999) and -- other predicates and rdt = 0 ) or ( page_id in (2616, 3411350) and rdt = 0 ) or ( page_id = 2403006 ) or ( page_id in (2208336, 2356359) and -- other predicates and rdt = 0 ) ) {code} We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, 2208336, 2356359)}} to datasource. 
Before: After: > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > !before.png! > After: > !after.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44493) Extract pushable predicates from disjunctive predicates
[ https://issues.apache.org/jira/browse/SPARK-44493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44493: Attachment: after.png > Extract pushable predicates from disjunctive predicates > --- > > Key: SPARK-44493 > URL: https://issues.apache.org/jira/browse/SPARK-44493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > Example: > {code:sql} > select count(*) > from > db.very_large_table > where > session_start_dt between date_sub('2023-07-15', 1) and > date_add('2023-07-16', 1) > and type = 'event' > and date(event_timestamp) between '2023-07-15' and '2023-07-16' > and ( > ( > page_id in (2627, 2835, 2402999) > and -- other predicates > and rdt = 0 > ) or ( > page_id in (2616, 3411350) > and rdt = 0 > ) or ( > page_id = 2403006 > ) or ( > page_id in (2208336, 2356359) > and -- other predicates > and rdt = 0 > ) > ) > {code} > We can push down {{page_id in(2627, 2835, 2402999, 2616, 3411350, 2403006, > 2208336, 2356359)}} to datasource. > Before: > After: -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744040#comment-17744040 ] Yuming Wang commented on SPARK-44466: - https://github.com/apache/spark/pull/42049 > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Description: Should not include this value: !screenshot-1.png! > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > > Should not include this value: > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44466) Update initialSessionOptions to the value after supplementation
[ https://issues.apache.org/jira/browse/SPARK-44466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44466: Attachment: screenshot-1.png > Update initialSessionOptions to the value after supplementation > --- > > Key: SPARK-44466 > URL: https://issues.apache.org/jira/browse/SPARK-44466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Yuming Wang >Priority: Major > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44466) Update initialSessionOptions to the value after supplementation
Yuming Wang created SPARK-44466: --- Summary: Update initialSessionOptions to the value after supplementation Key: SPARK-44466 URL: https://issues.apache.org/jira/browse/SPARK-44466 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Yuming Wang Attachments: screenshot-1.png -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17743592#comment-17743592 ] Yuming Wang commented on SPARK-44448: - cc [~beliefer] > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. 
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
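The described bug and fix are easy to model outside Spark: a dense-rank group-limit iterator keeps two pieces of state, the current rank and the last ordering value seen, and `reset` must clear both at each partition boundary. A minimal Python model of the *fixed* behavior, using the repro data from the ticket (class and function names here are illustrative, not Spark's `DenseRankLimitIterator` itself):

```python
class DenseRankLimit:
    """Model of a dense_rank <= k group-limit iterator (fixed version)."""

    def __init__(self, limit):
        self.limit = limit
        self.rank = 0
        self.current_row = None  # last ordering value seen in this partition

    def reset(self):
        # The fix: clear BOTH fields at a new window partition. The buggy
        # version cleared only `rank`, so the first row of a new partition
        # was compared against the last row of the previous one.
        self.rank = 0
        self.current_row = None

    def accept(self, order_value):
        # Dense rank increments only when the ordering value changes.
        if self.current_row is None or order_value != self.current_row:
            self.rank += 1
            self.current_row = order_value
        return self.rank <= self.limit


def top_k_dense_rank(rows, limit):
    """rows: (partition, order) tuples sorted by partition then order.
    Returns the rows whose dense_rank within their partition is <= limit."""
    it = DenseRankLimit(limit)
    out, prev_p = [], object()
    for p, o in rows:
        if p != prev_p:
            it.reset()
            prev_p = p
        if it.accept(o):
            out.append((p, o))
    return out
```

With `reset` clearing both fields, the repro's second partition correctly yields both of its rank-1 rows; dropping the `current_row = None` line reproduces the reported wrong result.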
[jira] [Updated] (SPARK-44448) Wrong results for dense_rank() <= k from InferWindowGroupLimit and DenseRankLimitIterator
[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44448: Target Version/s: 3.5.0 > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > - > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. > ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). 
> Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org