[GitHub] [spark] SparkQA commented on pull request #32451: [SPARK-35144][SQL] Migrate to transformWithPruning for object rules

2021-05-07 Thread GitBox


SparkQA commented on pull request #32451:
URL: https://github.com/apache/spark/pull/32451#issuecomment-834119257






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #32454: [SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32454:
URL: https://github.com/apache/spark/pull/32454#issuecomment-834119543


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138226/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32407: [SPARK-35261][SQL] Support static magic method for stateless ScalarFunction

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32407:
URL: https://github.com/apache/spark/pull/32407#issuecomment-834119534









[GitHub] [spark] AmplabJenkins commented on pull request #32451: [SPARK-35144][SQL] Migrate to transformWithPruning for object rules

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32451:
URL: https://github.com/apache/spark/pull/32451#issuecomment-834119537


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42760/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32463: [SPARK-35147][SQL] Migrate to resolveWithPruning for two command rules

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32463:
URL: https://github.com/apache/spark/pull/32463#issuecomment-834119541


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138224/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32430: [SPARK-35133][SQL] Explain codegen works with AQE

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32430:
URL: https://github.com/apache/spark/pull/32430#issuecomment-834119536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138227/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32430: [SPARK-35133][SQL] Explain codegen works with AQE

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32430:
URL: https://github.com/apache/spark/pull/32430#issuecomment-834119536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138227/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32454: [SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32454:
URL: https://github.com/apache/spark/pull/32454#issuecomment-834119543


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138226/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32463: [SPARK-35147][SQL] Migrate to resolveWithPruning for two command rules

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32463:
URL: https://github.com/apache/spark/pull/32463#issuecomment-834119541


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138224/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32407: [SPARK-35261][SQL] Support static magic method for stateless ScalarFunction

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32407:
URL: https://github.com/apache/spark/pull/32407#issuecomment-834119534









[GitHub] [spark] AmplabJenkins removed a comment on pull request #32451: [SPARK-35144][SQL] Migrate to transformWithPruning for object rules

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32451:
URL: https://github.com/apache/spark/pull/32451#issuecomment-834119537


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42760/
   





[GitHub] [spark] viirya commented on a change in pull request #32301: [SPARK-35194][SQL] Refactor nested column aliasing for readability

2021-05-07 Thread GitBox


viirya commented on a change in pull request #32301:
URL: https://github.com/apache/spark/pull/32301#discussion_r627975154



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -17,71 +17,148 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
+import scala.collection.mutable
+
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 
 /**
- * This aims to handle a nested column aliasing pattern inside the `ColumnPruning` optimizer rule.
- * If a project or its child references to nested fields, and not all the fields
- * in a nested attribute are used, we can substitute them by alias attributes; then a project
- * of the nested fields as aliases on the children of the child will be created.
+ * This aims to handle a nested column aliasing pattern inside the [[ColumnPruning]] optimizer rule.
+ * If:
+ * - A [[Project]] or its child references nested fields
+ * - Not all of the fields in a nested attribute are used
+ * Then:
+ * - Substitute the nested field references with alias attributes
+ * - Add grandchild [[Project]]s transforming the nested fields to aliases
+ *
+ * Example 1: Project
+ * --
+ * Before:
+ * +- Project [concat_ws(s#0.a, s#0.b) AS concat_ws(s.a, s.b)#1]
+ *   +- GlobalLimit 5
+ *     +- LocalLimit 5
+ *       +- LocalRelation <empty>, [s#0]
+ * After:
+ * +- Project [concat_ws(_gen_alias_2#2, _gen_alias_3#3) AS concat_ws(s.a, s.b)#1]
+ *   +- GlobalLimit 5
+ *     +- LocalLimit 5
+ *       +- Project [s#0.a AS _gen_alias_2#2, s#0.b AS _gen_alias_3#3]
+ *         +- LocalRelation <empty>, [s#0]
+ *
+ * Example 2: Project above Filter
+ * ---
+ * Before:
+ * +- Project [s#0.a AS s.a#1]
+ *   +- Filter (length(s#0.b) > 2)
+ *     +- GlobalLimit 5
+ *       +- LocalLimit 5
+ *         +- LocalRelation <empty>, [s#0]
+ * After:
+ * +- Project [_gen_alias_2#2 AS s.a#1]
+ *   +- Filter (length(_gen_alias_3#3) > 2)
+ *     +- GlobalLimit 5
+ *       +- LocalLimit 5
+ *         +- Project [s#0.a AS _gen_alias_2#2, s#0.b AS _gen_alias_3#3]
+ *           +- LocalRelation <empty>, [s#0]
+ *
+ * Example 3: Nested columns in nested columns
+ * ---
+ * Before:
+ * +- Project [s#0.a AS s.a#1, s#0.a.a1 AS s.a.a1#2]
+ *   +- GlobalLimit 5
+ *     +- LocalLimit 5
+ *       +- LocalRelation <empty>, [s#0]
+ * After:
+ * +- Project [_gen_alias_3#3 AS s.a#1, _gen_alias_3#3.name AS s.a.a1#2]
+ *   +- GlobalLimit 5
+ *     +- LocalLimit 5
+ *       +- Project [s#0.a AS _gen_alias_3#3]
+ *         +- LocalRelation <empty>, [s#0]
+ *
+ * The schema of the datasource relation will be pruned in the [[SchemaPruning]] optimizer rule.
  */
 object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     /**
      * This pattern is needed to support [[Filter]] plan cases like
-     * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
-     * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+     * [[Project]]->[[Filter]]->listed plan in [[canProjectPushThrough]] (e.g., [[Window]]).
+     * The reason why we don't simply add [[Filter]] in [[canProjectPushThrough]] is that
      * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
      */
-    case Project(projectList, Filter(condition, child))
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-      val exprCandidatesToPrune = projectList ++ Seq(condition) ++ child.expressions
-      getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq).map {
-        case (nestedFieldToAlias, attrToAliases) =>
-          NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
-      }
+    case Project(projectList, Filter(condition, child)) if
+        SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+      rewritePlanIfSubsetFieldsUsed(
+        plan, projectList ++ Seq(condition) ++ child.expressions, child.producedAttributes.toSeq)
 
-    case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
-      val exprCandidatesToPrune = projectList ++ child.expressions
-      getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq).map {
-        case (nestedFieldToAlias, attrToAliases) =>
-          NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
-      }
+    case Project(projectList, child) if
+        SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+      rewritePlanIfSubsetFieldsUsed(
+        plan, projectList ++ child.expressions, child.producedAttributes.toSeq)
 
     case p if SQLConf.get.nestedSchemaPruningEnabled && canPruneOn(p) =>
-   
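
The If/Then condition in the proposed scaladoc above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not Spark's implementation: `FieldRef`, `aliasesFor`, and the `schema` map are hypothetical names introduced for illustration, and the alias numbering is simpler than the `_gen_alias_N` ids that Catalyst generates.

```scala
object NestedAliasSketch {
  // Hypothetical model: a nested field reference such as s.a is a (root, field) pair.
  final case class FieldRef(root: String, field: String)

  // Returns an alias for each referenced nested field, but only for roots where
  // a strict subset of the fields is used (the "If" conditions in the scaladoc).
  def aliasesFor(used: Seq[FieldRef], schema: Map[String, Set[String]]): Map[FieldRef, String] = {
    val refs = used.distinct
    val prunableRoots = refs.groupBy(_.root).filter { case (root, rs) =>
      rs.size < schema.getOrElse(root, Set.empty[String]).size
    }.keySet
    refs.filter(r => prunableRoots.contains(r.root))
      .zipWithIndex
      .map { case (ref, i) => ref -> s"_gen_alias_$i" }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val schema = Map("s" -> Set("a", "b", "c"))
    // Only s.a and s.b are referenced, so both receive an alias.
    println(aliasesFor(Seq(FieldRef("s", "a"), FieldRef("s", "b")), schema))
    // Every field of s is referenced, so nothing is aliased.
    println(aliasesFor(Seq(FieldRef("s", "a"), FieldRef("s", "b"), FieldRef("s", "c")), schema))
  }
}
```

In the real rule the aliases are then placed in a new `Project` below the intermediate operators, as the Before/After plans in the scaladoc show; the sketch covers only the decision of which fields to alias.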

[GitHub] [spark] AmplabJenkins commented on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834120459


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42761/
   





[GitHub] [spark] SparkQA commented on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


SparkQA commented on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834120433


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42761/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834120459


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42761/
   





[GitHub] [spark] cloud-fan commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r627978970



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
   checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
   checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
     Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+  Seq(3, 20).foreach { threshold =>
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {

Review comment:
   Shall we update the conf doc of `PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD`? We have a new feature now.
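
   To make the role of the threshold concrete, here is a minimal sketch of one plausible reading of the behavior under review. It is an assumption, not the PR's code: at or below the threshold an `IN` list is expanded into OR-ed equality checks, and above it the list is approximated by the min/max range of its values (a superset, so the original `IN` would still have to be evaluated after the scan). `Pushed`, `Eq`, `Or`, `Between`, and `pushDownIn` are hypothetical names, not Spark or Parquet classes.

```scala
object InFilterSketch {
  // Hypothetical, simplified filter model; these are not Spark or Parquet classes.
  sealed trait Pushed
  final case class Eq(value: Int) extends Pushed
  final case class Or(left: Pushed, right: Pushed) extends Pushed
  final case class Between(min: Int, max: Int) extends Pushed

  // Assumption: up to `threshold` distinct values, IN expands to OR-ed equality
  // checks; above it, IN is approximated by the [min, max] range of its values.
  def pushDownIn(values: Seq[Int], threshold: Int): Option[Pushed] = {
    val distinct = values.distinct
    if (distinct.isEmpty) None
    else if (distinct.size <= threshold) {
      // Left-fold the remaining values into a chain of Or nodes.
      Some(distinct.tail.foldLeft[Pushed](Eq(distinct.head))((acc, v) => Or(acc, Eq(v))))
    } else Some(Between(distinct.min, distinct.max))
  }

  def main(args: Array[String]): Unit = {
    // The two thresholds exercised by the test in the diff above.
    Seq(3, 20).foreach { threshold =>
      println(s"threshold=$threshold -> ${pushDownIn(Seq(1, 2, 3, 4, 5), threshold)}")
    }
  }
}
```

   With five values, a threshold of 3 takes the range branch and a threshold of 20 takes the OR branch, which is presumably why the test loops over both settings.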







[GitHub] [spark] maropu commented on pull request #32462: [SPARK-34795][SPARK-35192][SPARK-35293][SQL][TESTS][3.1] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-05-07 Thread GitBox


maropu commented on pull request #32462:
URL: https://github.com/apache/spark/pull/32462#issuecomment-834125353


   > We are not going to bring SPARK-35327, right? If you want SPARK-35327 too, let's hold on this PR.
   
   Yea, right. That should be included in this PR, too.





[GitHub] [spark] sharkdtu commented on pull request #32456: [SPARK-35328][Core] Use 'SPARK_DRIVER_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-07 Thread GitBox


sharkdtu commented on pull request #32456:
URL: https://github.com/apache/spark/pull/32456#issuecomment-834125785


   @HyukjinKwon Thanks, the PR description has been updated and the test failure has been fixed.





[GitHub] [spark] viirya commented on a change in pull request #32446: [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client

2021-05-07 Thread GitBox


viirya commented on a change in pull request #32446:
URL: https://github.com/apache/spark/pull/32446#discussion_r627980973



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
##
@@ -1316,6 +1320,13 @@ private[client] class Shim_v2_1 extends Shim_v2_0 {
   override def alterPartitions(hive: Hive, tableName: String, newParts: JList[Partition]): Unit = {
     alterPartitionsMethod.invoke(hive, tableName, newParts, environmentContextInAlterTable)
   }
+
+  // HIVE-10319 introduced a new HMS thrift API `get_all_functions` which is used by
+  // `Hive.get` since version 2.1.0, when it loads all Hive permanent functions during
+  // initialization. This breaks compatibility with HMS server of lower versions.
+  // To mitigate here we use `Hive.getWithFastCheck` instead which skips loading the permanent
+  // functions and therefore avoids calling `get_all_functions`.
+  override def getHive(hiveConf: HiveConf): Hive = Hive.getWithFastCheck(hiveConf, false)

Review comment:
   Oh, I thought it might be easy to overlook that we have a different `getHive` here when overriding `getHive` in another Shim. If we want to override it again, we would at least notice it through a compiler error.
   
   Not a strong opinion, anyway. It's okay with me as it is.







[GitHub] [spark] viirya commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-07 Thread GitBox


viirya commented on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-834126745


   ok to test





[GitHub] [spark] SparkQA commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-07 Thread GitBox


SparkQA commented on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-834128258


   **[Test build #138240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138240/testReport)** for PR 32448 at commit [`3c0d3d0`](https://github.com/apache/spark/commit/3c0d3d0af7d89d28dbe1db7b0c1b8e9dc8c090a5).





[GitHub] [spark] SparkQA commented on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


SparkQA commented on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834131039


   **[Test build #138229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138229/testReport)** for PR 32442 at commit [`4f8b782`](https://github.com/apache/spark/commit/4f8b7828a3448120e0d1fd2daeb9e8d3ab1a67eb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834026504


   **[Test build #138229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138229/testReport)** for PR 32442 at commit [`4f8b782`](https://github.com/apache/spark/commit/4f8b7828a3448120e0d1fd2daeb9e8d3ab1a67eb).





[GitHub] [spark] SparkQA commented on pull request #32446: [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client

2021-05-07 Thread GitBox


SparkQA commented on pull request #32446:
URL: https://github.com/apache/spark/pull/32446#issuecomment-834139845


   **[Test build #138236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138236/testReport)** for PR 32446 at commit [`88697a4`](https://github.com/apache/spark/commit/88697a43ba63963a1951f8d99a697fab4ca5692f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32446: [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32446:
URL: https://github.com/apache/spark/pull/32446#issuecomment-834071285


   **[Test build #138236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138236/testReport)** for PR 32446 at commit [`88697a4`](https://github.com/apache/spark/commit/88697a43ba63963a1951f8d99a697fab4ca5692f).





[GitHub] [spark] sunchao commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


sunchao commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834142995


   Thanks. They don't seem related. I tested them locally and all passed.





[GitHub] [spark] hddong commented on a change in pull request #32364: [SPARK-35242][SQL] Support change catalog default database for spark

2021-05-07 Thread GitBox


hddong commented on a change in pull request #32364:
URL: https://github.com/apache/spark/pull/32364#discussion_r628003003



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##
@@ -37,6 +37,12 @@ object StaticSQLConf {
     .stringConf
     .createWithDefault(Utils.resolveURI("spark-warehouse").toString)
 
+  val CATALOG_DEFAULT_DATABASE = buildStaticConf("spark.sql.catalog.default.database")
+    .doc("The default database for session catalog.")
+    .version("3.2.0")
+    .stringConf
+    .createWithDefault("default")

Review comment:
   IMO, the database needs to exist when not connecting to `default`.







[GitHub] [spark] hddong commented on pull request #32364: [SPARK-35242][SQL] Support change catalog default database for spark

2021-05-07 Thread GitBox


hddong commented on pull request #32364:
URL: https://github.com/apache/spark/pull/32364#issuecomment-834147167


   @cloud-fan @yaooqinn: thanks for your review.
   In my case, Hive permissions are managed by Ranger, and users do not have read access to `default`.
   Please review again.





[GitHub] [spark] viirya commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


viirya commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834148230


   retest this please





[GitHub] [spark] hddong commented on a change in pull request #32364: [SPARK-35242][SQL] Support change catalog default database for spark

2021-05-07 Thread GitBox


hddong commented on a change in pull request #32364:
URL: https://github.com/apache/spark/pull/32364#discussion_r628003003



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##
@@ -37,6 +37,12 @@ object StaticSQLConf {
     .stringConf
     .createWithDefault(Utils.resolveURI("spark-warehouse").toString)
 
+  val CATALOG_DEFAULT_DATABASE = buildStaticConf("spark.sql.catalog.default.database")
+    .doc("The default database for session catalog.")
+    .version("3.2.0")
+    .stringConf
+    .createWithDefault("default")

Review comment:
   IMO, the database needs to exist when not connecting to `default`. Currently, spark-shell (and spark-submit) always needs read permission on `default` during initialization.
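
   For illustration, this is roughly how the proposed setting might be used. It is a sketch under assumptions: the key `spark.sql.catalog.default.database` is the one proposed in this PR's diff and may not exist under that name in any released Spark version, `analytics` is a hypothetical database name, and the snippet needs Spark SQL on the classpath. Because the diff registers the key with `buildStaticConf`, it would have to be supplied when the session is built rather than changed afterwards.

```scala
import org.apache.spark.sql.SparkSession

object DefaultDatabaseSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the key below is the one proposed in this PR; "analytics" is
    // a hypothetical database that the current user is allowed to read.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("default-database-sketch")
      .config("spark.sql.catalog.default.database", "analytics")
      .getOrCreate()

    // Read the value back from the session configuration.
    println(spark.conf.get("spark.sql.catalog.default.database"))
    spark.stop()
  }
}
```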







[GitHub] [spark] SparkQA commented on pull request #32413: [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match

2021-05-07 Thread GitBox


SparkQA commented on pull request #32413:
URL: https://github.com/apache/spark/pull/32413#issuecomment-834152033


   **[Test build #138230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138230/testReport)** for PR 32413 at commit [`0ec8117`](https://github.com/apache/spark/commit/0ec8117aaae0708b19e817c61c780eff6af37cce).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32413: [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32413:
URL: https://github.com/apache/spark/pull/32413#issuecomment-834032204


   **[Test build #138230 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138230/testReport)** for PR 32413 at commit [`0ec8117`](https://github.com/apache/spark/commit/0ec8117aaae0708b19e817c61c780eff6af37cce).





[GitHub] [spark] AmplabJenkins commented on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834152962


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138229/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32446: [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32446:
URL: https://github.com/apache/spark/pull/32446#issuecomment-834152965


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138236/
   





[GitHub] [spark] SparkQA commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-07 Thread GitBox


SparkQA commented on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-834153023









[GitHub] [spark] AmplabJenkins commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-834153063


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42762/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32446: [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32446:
URL: https://github.com/apache/spark/pull/32446#issuecomment-834152965


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138236/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834152962


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138229/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32448:
URL: https://github.com/apache/spark/pull/32448#issuecomment-833167741









[GitHub] [spark] AmplabJenkins commented on pull request #32413: [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32413:
URL: https://github.com/apache/spark/pull/32413#issuecomment-834153566


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138230/
   





[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834153699


   **[Test build #138241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138241/testReport)** for PR 32354 at commit [`50dc32d`](https://github.com/apache/spark/commit/50dc32d89d3129a8e3d8e5019c4d7888ede30b4f).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32413: [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32413:
URL: https://github.com/apache/spark/pull/32413#issuecomment-834153566


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138230/
   





[GitHub] [spark] beliefer commented on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


beliefer commented on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834154950


   ping @cloud-fan 





[GitHub] [spark] SparkQA commented on pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


SparkQA commented on pull request #32459:
URL: https://github.com/apache/spark/pull/32459#issuecomment-834163796


   **[Test build #138233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138233/testReport)** for PR 32459 at commit [`8e9f6cb`](https://github.com/apache/spark/commit/8e9f6cb8d5b19792fc408c7b9fe9bcc77a4a56d7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `class CustomWriteTaskStatsTrackerSuite extends SparkFunSuite `
 * `class CustomWriteTaskStatsTracker extends WriteTaskStatsTracker `
 * `case class CustomWriteTaskStats(numRowsPerFile: Map[String, Int]) extends WriteTaskStats`





[GitHub] [spark] cloud-fan opened a new pull request #32466: [SPARK-35333][SQL] Skip object null check in Invoke if possible

2021-05-07 Thread GitBox


cloud-fan opened a new pull request #32466:
URL: https://github.com/apache/spark/pull/32466


   
   
   ### What changes were proposed in this pull request?
   
   If `targetObject` is not nullable, we don't need the object null check in `Invoke`.
   
   ### Why are the changes needed?
   
   small perf improvement
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   existing tests
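
A minimal sketch of the idea described above, not Spark's actual `Invoke` code generation: when the target is known to be non-nullable, the generated code can drop the null guard. `InvokeNullCheckSketch`, `Target`, and `genCall` are hypothetical names used only for illustration, and the emitted Java-like strings are simplified.

```scala
// Hypothetical sketch; this is not Spark's real `Invoke` implementation.
object InvokeNullCheckSketch {
  /** Minimal stand-in for an expression that may or may not produce null. */
  final case class Target(code: String, nullable: Boolean)

  /** Emits Java-like source that calls `method` on the target object. */
  def genCall(target: Target, method: String): String = {
    val call = s"${target.code}.$method()"
    if (target.nullable) {
      // The receiver may be null, so the call has to be guarded.
      s"if (${target.code} != null) { result = $call; } else { isNull = true; }"
    } else {
      // The receiver is known to be non-null, so the guard is skipped.
      s"result = $call;"
    }
  }
}
```

With a non-nullable target the emitted snippet is a single assignment, which is where the small performance gain would come from.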





[GitHub] [spark] cloud-fan commented on pull request #32466: [SPARK-35333][SQL] Skip object null check in Invoke if possible

2021-05-07 Thread GitBox


cloud-fan commented on pull request #32466:
URL: https://github.com/apache/spark/pull/32466#issuecomment-834165155


   cc @maropu @sunchao 





[GitHub] [spark] SparkQA removed a comment on pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32459:
URL: https://github.com/apache/spark/pull/32459#issuecomment-834046652


   **[Test build #138233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138233/testReport)** for PR 32459 at commit [`8e9f6cb`](https://github.com/apache/spark/commit/8e9f6cb8d5b19792fc408c7b9fe9bcc77a4a56d7).





[GitHub] [spark] cloud-fan commented on a change in pull request #32407: [SPARK-35261][SQL] Support static magic method for stateless ScalarFunction

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #32407:
URL: https://github.com/apache/spark/pull/32407#discussion_r628022822



##
File path: 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
##
@@ -29,33 +29,62 @@
  * 
  * The JVM type of result values produced by this function must be the type 
used by Spark's
  * InternalRow API for the {@link DataType SQL data type} returned by {@link 
#resultType()}.
+ * The mapping between {@link DataType} and the corresponding JVM type is 
defined below.
  * 
  * IMPORTANT: the default implementation of {@link #produceResult} 
throws
- * {@link UnsupportedOperationException}. Users can choose to override this 
method, or implement
- * a "magic method" with name {@link #MAGIC_METHOD_NAME} which takes 
individual parameters
- * instead of a {@link InternalRow}. The magic method will be loaded by Spark 
through Java
- * reflection and will also provide better performance in general, due to 
optimizations such as
- * codegen, removal of Java boxing, etc.
- *
+ * {@link UnsupportedOperationException}. Users must choose to either override 
this method, or
+ * implement a magic method with name {@link #MAGIC_METHOD_NAME}, which takes 
individual parameters
+ * instead of a {@link InternalRow}. The magic method approach is generally 
recommended because it
+ * provides better performance over the default {@link #produceResult}, due to 
optimizations such
+ * as whole-stage codegen, elimination of Java boxing, etc.
+ * 
+ * In addition, for stateless Java functions, users can optionally define the
+ * {@link #MAGIC_METHOD_NAME} as a static method, which further avoids certain 
runtime costs such
+ * as nullness check on the method receiver, potential Java dynamic dispatch, 
etc.

Review comment:
   I'm not sure it makes a big difference. I'm trying to remove the null check in https://github.com/apache/spark/pull/32466 ; can you run the benchmark again and see whether it makes any difference?
   
   My hunch is that eliminating dynamic dispatch is the main advantage of the static magic method.







[GitHub] [spark] cloud-fan commented on pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


cloud-fan commented on pull request #32459:
URL: https://github.com/apache/spark/pull/32459#issuecomment-834169943


   thanks for review, merging to master!





[GitHub] [spark] cloud-fan closed pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


cloud-fan closed pull request #32459:
URL: https://github.com/apache/spark/pull/32459


   





[GitHub] [spark] SparkQA commented on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


SparkQA commented on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834170936


   **[Test build #138231 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138231/testReport)** for PR 32465 at commit [`0c711e3`](https://github.com/apache/spark/commit/0c711e3a081dc644c3a2d3c47207046eb4457ee1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on pull request #32367: [SPARK-35020][SQL] Group exception messages in catalyst/util

2021-05-07 Thread GitBox


cloud-fan commented on pull request #32367:
URL: https://github.com/apache/spark/pull/32367#issuecomment-834171176


   thanks, merging to master!





[GitHub] [spark] cloud-fan closed pull request #32367: [SPARK-35020][SQL] Group exception messages in catalyst/util

2021-05-07 Thread GitBox


cloud-fan closed pull request #32367:
URL: https://github.com/apache/spark/pull/32367


   





[GitHub] [spark] SparkQA removed a comment on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834046574


   **[Test build #138231 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138231/testReport)** for PR 32465 at commit [`0c711e3`](https://github.com/apache/spark/commit/0c711e3a081dc644c3a2d3c47207046eb4457ee1).





[GitHub] [spark] cloud-fan commented on pull request #32377: [SPARK-35021][SQL] Group exception messages in connector/catalog

2021-05-07 Thread GitBox


cloud-fan commented on pull request #32377:
URL: https://github.com/apache/spark/pull/32377#issuecomment-834171929


   @beliefer can you fix the conflicts? thanks!





[GitHub] [spark] cloud-fan commented on pull request #27432: [SPARK-28325][SQL]Support ANSI SQL: SIMILAR TO ... ESCAPE syntax

2021-05-07 Thread GitBox


cloud-fan commented on pull request #27432:
URL: https://github.com/apache/spark/pull/27432#issuecomment-834172751


   can you fix the conflicts? thanks!





[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

2021-05-07 Thread GitBox


Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r628004164



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
*/
   def withColumn(colName: String, col: Column): DataFrame = 
withColumns(Seq(colName), Seq(col))
 
+  /**
+   * (Scala-specific) Returns a new Dataset by adding columns or replacing the 
existing columns
+   * that has the same names.
+   *
+   * `colsMap` is a map of column name and column, the column must only refer 
to attributes
+   * supplied by this Dataset. It is an error to add columns that refers to 
some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
+   */
+  def withColumns(colsMap: Map[String, Column]): DataFrame = {
+val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq

Review comment:
   done, thanks for your suggestion!
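
For reference, a minimal sketch of one simpler way to derive aligned name and column sequences from `colsMap`. This is an assumption about the kind of simplification involved, not necessarily the change that was made in the PR; `WithColumnsSketch` and `Col` are hypothetical stand-ins so the snippet runs without Spark.

```scala
object WithColumnsSketch {
  // `Col` is a hypothetical stand-in for Spark's `Column`.
  final case class Col(expr: String)

  /** Splits a name -> column map into two sequences that stay aligned by position. */
  def split(colsMap: Map[String, Col]): (Seq[String], Seq[Col]) =
    colsMap.toSeq.unzip
}
```

Because both sequences come from a single traversal of the map, the i-th name always corresponds to the i-th column, which is what a `withColumns(Seq, Seq)` style overload would rely on.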

##
File path: python/pyspark/sql/dataframe.pyi
##
@@ -250,6 +250,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 self, cols: Union[List[str], Tuple[str]], support: Optional[float] = 
...
 ) -> DataFrame: ...
 def withColumn(self, colName: str, col: Column) -> DataFrame: ...
+def withColumns(self, colsMap: Dict[str, Column] ) -> DataFrame: ...

Review comment:
   done

##
File path: python/pyspark/sql/dataframe.py
##
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
 support = 0.01
 return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), 
support), self.sql_ctx)
 
+def withColumns(self, colsMap):
+"""
+Returns a new :class:`DataFrame` by adding multiple columns or 
replacing the
+existing columns that has the same name.
+
+The colsMap is a map of column name and column, the column must only 
refer to attribute
+supplied by this Dataset. It is an error to add columns that refers to 
some other Dataset.

Review comment:
   done

##
File path: python/pyspark/sql/dataframe.py
##
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
 support = 0.01
 return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), 
support), self.sql_ctx)
 
+def withColumns(self, colsMap):
+"""
+Returns a new :class:`DataFrame` by adding multiple columns or 
replacing the
+existing columns that has the same name.

Review comment:
   done

##
File path: python/pyspark/sql/dataframe.py
##
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
 support = 0.01
 return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), 
support), self.sql_ctx)
 
+def withColumns(self, colsMap):
+"""
+Returns a new :class:`DataFrame` by adding multiple columns or 
replacing the
+existing columns that has the same name.
+
+The colsMap is a map of column name and column, the column must only 
refer to attribute

Review comment:
   done







[GitHub] [spark] HeartSaVioR commented on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


HeartSaVioR commented on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-834175330


   ok to test





[GitHub] [spark] beliefer commented on pull request #32367: [SPARK-35020][SQL] Group exception messages in catalyst/util

2021-05-07 Thread GitBox


beliefer commented on pull request #32367:
URL: https://github.com/apache/spark/pull/32367#issuecomment-834175496


   @allisonwang-db Thank you for review. @cloud-fan Thank you too.





[GitHub] [spark] cloud-fan commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628032240



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##
@@ -193,13 +193,32 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with 
SQLConfHelper with Logg
* This is only used for Common Table Expressions.
*/
   override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = 
withOrigin(ctx) {
-val subQuery: LogicalPlan = plan(ctx.query).optionalMap(ctx.columnAliases)(
+val logicalPlan = Option(ctx.query).map(plan).orElse(
+  Option(ctx.ddlStatementForQuery).map(visitDdlStatementForQuery)).get

Review comment:
   nit: we can call `visitDdlQuery` and don't need to create `visitDdlStatementForQuery`







[GitHub] [spark] cloud-fan commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628032519



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
##
@@ -193,13 +193,32 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with 
SQLConfHelper with Logg
* This is only used for Common Table Expressions.
*/
   override def visitNamedQuery(ctx: NamedQueryContext): SubqueryAlias = 
withOrigin(ctx) {
-val subQuery: LogicalPlan = plan(ctx.query).optionalMap(ctx.columnAliases)(
+val logicalPlan = Option(ctx.query).map(plan).orElse(
+  Option(ctx.ddlStatementForQuery).map(visitDdlStatementForQuery)).get
+val subQuery: LogicalPlan = logicalPlan.optionalMap(ctx.columnAliases)(
   (columnAliases, plan) =>
 UnresolvedSubqueryColumnAliases(visitIdentifierList(columnAliases), 
plan)
 )
 SubqueryAlias(ctx.name.getText, subQuery)
   }
 
+  override def visitDdlQuery(ctx: DdlQueryContext): LogicalPlan = withOrigin(ctx) {
+visitDdlStatementForQuery(ctx.ddlStatementForQuery())
+  }
+
+  def visitDdlStatementForQuery(ctx: DdlStatementForQueryContext): LogicalPlan = withOrigin(ctx) {
+ctx match {
+  case namespaces: ShowNamespacesContext => visitShowNamespaces(namespaces)
+  case tables: ShowTablesContext => visitShowTables(tables)
+  case tblProperties: ShowTblPropertiesContext => visitShowTblProperties(tblProperties)
+  case partitions: ShowPartitionsContext => visitShowPartitions(partitions)
+  case columns: ShowColumnsContext => visitShowColumns(columns)
+  case views: ShowViewsContext => visitShowViews(views)
+  case functions: ShowFunctionsContext => visitShowFunctions(functions)
+  case _ => throw QueryParsingErrors.unsupportedDdlStatementForQueryError(ctx)

Review comment:
   This can't happen, so it is an assert-like error.
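
A minimal sketch of what treating the fallback branch as an assert-like error could look like. The context classes below are hypothetical stand-ins for the ANTLR-generated ones, and `IllegalStateException` is only one possible choice for signalling an internal bug; the PR may handle it differently.

```scala
object DdlDispatchSketch {
  // Hypothetical stand-ins for the ANTLR-generated context classes.
  trait DdlContext
  final class ShowTablesContext extends DdlContext
  final class ShowViewsContext extends DdlContext
  final class OtherContext extends DdlContext

  /** Dispatches on the concrete context type, returning a label for the plan to build. */
  def visit(ctx: DdlContext): String = ctx match {
    case _: ShowTablesContext => "ShowTables"
    case _: ShowViewsContext => "ShowViews"
    // The grammar should make this branch unreachable, so it reports an internal
    // bug rather than a user-facing parsing error.
    case other =>
      throw new IllegalStateException(s"Unexpected DDL context: ${other.getClass.getName}")
  }
}
```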







[GitHub] [spark] beliefer commented on pull request #32377: [SPARK-35021][SQL] Group exception messages in connector/catalog

2021-05-07 Thread GitBox


beliefer commented on pull request #32377:
URL: https://github.com/apache/spark/pull/32377#issuecomment-834176700


   > @beliefer can you fix the conflicts? thanks!
   
   OK





[GitHub] [spark] cloud-fan commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628032974



##
File path: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##
@@ -375,8 +363,18 @@ ctes
 : WITH namedQuery (',' namedQuery)*
 ;
 
+ddlStatementForQuery

Review comment:
   how about `informationQueries`?







[GitHub] [spark] cloud-fan commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


cloud-fan commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628033374



##
File path: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
##
@@ -84,6 +84,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
 "date.sql",
 // SPARK-28620
 "postgreSQL/float4.sql",
+// SPARK-35283
+"cte-ddl.sql",

Review comment:
   Why doesn't it work in the Thrift server?







[GitHub] [spark] cloud-fan commented on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


cloud-fan commented on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834177349


   The syntax looks good, cc @yaooqinn @wangyum @viirya @maropu 





[GitHub] [spark] yijiacui-db commented on pull request #31944: [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay.

2021-05-07 Thread GitBox


yijiacui-db commented on pull request #31944:
URL: https://github.com/apache/spark/pull/31944#issuecomment-834179633


   > Thanks all for thoughtful reviewing and thanks @yijiacui-db for the contribution! Merged to master.
   
   @HeartSaVioR @xuanyuanking @gaborgsomogyi @viirya Thank you so much for reviewing this PR!





[GitHub] [spark] sowantharajan commented on pull request #20692: [SPARK-23531][SQL] Show attribute type in explain

2021-05-07 Thread GitBox


sowantharajan commented on pull request #20692:
URL: https://github.com/apache/spark/pull/20692#issuecomment-834181214


   This is a known hierarchical query problem, which requires recursive queries. This feature is available in query languages like SQL. Kindly open this ticket and add support for this option.





[GitHub] [spark] SparkQA commented on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


SparkQA commented on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834183373


   **[Test build #138234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138234/testReport)** for PR 32465 at commit [`03ed3a5`](https://github.com/apache/spark/commit/03ed3a5a665adecd7a49d22242506ed1df96aa0f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


SparkQA commented on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834183769


   **[Test build #138239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138239/testReport)** for PR 32287 at commit [`af4381c`](https://github.com/apache/spark/commit/af4381c1205958e86e50c74c0a3b0eb2f4a445d9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] beliefer commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


beliefer commented on a change in pull request #32442:
URL: https://github.com/apache/spark/pull/32442#discussion_r628040787



##
File path: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
##
@@ -84,6 +84,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
 "date.sql",
 // SPARK-28620
 "postgreSQL/float4.sql",
+// SPARK-35283
+"cte-ddl.sql",

Review comment:
   Because the output schema of Hive is different from Spark's.







[GitHub] [spark] SparkQA removed a comment on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834094979


   **[Test build #138239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138239/testReport)** for PR 32287 at commit [`af4381c`](https://github.com/apache/spark/commit/af4381c1205958e86e50c74c0a3b0eb2f4a445d9).





[GitHub] [spark] SparkQA removed a comment on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834048943


   **[Test build #138234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138234/testReport)** for PR 32465 at commit [`03ed3a5`](https://github.com/apache/spark/commit/03ed3a5a665adecd7a49d22242506ed1df96aa0f).





[GitHub] [spark] SparkQA commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


SparkQA commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834188152









[GitHub] [spark] wangyum commented on pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES

2021-05-07 Thread GitBox


wangyum commented on pull request #32442:
URL: https://github.com/apache/spark/pull/32442#issuecomment-834189005


   +1. This syntax looks good.





[GitHub] [spark] peter-toth commented on a change in pull request #32298: [SPARK-34079][SQL] Merge non-correlated scalar subqueries to multi-column scalar subqueries for better reuse

2021-05-07 Thread GitBox


peter-toth commented on a change in pull request #32298:
URL: https://github.com/apache/spark/pull/32298#discussion_r628046114



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LeafNode, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{MULTI_SCALAR_SUBQUERY, SCALAR_SUBQUERY}
+
+/**
+ * This rule tries to merge multiple non-correlated [[ScalarSubquery]]s into a
+ * [[MultiScalarSubquery]] to compute multiple scalar values once.
+ *
+ * The process is the following:
+ * - While traversing through the plan each [[ScalarSubquery]] plan is tried to merge into the cache
+ *   of already seen subquery plans. If merge is possible then cache is updated with the merged
+ *   subquery plan, if not then the new subquery plan is added to the cache.
+ * - The original [[ScalarSubquery]] expression is replaced to a reference pointing to its cached
+ *   version in this form: `GetStructField(MultiScalarSubquery(SubqueryReference(...)))`.
+ * - A second traversal checks if a [[SubqueryReference]] is pointing to a subquery plan that
+ *   returns multiple values and either replaces only [[SubqueryReference]] to the cached plan or
+ *   restores the whole expression to its original [[ScalarSubquery]] form.
+ * - [[ReuseSubquery]] rule makes sure that merged subqueries are computed once.
+ *
+ * Eg. the following query:
+ *
+ * SELECT
+ *   (SELECT avg(a) FROM t GROUP BY b),
+ *   (SELECT sum(b) FROM t GROUP BY b)
+ *
+ * is optimized from:
+ *
+ * Project [scalar-subquery#231 [] AS scalarsubquery()#241,
+ *   scalar-subquery#232 [] AS scalarsubquery()#242L]
+ * :  :- Aggregate [b#234], [avg(a#233) AS avg(a)#236]
+ * :  :  +- Relation default.t[a#233,b#234] parquet
+ * :  +- Aggregate [b#240], [sum(b#240) AS sum(b)#238L]
+ * :     +- Project [b#240]
+ * :        +- Relation default.t[a#239,b#240] parquet

Review comment:
   > The proposed rule augments two subqueries, makes them look identical, and hopes (a) column-pruning doesn't prune too aggressively and (b) physical de-dup could dedup them. In case (a) changes later and the two aggregate trees are not deduped in the physical plan, there could potentially be regressions -- each aggregation then becomes more expensive.
   
   In this PR the new `MergeScalarSubqueries` rule runs in a separate batch after column pruning, close to the end of optimization. This is by design, to make sure that no subsequent rule changes the structure of the different instances of a merged subquery plan differently at different places in the logical plan. So physical planning creates the same physical plan for these instances and there shouldn't be any dedup issues.
   
   I think the probable downside of my current PR is that the physical planning of merged subqueries happens multiple times (as many times as they appear in the logical plan) and physical dedup comes only after that. This could be improved if we had subquery references in the logical plan as well (something like `ReuseSubqueryExec`). But I think that's what your (1) is about: move the merged subqueries to a special top logical plan node and add subquery references at the places where they are actually used.
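
The merge-into-cache step can be sketched with a toy model. This is only an illustration under simplifying assumptions, not the rule from the PR: `Agg`, `tryMerge` and `mergeAll` are hypothetical names, a plan is reduced to a table, a grouping and a list of aggregate expressions, and each input subquery is assumed to carry exactly one aggregate.

```scala
import scala.collection.mutable.ArrayBuffer

object MergeSubqueriesSketch {
  // Toy model of an aggregate subquery; hypothetical and far simpler than Catalyst's LogicalPlan.
  final case class Agg(table: String, groupBy: Seq[String], aggs: Seq[String])

  // Two toy plans are mergeable only if they scan the same table with the same grouping.
  def tryMerge(cached: Agg, next: Agg): Option[Agg] =
    if (cached.table == next.table && cached.groupBy == next.groupBy) {
      Some(cached.copy(aggs = (cached.aggs ++ next.aggs).distinct))
    } else {
      None
    }

  // Returns the cache of merged plans and, for each input subquery (assumed to have
  // exactly one aggregate), a (cache index, field index) reference into the merged output.
  def mergeAll(subqueries: Seq[Agg]): (List[Agg], List[(Int, Int)]) = {
    val cache = ArrayBuffer.empty[Agg]
    val refs = ArrayBuffer.empty[(Int, Int)]
    subqueries.foreach { sq =>
      val idx = cache.indexWhere(c => tryMerge(c, sq).isDefined)
      if (idx >= 0) {
        // Merge into the already seen plan; new aggregates are appended,
        // so earlier field indices stay valid.
        val merged = tryMerge(cache(idx), sq).get
        cache(idx) = merged
        refs += ((idx, merged.aggs.indexOf(sq.aggs.head)))
      } else {
        // No mergeable plan seen yet: add this subquery to the cache as is.
        cache += sq
        refs += ((cache.size - 1, 0))
      }
    }
    (cache.toList, refs.toList)
  }
}
```

For the two subqueries in the scaladoc example (`avg(a)` and `sum(b)` over `t` grouped by `b`), both references point to the same cache entry, which plays the role of the field index in `GetStructField(...)`.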
   
   > SELECT y
   > FROM LATERAL VIEW explode(ARRAY(ARRAY(1), ARRAY(1, 2), ARRAY(1, 2, 3))) AS y
   > WHERE
   > ( SELECT COUNT(*) FROM LATERAL VIEW explode(y) AS element ) > 1
   > AND
   > ( SELECT SUM(element) FROM LATERAL VIEW explode(y) AS element ) > 3
   >
   > I noticed that such subqueries do not work for now. But they align with the language spec and have well defined semantics. Once we support them, we want your proposed rule to be able to speed them up as well.
   
   Ah ok, but what should be the optimized plan of that query? This looks like we have 2 correlated subqueries and (2) makes perfect sense to merge them. But

[GitHub] [spark] HeartSaVioR commented on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


HeartSaVioR commented on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-834190258


   Thanks for the contribution! The use case is interesting, especially when the sink is sensitive to the number of batches and is sub-optimal for lots of small batches.
   
   One thing we need to deal with is admission control - from Spark 3.0, the Spark community generalized the requirement of max offsets/max files per trigger into `SupportsAdmissionControl`. This was actually done to make sure the once trigger is not affected by max offsets/max files per trigger, but given that max offsets per trigger is generalized to `ReadMaxRows`, this is another thing we may want to generalize.
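
A rough sketch of what a generalized read limit with a minimum could look like. This is a self-contained toy model under assumptions, not Spark's `ReadLimit` API: `Limit`, `MinRows` and `admit` are hypothetical names, and a real source would probably also need a maximum delay so that a slow topic is not held back forever.

```scala
object ReadLimitSketch {
  // Toy admission-control model; these types are hypothetical, not Spark classes.
  sealed trait Limit
  case object AllAvailable extends Limit              // e.g. what a once trigger would use
  final case class MaxRows(rows: Long) extends Limit  // cap on rows per trigger
  final case class MinRows(rows: Long) extends Limit  // hypothetical lower bound per trigger

  // Given the rows available since the last batch, decide how many to admit in this trigger.
  def admit(available: Long, limit: Limit): Long = limit match {
    case AllAvailable => available
    case MaxRows(n) => math.min(available, n)
    // Skip the trigger entirely until enough rows have accumulated.
    case MinRows(n) => if (available >= n) available else 0L
  }
}
```

Modelling the minimum as one more limit type keeps the source's offset calculation in a single place, which seems to be the kind of generalization the comment asks about.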
   
   cc. @brkyvz as I see some opportunity to improve ReadLimit on this use case.
   
   Also cc. @tdas @zsxwing @viirya @gaborgsomogyi @xuanyuanking 





[GitHub] [spark] AmplabJenkins commented on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834191115


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42763/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834191119









[GitHub] [spark] AmplabJenkins commented on pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32459:
URL: https://github.com/apache/spark/pull/32459#issuecomment-834191113


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138233/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


AmplabJenkins commented on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834191117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138239/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32354: [SPARK-35232][SQL] Nested column pruning should retain column metadata

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32354:
URL: https://github.com/apache/spark/pull/32354#issuecomment-834191115


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42763/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32459: [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32459:
URL: https://github.com/apache/spark/pull/32459#issuecomment-834191113


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138233/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32287: [SPARK-27991][CORE] Defer the fetch request on Netty OOM

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32287:
URL: https://github.com/apache/spark/pull/32287#issuecomment-834191117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138239/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32465: [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32465:
URL: https://github.com/apache/spark/pull/32465#issuecomment-834191116









[GitHub] [spark] SparkQA commented on pull request #32466: [SPARK-35333][SQL] Skip object null check in Invoke if possible

2021-05-07 Thread GitBox


SparkQA commented on pull request #32466:
URL: https://github.com/apache/spark/pull/32466#issuecomment-834191972


   **[Test build #138242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138242/testReport)** for PR 32466 at commit [`77f6576`](https://github.com/apache/spark/commit/77f6576226822a3f36aa52fcff0889541c991b81).





[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

2021-05-07 Thread GitBox


SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834192100


   **[Test build #138244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).





[GitHub] [spark] SparkQA commented on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


SparkQA commented on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-834192073


   **[Test build #138243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138243/testReport)** for PR 32434 at commit [`be9805d`](https://github.com/apache/spark/commit/be9805d1ef5852254e415a30ef600db807264ca1).





[GitHub] [spark] SparkQA commented on pull request #32377: [SPARK-35021][SQL] Group exception messages in connector/catalog

2021-05-07 Thread GitBox


SparkQA commented on pull request #32377:
URL: https://github.com/apache/spark/pull/32377#issuecomment-834192177


   **[Test build #138245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138245/testReport)** for PR 32377 at commit [`4db4250`](https://github.com/apache/spark/commit/4db425003f8d0a75e0238672ae50082ceb8cd751).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


AmplabJenkins removed a comment on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-831878519


   Can one of the admins verify this patch?





[GitHub] [spark] peter-toth commented on a change in pull request #32298: [SPARK-34079][SQL] Merge non-correlated scalar subqueries to multi-column scalar subqueries for better reuse

2021-05-07 Thread GitBox


peter-toth commented on a change in pull request #32298:
URL: https://github.com/apache/spark/pull/32298#discussion_r628046114



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LeafNode, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{MULTI_SCALAR_SUBQUERY, 
SCALAR_SUBQUERY}
+
+/**
+ * This rule tries to merge multiple non-correlated [[ScalarSubquery]]s into a
+ * [[MultiScalarSubquery]] to compute multiple scalar values once.
+ *
+ * The process is the following:
+ * - While traversing through the plan each [[ScalarSubquery]] plan is tried 
to merge into the cache
+ *   of already seen subquery plans. If merge is possible then cache is 
updated with the merged
+ *   subquery plan, if not then the new subquery plan is added to the cache.
+ * - The original [[ScalarSubquery]] expression is replaced to a reference 
pointing to its cached
+ *   version in this form: 
`GetStructField(MultiScalarSubquery(SubqueryReference(...)))`.
+ * - A second traversal checks if a [[SubqueryReference]] is pointing to a 
subquery plan that
+ *   returns multiple values and either replaces only [[SubqueryReference]] to 
the cached plan or
+ *   restores the whole expression to its original [[ScalarSubquery]] form.
+ * - [[ReuseSubquery]] rule makes sure that merged subqueries are computed 
once.
+ *
+ * Eg. the following query:
+ *
+ * SELECT
+ *   (SELECT avg(a) FROM t GROUP BY b),
+ *   (SELECT sum(b) FROM t GROUP BY b)
+ *
+ * is optimized from:
+ *
+ * Project [scalar-subquery#231 [] AS scalarsubquery()#241,
+ *   scalar-subquery#232 [] AS scalarsubquery()#242L]
+ * :  :- Aggregate [b#234], [avg(a#233) AS avg(a)#236]
+ * :  :  +- Relation default.t[a#233,b#234] parquet
+ * :  +- Aggregate [b#240], [sum(b#240) AS sum(b)#238L]
+ * : +- Project [b#240]
+ * :+- Relation default.t[a#239,b#240] parquet

Review comment:
   > The proposed rule augments two subqueries, makes them look identical, 
and hopes (a) column-pruning doesn't prune too aggressively and (b) physical 
de-dup could dedup them. In case (a) changes later and the two aggregate trees 
are not deduped in the physical plan, there could potentially be regressions -- 
each aggregation then becomes more expensive.
   
   In this PR the new `MergeScalarSubqueries` rule runs in a separate batch
after column pruning, close to the end of optimization. This is by design: it
ensures that no subsequent rule rewrites the instances of a merged subquery
plan differently at the different places where they appear in the logical
plan. Physical planning therefore creates the same physical plan for each
instance, and there shouldn't be any dedup issues.
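
To illustrate the point about identical instances, here is a minimal toy sketch of the merge-into-cache idea. It does not use Catalyst: `AggPlan`, `SubqueryRef` and `MergeSketch` are hypothetical stand-ins, the "same table and same grouping" check is a deliberately simplified assumption rather than the PR's actual merge condition, and each toy subquery is assumed to carry exactly one aggregate.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
final case class AggPlan(table: String, groupBy: Seq[String], aggregates: Vector[String])

// Points at a cached (possibly merged) plan and the output column to read from it.
final case class SubqueryRef(cacheIndex: Int, fieldIndex: Int)

object MergeSketch {
  // Simplifying assumption: plans merge when they scan the same table with the same grouping.
  def mergeable(a: AggPlan, b: AggPlan): Boolean =
    a.table == b.table && a.groupBy == b.groupBy

  // Try to merge each subquery into the cache; if that fails, add it as a new entry.
  // Assumes every input subquery has exactly one aggregate expression.
  def merge(subqueries: Seq[AggPlan]): (Vector[AggPlan], Vector[SubqueryRef]) =
    subqueries.foldLeft((Vector.empty[AggPlan], Vector.empty[SubqueryRef])) {
      case ((cache, refs), sq) =>
        cache.indexWhere(mergeable(_, sq)) match {
          case -1 =>
            // Nothing to merge with: the new plan becomes the last cache entry.
            (cache :+ sq, refs :+ SubqueryRef(cache.length, 0))
          case i =>
            val cached = cache(i)
            val agg = sq.aggregates.head
            val existing = cached.aggregates.indexOf(agg)
            if (existing >= 0) {
              // The cached plan already computes this value: just reference it.
              (cache, refs :+ SubqueryRef(i, existing))
            } else {
              // Extend the cached plan with the new aggregate and reference the new column.
              val merged = cached.copy(aggregates = cached.aggregates :+ agg)
              (cache.updated(i, merged), refs :+ SubqueryRef(i, merged.aggregates.length - 1))
            }
        }
    }
}
```

In this toy model both references resolve to the same cached value, so a dedup based on structural equality collapses them to a single plan, provided nothing rewrites one instance differently from the other afterwards.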
   
   I think the downside of my current PR is that the physical planning of a
merged subquery happens multiple times (as many times as it appears in the
logical plan) and physical dedup comes only after that. This could be improved
if we had subquery references in the logical plan as well (something like
`ReuseSubqueryExec`). But I think that's what your (1) is about: move the
merged subqueries to a special top logical plan node and add subquery
references at the places where they are actually used.
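
The difference in planning cost can be sketched with a toy planner that counts its calls. Everything here is hypothetical and independent of Spark: `LogicalSubquery`, `PhysicalSubquery`, `ToyPlanner` and `PlanningSketch` only model "plan every occurrence, then dedup" versus "plan each shared definition once".

```scala
import scala.collection.mutable

// Hypothetical stand-ins for a logical and a physical subquery plan.
final case class LogicalSubquery(description: String)
final case class PhysicalSubquery(description: String)

// Toy planner that records how many times planning was invoked.
final class ToyPlanner {
  var planCalls: Int = 0

  def plan(logical: LogicalSubquery): PhysicalSubquery = {
    planCalls += 1
    PhysicalSubquery(s"exec(${logical.description})")
  }
}

object PlanningSketch {
  // Shape described above: every occurrence is planned, duplicates are dropped afterwards.
  def planThenDedup(
      planner: ToyPlanner,
      occurrences: Seq[LogicalSubquery]): Seq[PhysicalSubquery] =
    occurrences.map(planner.plan).distinct

  // Alternative: occurrences act as references to shared definitions, each planned once.
  def planViaReferences(
      planner: ToyPlanner,
      occurrences: Seq[LogicalSubquery]): Seq[PhysicalSubquery] = {
    val planned = mutable.LinkedHashMap.empty[LogicalSubquery, PhysicalSubquery]
    occurrences.foreach(l => planned.getOrElseUpdate(l, planner.plan(l)))
    planned.values.toSeq
  }
}
```

Both functions return the same physical plans in this model; only the number of `plan` calls differs, which is the cost discussed above.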
   
   > SELECT y
   FROM LATERAL VIEW explode(ARRAY(ARRAY(1), ARRAY(1, 2), ARRAY(1, 2, 3))) AS y
   WHERE
   ( SELECT COUNT(*) FROM LATERAL VIEW explode(y) AS element ) > 1
   AND
   ( SELECT SUM(element) FROM LATERAL VIEW explode(y) AS element ) > 3
   I noticed that such subqueries do not work for now. But they align with the 
language spec and has well defined semantics. Once we support them, we want 
your proposed rule to be able to speedup them as well.
   
   Ah ok, but what should be the optimized plan of that query? This looks like 
we have 2 correlated subqueries and (2) makes perfect sense to merge them. But 

[GitHub] [spark] wangyum commented on a change in pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-07 Thread GitBox


wangyum commented on a change in pull request #29642:
URL: https://github.com/apache/spark/pull/29642#discussion_r628066242



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
##
@@ -188,6 +188,15 @@ abstract class ParquetFilterSuite extends QueryTest with ParquetTest with Shared
   checkFilterPredicate(!(tsAttr < ts4.ts), classOf[GtEq[_]], resultFun(ts4))
   checkFilterPredicate(tsAttr < ts2.ts || tsAttr > ts3.ts, classOf[Operators.Or],
     Seq(Row(resultFun(ts1)), Row(resultFun(ts4))))
+
+  Seq(3, 20).foreach { threshold =>
+    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD.key -> s"$threshold") {

Review comment:
   Done







[GitHub] [spark] SparkQA commented on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-07 Thread GitBox


SparkQA commented on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-834210377


   **[Test build #138235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138235/testReport)** for PR 32464 at commit [`ce9d446`](https://github.com/apache/spark/commit/ce9d4469ac2d05b5c02cfe2940220ef14088bb37).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32464: [SPARK-35062][SQL] Group exception messages in sql/streaming

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32464:
URL: https://github.com/apache/spark/pull/32464#issuecomment-834068621


   **[Test build #138235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138235/testReport)** for PR 32464 at commit [`ce9d446`](https://github.com/apache/spark/commit/ce9d4469ac2d05b5c02cfe2940220ef14088bb37).





[GitHub] [spark] cloud-fan opened a new pull request #32467: [WIP] simplify correlated subquery resolution

2021-05-07 Thread GitBox


cloud-fan opened a new pull request #32467:
URL: https://github.com/apache/spark/pull/32467


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] SparkQA commented on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


SparkQA commented on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-834218387


   **[Test build #138243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138243/testReport)** for PR 32434 at commit [`be9805d`](https://github.com/apache/spark/commit/be9805d1ef5852254e415a30ef600db807264ca1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #32434: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-07 Thread GitBox


SparkQA removed a comment on pull request #32434:
URL: https://github.com/apache/spark/pull/32434#issuecomment-834192073


   **[Test build #138243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138243/testReport)** for PR 32434 at commit [`be9805d`](https://github.com/apache/spark/commit/be9805d1ef5852254e415a30ef600db807264ca1).





[GitHub] [spark] HyukjinKwon commented on pull request #32455: [WIP][SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4(latest)

2021-05-07 Thread GitBox


HyukjinKwon commented on pull request #32455:
URL: https://github.com/apache/spark/pull/32455#issuecomment-834219899


   cc @rednaxelafx too FYI





[GitHub] [spark] SparkQA commented on pull request #32377: [SPARK-35021][SQL] Group exception messages in connector/catalog

2021-05-07 Thread GitBox


SparkQA commented on pull request #32377:
URL: https://github.com/apache/spark/pull/32377#issuecomment-834221675


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42767/
   





[GitHub] [spark] SparkQA commented on pull request #32466: [SPARK-35333][SQL] Skip object null check in Invoke if possible

2021-05-07 Thread GitBox


SparkQA commented on pull request #32466:
URL: https://github.com/apache/spark/pull/32466#issuecomment-834223084








