[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15218 The test case design is pretty good. It covers all the scenarios. - Could you add a check for the negative case? That means, when users do not provide the right TaskAssigner name, we fall back to the default round robin one - For the existing unchanged test cases in `TaskSchedulerImplSuite.scala`, please add a check to verify whether it picks the default one. - If possible, please change one of the existing test case in `TaskSchedulerImplSuite.scala`, ensure that users are allowed to input the round robin as the task assigner. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 That is a design decision we need to make. Personally, MySQL and PostgreSQL are not good examples we should follow. Let us ask ourselves a simple question: > When users input `'\a'`, what is their expected results? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 Also checked SQL Server. It behaves differently from both `Postgre` and `DB2`. It simply ignores it. > If there is no character after an escape character in the LIKE pattern, the pattern is not valid and the LIKE returns FALSE. If the character after an escape character is not a wildcard character, the escape character is discarded and the character following the escape is treated as a regular character in the pattern. This includes the percent sign (%), underscore (_), and left bracket ([) wildcard characters when they are enclosed in double brackets ([ ]). Also, within the double bracket characters ([ ]), escape characters can be used and the caret (^), hyphen (-), and right bracket (]) can be escaped. Let me check Oracle now --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 Checked Oracle. Oracle has the same behavior like DB2: ``` SQL> SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\' ESCAPE '\'; SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\' ESCAPE '\' * ERROR at line 1: ORA-01424: missing or illegal character following the escape character SQL> SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\a' ESCAPE '\'; SELECT USERNAME FROM ALL_USERS WHERE USERNAME LIKE '%P%\a' ESCAPE '\' * ERROR at line 1: ORA-01424: missing or illegal character following the escape character ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 This is different. > If there is no character after an escape character in the LIKE pattern, the pattern is not valid and the LIKE returns FALSE. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 ``` ORA-01424: missing or illegal character following the escape character Cause: The character following the escape character in LIKE pattern is missing or not one of the escape character, '%', or '_'. Action: Remove the escape character or specify the missing character. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 We need to answer all the three questions listed by @jodersky --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15398: [SPARK-17647][SQL][WIP] Fix backslash escaping in 'LIKE'...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15398 If you check what we did in the other PRs, we do not need to strictly follow Hive. I do not have a strong opinion in all the above options. Let @rxin @yhuai make a decision. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15432: [SPARK-17854][SQL] rand/randn allows null/long as input ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15432 @srowen Previously, I also assumed Spark SQL is following the behaviors defined by Hive. However, we already merged a few PRs in Spark SQL that do not follow Hive. Especially, Hive has so many [configuration flags](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties), but we do not want to add them for better usability. In the long term, when enterprise customers start using Spark more and more, I expect they might not even care whether we are Hive compliant or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15502 cc @yhuai @hvanhovell @cloud-fan I guess this needs to be merged to 2.0.2 ASAP? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15502#discussion_r83587470 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -510,7 +510,7 @@ private[hive] case class InsertIntoHiveTable( child: LogicalPlan, overwrite: Boolean, ifNotExists: Boolean) - extends LogicalPlan with Command { --- End diff -- In the Command, this PR requires [the child must be empty](https://github.com/gatorsmile/spark/blob/9cfebc523e4b88c3df3ffae8ca5ea92e98a0a616/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Command.scala#L28) . Should we convert `InsertIntoHiveTable` to a non-child `Command`? Just FYI, in Spark 2.1, `InsertIntoTable` is still a `LogicalPlan` instead of a `Command`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15316 Also cc @cloud-fan and @liancheng --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15316 cc @hvanhovell @rxin Any more comment about this PR? I assume Spark 2.0.2 needs it. Recently, when we analyzing the JIRA https://issues.apache.org/jira/browse/SPARK-17709, we are unable to see the plan due the analyzer failure. The users have to manually rebuild it with this fix. Then, we can see the failed analyzed plan. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/15502 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15502 Thanks! Close it now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15316: [SPARK-17751] [SQL] Remove spark.sql.eagerAnalysis and O...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15316 Sure, will do it soon. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15495 Thank you! Will do it soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Incorrectly Set...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/15523 [SPARK-17981] [SPARK-17957] [SQL] Incorrectly Set Nullability to False in FilterExec ### What changes were proposed in this pull request? When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or users specified, we convert the nullability of the involved columns if the top-layer expression is null-intolerant. However, this is not correct, if the top-layer expression is not a leaf expression, it could still tolerate the null when it has null-tolerant child expressions. For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, but obviously` coalesce` is null-tolerant. Thus, it could eat null. When the nullability is wrong, we could generate incorrect results in different cases. For example, ```Scala val df1 = Seq((1, 2), (2, 3)).toDF("a", "b") val df2 = Seq((2, 5), (3, 4)).toDF("a", "c") val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0) val df3 = Seq((3, 1)).toDF("a", "d") joinedDf.join(df3, "a").show ``` The optimized plan is like ``` Project [a#29, b#30, c#31, d#42] +- Join Inner, (a#29 = a#41) :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31] : +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int)) : +- Join FullOuter, (a#5 = a#15) ::- LocalRelation [a#5, b#6] :+- LocalRelation [a#15, c#16] +- LocalRelation [a#41, d#42] ``` Without the fix, it returns an empty result. With the fix, it can return a correct answer: ``` +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ ``` ### How was this patch tested? Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark nullabilityFilterExec Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15523 commit 54c3cc849e91b9bedc6379e04edc9e23245be760 Author: gatorsmile Date: 2016-10-17T23:29:43Z fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15495 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15266: [SPARK-17693] [SQL] Fixed Insert Failure To Data Source ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15266 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15495: [SPARK-17620][SQL] Determine Serde by hive.default.filef...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15495 Merging to master! Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15523 cc @cloud-fan @davies @sameeragarwal --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15459: [SPARK-17409] [SQL] [FOLLOW-UP] Do Not Optimize Query in...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15459 @yhuai Any further comment about it? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15417: [SPARK-17851][SQL][TESTS] Make sure all test sqls in cat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15417 @jiangxb1987 Can you leave a comment on the PR changes to explain why you made these changes? You know, reviewing these changes is not easy. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15266: [SPARK-17693] [SQL] Fixed Insert Failure To Data Source ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15266 Any other comment? @cloud-fan Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.s...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/15529 [SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException ### What changes were proposed in this pull request? This PR is to backport the fix https://github.com/apache/spark/pull/15316 to 2.0. Dataset always does eager analysis now. Thus, `spark.sql.eagerAnalysis` is not used any more. Thus, we need to remove it. This PR also outputs the plan. Without the fix, the analysis error is like ``` cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12 ``` After the fix, the analysis error becomes: ``` org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12; 'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6] +- SubqueryAlias t +- Project [_1#2 AS k#5, _2#3 AS v#6] +- LocalRelation [_1#2, _2#3] ``` ### How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark eagerAnalysis20 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15529.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15529 commit ebc6f73f225dd008503b0df3b0e575f7f4696d08 Author: gatorsmile Date: 2016-10-18T06:53:34Z fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eage...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15529 cc @hvanhovell @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.sql.eage...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15529 Thanks! Close it now --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15529: [SPARK-17751] [SQL] [Backport-2.0] Remove spark.s...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/15529 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15546: [SPARK-17892][SQL] SQLBuilder should wrap the generated ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15546 Wrong JIRA number. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r83995272 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable.PriorityQueue +import scala.util.Random + +import org.apache.spark.internal.{config, Logging} +import org.apache.spark.SparkConf +import org.apache.spark.util.Utils + +/** Tracking the current state of the workers with available cores and assigned task list. */ +class OfferState(val workOffer: WorkerOffer) { + // The current remaining cores that can be allocated to tasks. + var coresAvailable: Int = workOffer.cores + // The list of tasks that are assigned to this worker. + val tasks = new ArrayBuffer[TaskDescription](coresAvailable) +} + +/** + * TaskAssigner is the base class for all task assigner implementations, and can be + * extended to implement different task scheduling algorithms. + * Together with [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]], TaskAssigner + * is used to assign tasks to workers with available cores. Internally, TaskScheduler, requested + * to perform task assignment given available workers, first sorts the candidate tasksets, + * and then for each taskset, it takes a number of rounds to request TaskAssigner for task + * assignment with different the locality restrictions until there is either no qualified + * workers or no valid tasks to be assigned. + * + * TaskAssigner is responsible to maintain the worker availability state and task assignment + * information. The contract between [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]] + * and TaskAssigner is as follows. First, TaskScheduler invokes construct() of TaskAssigner to + * initialize the its internal worker states at the beginning of resource offering. Before each + * round of task assignment for a taskset, TaskScheduler invoke the init() of TaskAssigner to + * initialize the data structure for the round. When performing real task assignment, + * hasNext()/getNext() is used by TaskScheduler to check the worker availability and retrieve + * current offering from TaskAssigner. Then offerAccepted is used by TaskScheduler to notify + * the TaskAssigner so that TaskAssigner can decide whether the current offer is valid or not for + * the next request. After task assignment is done, TaskScheduler invokes the tasks() to + * retrieve all the task assignment information, and eventually, invokes reset() method so that + * TaskAssigner can cleanup its internal maintained resources. + */ + +private[scheduler] abstract class TaskAssigner { + var offer: Seq[OfferState] = _ + var CPUS_PER_TASK = 1 + + def withCpuPerTask(CPUS_PER_TASK: Int): Unit = { +this.CPUS_PER_TASK = CPUS_PER_TASK + } + + // The final assigned offer returned to TaskScheduler. + final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks) + + // Invoked at the beginning of resource offering to construct the offer with the workoffers. + def construct(workOffer: Seq[WorkerOffer]): Unit = { +offer = workOffer.map(o => new OfferState(o)) + } + + // Invoked at each round of Taskset assignment to initialize the internal structure. + def init(): Unit + + // Whether there is offer available to be used inside of one round of Taskset assignment. + def hasNext: Boolean + + // Returned the next assigned offer based on the task assignment strategy. + def getNext(): OfferState + + // Invoked by the TaskScheduler to indicate whether the current offer is accepted or not so that + // the assigner can decide whether the current worker is valid for the next offering. + def offerAcc
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r83995415 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable.PriorityQueue +import scala.util.Random + +import org.apache.spark.internal.{config, Logging} +import org.apache.spark.SparkConf +import org.apache.spark.util.Utils + +/** Tracking the current state of the workers with available cores and assigned task list. */ +class OfferState(val workOffer: WorkerOffer) { + // The current remaining cores that can be allocated to tasks. + var coresAvailable: Int = workOffer.cores + // The list of tasks that are assigned to this worker. + val tasks = new ArrayBuffer[TaskDescription](coresAvailable) +} + +/** + * TaskAssigner is the base class for all task assigner implementations, and can be + * extended to implement different task scheduling algorithms. + * Together with [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]], TaskAssigner + * is used to assign tasks to workers with available cores. Internally, TaskScheduler, requested + * to perform task assignment given available workers, first sorts the candidate tasksets, + * and then for each taskset, it takes a number of rounds to request TaskAssigner for task + * assignment with different the locality restrictions until there is either no qualified + * workers or no valid tasks to be assigned. + * + * TaskAssigner is responsible to maintain the worker availability state and task assignment + * information. The contract between [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]] + * and TaskAssigner is as follows. First, TaskScheduler invokes construct() of TaskAssigner to + * initialize the its internal worker states at the beginning of resource offering. Before each + * round of task assignment for a taskset, TaskScheduler invoke the init() of TaskAssigner to + * initialize the data structure for the round. When performing real task assignment, + * hasNext()/getNext() is used by TaskScheduler to check the worker availability and retrieve + * current offering from TaskAssigner. Then offerAccepted is used by TaskScheduler to notify + * the TaskAssigner so that TaskAssigner can decide whether the current offer is valid or not for + * the next request. After task assignment is done, TaskScheduler invokes the tasks() to + * retrieve all the task assignment information, and eventually, invokes reset() method so that + * TaskAssigner can cleanup its internal maintained resources. + */ + +private[scheduler] abstract class TaskAssigner { + var offer: Seq[OfferState] = _ + var CPUS_PER_TASK = 1 + + def withCpuPerTask(CPUS_PER_TASK: Int): Unit = { +this.CPUS_PER_TASK = CPUS_PER_TASK + } + + // The final assigned offer returned to TaskScheduler. + final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks) + + // Invoked at the beginning of resource offering to construct the offer with the workoffers. + def construct(workOffer: Seq[WorkerOffer]): Unit = { +offer = workOffer.map(o => new OfferState(o)) + } + + // Invoked at each round of Taskset assignment to initialize the internal structure. + def init(): Unit + + // Whether there is offer available to be used inside of one round of Taskset assignment. + def hasNext: Boolean + + // Returned the next assigned offer based on the task assignment strategy. + def getNext(): OfferState + + // Invoked by the TaskScheduler to indicate whether the current offer is accepted or not so that + // the assigner can decide whether the current worker is valid for the next offering. + def offerAcc
[GitHub] spark pull request #15541: [SPARK-17637][Scheduler]Packed scheduling for Spa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15541#discussion_r83996232 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskAssigner.scala --- @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import scala.collection.mutable.ArrayBuffer +import scala.collection.mutable.PriorityQueue +import scala.util.Random + +import org.apache.spark.internal.{config, Logging} +import org.apache.spark.SparkConf +import org.apache.spark.util.Utils + +/** Tracking the current state of the workers with available cores and assigned task list. */ +class OfferState(val workOffer: WorkerOffer) { + // The current remaining cores that can be allocated to tasks. + var coresAvailable: Int = workOffer.cores + // The list of tasks that are assigned to this worker. + val tasks = new ArrayBuffer[TaskDescription](coresAvailable) +} + +/** + * TaskAssigner is the base class for all task assigner implementations, and can be + * extended to implement different task scheduling algorithms. + * Together with [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]], TaskAssigner + * is used to assign tasks to workers with available cores. Internally, TaskScheduler, requested + * to perform task assignment given available workers, first sorts the candidate tasksets, + * and then for each taskset, it takes a number of rounds to request TaskAssigner for task + * assignment with different the locality restrictions until there is either no qualified + * workers or no valid tasks to be assigned. + * + * TaskAssigner is responsible to maintain the worker availability state and task assignment + * information. The contract between [[org.apache.spark.scheduler.TaskScheduler TaskScheduler]] + * and TaskAssigner is as follows. First, TaskScheduler invokes construct() of TaskAssigner to + * initialize the its internal worker states at the beginning of resource offering. Before each + * round of task assignment for a taskset, TaskScheduler invoke the init() of TaskAssigner to + * initialize the data structure for the round. When performing real task assignment, + * hasNext()/getNext() is used by TaskScheduler to check the worker availability and retrieve + * current offering from TaskAssigner. Then offerAccepted is used by TaskScheduler to notify + * the TaskAssigner so that TaskAssigner can decide whether the current offer is valid or not for + * the next request. After task assignment is done, TaskScheduler invokes the tasks() to + * retrieve all the task assignment information, and eventually, invokes reset() method so that + * TaskAssigner can cleanup its internal maintained resources. + */ + +private[scheduler] abstract class TaskAssigner { + var offer: Seq[OfferState] = _ + var CPUS_PER_TASK = 1 + + def withCpuPerTask(CPUS_PER_TASK: Int): Unit = { +this.CPUS_PER_TASK = CPUS_PER_TASK + } + + // The final assigned offer returned to TaskScheduler. + final def tasks: Seq[ArrayBuffer[TaskDescription]] = offer.map(_.tasks) + + // Invoked at the beginning of resource offering to construct the offer with the workoffers. + def construct(workOffer: Seq[WorkerOffer]): Unit = { +offer = workOffer.map(o => new OfferState(o)) + } + + // Invoked at each round of Taskset assignment to initialize the internal structure. + def init(): Unit + + // Whether there is offer available to be used inside of one round of Taskset assignment. + def hasNext: Boolean + + // Returned the next assigned offer based on the task assignment strategy. + def getNext(): OfferState + + // Invoked by the TaskScheduler to indicate whether the current offer is accepted or not so that + // the assigner can decide whether the current worker is valid for the next offering. + def offerAcc
[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15523#discussion_r84008907 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: SparkPlan) // Split out all the IsNotNulls from condition. private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition { -case IsNotNull(a: NullIntolerant) if a.references.subsetOf(child.outputSet) => true +case IsNotNull(a) => isNullIntolerant(a) && a.references.subsetOf(child.outputSet) +case _ => false + } + + // One expression is null intolerant iff it and its children are null intolerant + private def isNullIntolerant(expr: Expression): Boolean = expr match { +case e: NullIntolerant => + if (e.isInstanceOf[LeafExpression]) true else e.children.forall(isNullIntolerant) --- End diff -- Just realized the original code was from your PR. Then, in your above code, why you still need to keep `a.references.subsetOf(child.outputSet)`? It looks confusing to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15523#discussion_r84010709 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: SparkPlan) // Split out all the IsNotNulls from condition. private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition { -case IsNotNull(a: NullIntolerant) if a.references.subsetOf(child.outputSet) => true +case IsNotNull(a) => isNullIntolerant(a) && a.references.subsetOf(child.outputSet) +case _ => false + } + + // One expression is null intolerant iff it and its children are null intolerant + private def isNullIntolerant(expr: Expression): Boolean = expr match { +case e: NullIntolerant => + if (e.isInstanceOf[LeafExpression]) true else e.children.forall(isNullIntolerant) --- End diff -- Could you show me an example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect N...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15523#discussion_r84014760 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -87,7 +87,14 @@ case class FilterExec(condition: Expression, child: SparkPlan) // Split out all the IsNotNulls from condition. private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition { -case IsNotNull(a: NullIntolerant) if a.references.subsetOf(child.outputSet) => true +case IsNotNull(a) => isNullIntolerant(a) && a.references.subsetOf(child.outputSet) +case _ => false + } + + // One expression is null intolerant iff it and its children are null intolerant + private def isNullIntolerant(expr: Expression): Boolean = expr match { +case e: NullIntolerant => + if (e.isInstanceOf[LeafExpression]) true else e.children.forall(isNullIntolerant) --- End diff -- uh, I see. First, we definitely need test cases to cover each positive and negative scenario. Previously, we did not have any test case to check the validity of nullability changes. Second, the code needs more comments when the variable/function names are not able to explain the codes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15523 @viirya I have not changed the algorithm. I just tried to improve the test case coverage. Thanks to `constructIsNotNullConstraints`, the existing solution already covers all the cases, right? Can you help me check anything scenario is missing? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15382 cc @yhuai --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15523: [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullabil...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15523 The parm name of the verification function is wrong. It should be `expectedNonNullableColumns`. Please check the test case again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15363 See [the list of paper that cited K. Ono](https://scholar.google.com/scholar?start=0&hl=en&as_sdt=0,5&sciodt=0,5&cites=5144610119819043766&scipsc=) of `Measuring the Complexity of Join Enumeration in Query Optimization`. There are a lot of good references we can investigate. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106298859 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -51,6 +51,11 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { val (items, conditions) = extractInnerJoins(plan) +// Find the star schema joins. Currently, it returns the star join with the largest +// fact table. In the future, it can return more than one star join (e.g. F1-D1-D2 +// and F2-D3-D4). +val starJoinPlans = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq) --- End diff -- See [the list of paper that cited K. Ono](https://scholar.google.com/scholar?start=0&hl=en&as_sdt=0,5&sciodt=0,5&cites=5144610119819043766&scipsc=) of `Measuring the Complexity of Join Enumeration in Query Optimization`. There are a lot of good references we can investigate. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17270: [SPARK-19929] [SQL] Showing Hive Managed table's ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17270#discussion_r106313999 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -910,7 +910,7 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman } } -if (metadata.tableType == EXTERNAL) { +if (metadata.tableType == EXTERNAL || metadata.tableType == MANAGED) { --- End diff -- > SHOW CREATE TABLE shows the CREATE TABLE statement that creates a given table, or the CREATE VIEW statement that creates a given view. If we add the location clause for managed tables, CREATE TABLE will create an external table instead of a managed table. That is not what we want. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17270: [SPARK-19929] [SQL] Showing Hive Managed table's LOATION...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17270 If you want to know hive table's location, you can use the command `DESCRIBE EXTENDED` , right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17191 Could you please do a check which systems support it and which systems disallow it? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17289 We introduced a behavior change in Spark 2.2. In Spark 2.1, we reported an error if the underlying JDBC table exists. We changed [the mode to `SaveMode.Overwrite`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L159) if the table does not exist in the catalog. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17289 ```Scala test("saveAsTable API with SaveMode.Overwrite") { val df = spark.createDataFrame(sparkContext.parallelize(arr1x2), schema2) spark.read.jdbc(url1, "test.people", properties).show() df.write.format("jdbc").mode(SaveMode.ErrorIfExists) .option("url", url1) .option("dbtable", "test.people") .options(properties.asScala) .saveAsTable("j1") spark.read.jdbc(url1, "test.people", properties).show() } ``` This is a test case I used. Previously, we respected the user-specified mode `SaveMode.ErrorIfExists`. Now, we are not sending the[ mode to the _createRelation_ API ](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L181). It might be an unexpected behavior change to the external data source connector. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106333835 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalSessionCatalogSuite.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import org.apache.spark.sql.catalyst.catalog.{CatalogTestUtils, ExternalCatalog, SessionCatalogSuite} +import org.apache.spark.sql.hive.test.TestHiveSingleton + +class HiveExternalSessionCatalogSuite extends SessionCatalogSuite with TestHiveSingleton { + + protected override val isHiveExternalCatalog = true + + private val externalCatalog = { +val catalog = spark.sharedState.externalCatalog +catalog.asInstanceOf[HiveExternalCatalog].client.reset() +catalog + } + + protected val utils = new CatalogTestUtils { +override val tableInputFormat: String = "org.apache.hadoop.mapred.SequenceFileInputFormat" +override val tableOutputFormat: String = + "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat" +override val defaultProvider: String = "parquet" --- End diff -- The above input and output formats does not match what you specified here. Let us change it to `hive` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106334485 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -1270,6 +1376,7 @@ class SessionCatalogSuite extends PlanTest { } assert(cause.getMessage.contains("Undefined function: 'undefined_fn'")) + catalog.reset() --- End diff -- Instead of adding `reset`, why not using your new function `withBasicCatalog`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106334626 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest { expectedParts: CatalogTablePartition*): Boolean = { // ExternalCatalog may set a default location for partitions, here we ignore the partition // location when comparing them. -actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet == - expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet +val actualPartsNormalize = actualParts.map(p => + p.copy(parameters = Map.empty, storage = p.storage.copy( +properties = Map.empty, locationUri = None, serde = None))).toSet + +val expectedPartsNormalize = expectedParts.map(p => +p.copy(parameters = Map.empty, storage = p.storage.copy( + properties = Map.empty, locationUri = None, serde = None))).toSet + +actualPartsNormalize == expectedPartsNormalize +//actualParts.map(p => +// p.copy(storage = p.storage.copy( +//properties = Map.empty, locationUri = None))).toSet == +// expectedParts.map(p => +//p.copy(storage = p.storage.copy(properties = Map.empty, locationUri = None))).toSet --- End diff -- ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106334827 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -999,257 +1094,279 @@ class SessionCatalogSuite extends PlanTest { expectedParts: CatalogTablePartition*): Boolean = { // ExternalCatalog may set a default location for partitions, here we ignore the partition // location when comparing them. -actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet == - expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet +val actualPartsNormalize = actualParts.map(p => --- End diff -- Because Hive metastore fills the values after we call the Hive APIs? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106334939 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest { expectedParts: CatalogTablePartition*): Boolean = { // ExternalCatalog may set a default location for partitions, here we ignore the partition // location when comparing them. -actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet == - expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet +val actualPartsNormalize = actualParts.map(p => --- End diff -- Because Hive metastore fills the values after we calling the Hive APIs? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16971 ping @zhengruifeng --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16971: [SPARK-19573][SQL] Make NaN/null handling consist...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16971#discussion_r106335040 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala --- @@ -245,7 +245,7 @@ object ApproximatePercentile { val result = new Array[Double](percentages.length) var i = 0 while (i < percentages.length) { - result(i) = summaries.query(percentages(i)) + result(i) = summaries.query(percentages(i)).get --- End diff -- Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16981: [SPARK-19637][SQL] Add to_json in FunctionRegistry
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16981 cc @maropu https://github.com/apache/spark/pull/17171 is merged. Are you interested in working on `from_json`? JIRA: https://issues.apache.org/jira/browse/SPARK-19967 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17289: [SPARK-19948] Document that saveAsTable uses catalog as ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17289 This is the design decision we need to make here. Spark SQL is kind of a federation system. Two write APIs behave differently. The `saveAsTable` API expects users to register it in the **global catalog** before usage. The `save` API skips the global catalog and relies on the connectors to communicate with the **local catalog**. The users might not realize the difference. ```Scala df.write.format("xyz").mode(SaveMode.ErrorIfExists) .saveAsTable("j1") ``` ```Scala df.write.format("xyz").mode(SaveMode.ErrorIfExists) .save() ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16209 @sureshthalamati https://github.com/apache/spark/pull/17171 has been resolved. Can you update your PR by allowing users to specify the schema in DDL format? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106340219 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -999,257 +1083,279 @@ class SessionCatalogSuite extends PlanTest { expectedParts: CatalogTablePartition*): Boolean = { // ExternalCatalog may set a default location for partitions, here we ignore the partition // location when comparing them. -actualParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet == - expectedParts.map(p => p.copy(storage = p.storage.copy(locationUri = None))).toSet +val actualPartsNormalize = actualParts.map(p => --- End diff -- You need to leave a comment to explain it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106340900 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -1270,6 +1376,7 @@ class SessionCatalogSuite extends PlanTest { } assert(cause.getMessage.contains("Undefined function: 'undefined_fn'")) + catalog.reset() --- End diff -- Then, this `reset()` could be skipped if hitting an exception. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106341181 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -696,6 +696,13 @@ object SQLConf { .intConf .createWithDefault(12) + val JOIN_REORDER_CARD_WEIGHT = +buildConf("spark.sql.cbo.joinReorder.card.weight") + .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " + +"rows * weight + size * (1 - weight).") + .doubleConf + .createWithDefault(0.7) --- End diff -- What is boundary of this? adding `check`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106341499 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,78 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +// If an exception is thrown here we can just assume the table is uncached; +// this can happen with Hive tables when the underlying catalog is in-memory. +val wasCached = Try(sparkSession.catalog.isCached(table.unquotedString)).getOrElse(false) +if (wasCached) { + try { +sparkSession.catalog.uncacheTable(table.unquotedString) + } catch { +case NonFatal(e) => log.warn(e.toString, e) + } +} --- End diff -- No need to check if it is cached or not. Just uncache it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106341966 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1860,4 +1860,72 @@ class HiveDDLSuite } } } + + Seq("PARQUET", "ORC", "TEXTFILE", "SEQUENCEFILE", "RCFILE", "AVRO").foreach { tableType => --- End diff -- If the list is complete, we can create a variable and reuse it in the future test cases in `HiveCatalogedDDLSuite `. Let us create it now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106342123 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1860,4 +1860,72 @@ class HiveDDLSuite } } } + + Seq("PARQUET", "ORC", "TEXTFILE", "SEQUENCEFILE", "RCFILE", "AVRO").foreach { tableType => +test(s"alter hive serde table add columns -- partitioned - $tableType") { + withTable("alter_add_partitioned") { --- End diff -- The name is confusing. Let us just simplify it to `tab`. We already can know the scenario by the test case name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106342233 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala --- @@ -71,7 +71,6 @@ class JDBCSuite extends SparkFunSuite conn.prepareStatement("insert into test.people values ('mary', 2)").executeUpdate() conn.prepareStatement( "insert into test.people values ('joe ''foo'' \"bar\"', 3)").executeUpdate() -conn.commit() --- End diff -- Why? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16626 Could you add a scenario when users add a column name that already exists in the table schema? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17270: [SPARK-19929] [SQL] Showing Hive Managed table's LOATION...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17270 Yeah. you need to close the PR by yourself. We are unable to close it. thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16672 To backport https://github.com/apache/spark/pull/17097, we need to backport multiple PRs. This is one of it. @windpiger Could you please submit a PR to backport it to Spark 2.1? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106465419 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,87 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.unquotedString) --- End diff -- `table.quotedString` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106467818 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,87 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.unquotedString) +} catch { + case NonFatal(e) => +log.warn(s"Exception when attempting to uncache table ${table.unquotedString}", e) +} + +// Invalidate the table last, otherwise uncaching the table would load the logical plan +// back into the hive metastore cache +catalog.refreshTable(table) +val partitionFields = catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length) +val newSchemaFields = catalogTable.schema + .take(catalogTable.schema.length - catalogTable.partitionColumnNames.length) ++ + columns ++ partitionFields +checkDuplication(newSchemaFields.map(_.name)) +catalog.alterTableSchema(table, newSchema = + catalogTable.schema.copy(fields = newSchemaFields.toArray)) + +Seq.empty[Row] + } + + private def checkDuplication(colNames: Seq[String]): Unit = { +if (colNames.distinct.length != colNames.length) { + val duplicateColumns = colNames.groupBy(identity).collect { --- End diff -- This does not consider the case sensitivity --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106472274 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,87 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.unquotedString) +} catch { + case NonFatal(e) => +log.warn(s"Exception when attempting to uncache table ${table.unquotedString}", e) +} + +// Invalidate the table last, otherwise uncaching the table would load the logical plan +// back into the hive metastore cache +catalog.refreshTable(table) +val partitionFields = catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length) +val newSchemaFields = catalogTable.schema + .take(catalogTable.schema.length - catalogTable.partitionColumnNames.length) ++ + columns ++ partitionFields +checkDuplication(newSchemaFields.map(_.name)) +catalog.alterTableSchema(table, newSchema = + catalogTable.schema.copy(fields = newSchemaFields.toArray)) + +Seq.empty[Row] + } + + private def checkDuplication(colNames: Seq[String]): Unit = { +if (colNames.distinct.length != colNames.length) { + val duplicateColumns = colNames.groupBy(identity).collect { +case (x, ys) if ys.length > 1 => x + } + throw new AnalysisException( +s"Found duplicate column(s): ${duplicateColumns.mkString(", ")}") +} + } + + /** + * ALTER TABLE ADD COLUMNS command does not support temporary view/table, + * view, or datasource table with text, orc formats or external provider. + * For datasource table, it currently only supports parquet, json, csv. + */ + private def verifyAlterTableAddColumn( + catalog: SessionCatalog, + table: TableIdentifier): CatalogTable = { +val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table) + +if (catalogTable.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s"${table.toString} is a VIEW, which does not support ALTER ADD COLUMNS.") --- End diff -- How about? > ALTER ADD COLUMNS does not support views. You must drop and re-create the views for adding the new columns. Views: $table. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106472336 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,87 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.unquotedString) +} catch { + case NonFatal(e) => +log.warn(s"Exception when attempting to uncache table ${table.unquotedString}", e) +} + +// Invalidate the table last, otherwise uncaching the table would load the logical plan +// back into the hive metastore cache +catalog.refreshTable(table) +val partitionFields = catalogTable.schema.takeRight(catalogTable.partitionColumnNames.length) +val newSchemaFields = catalogTable.schema + .take(catalogTable.schema.length - catalogTable.partitionColumnNames.length) ++ + columns ++ partitionFields +checkDuplication(newSchemaFields.map(_.name)) +catalog.alterTableSchema(table, newSchema = + catalogTable.schema.copy(fields = newSchemaFields.toArray)) + +Seq.empty[Row] + } + + private def checkDuplication(colNames: Seq[String]): Unit = { +if (colNames.distinct.length != colNames.length) { + val duplicateColumns = colNames.groupBy(identity).collect { +case (x, ys) if ys.length > 1 => x + } + throw new AnalysisException( +s"Found duplicate column(s): ${duplicateColumns.mkString(", ")}") +} + } + + /** + * ALTER TABLE ADD COLUMNS command does not support temporary view/table, + * view, or datasource table with text, orc formats or external provider. + * For datasource table, it currently only supports parquet, json, csv. + */ + private def verifyAlterTableAddColumn( + catalog: SessionCatalog, + table: TableIdentifier): CatalogTable = { +val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table) + +if (catalogTable.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s"${table.toString} is a VIEW, which does not support ALTER ADD COLUMNS.") +} + +if (DDLUtils.isDatasourceTable(catalogTable)) { + DataSource.lookupDataSource(catalogTable.provider.get).newInstance() match { +// For datasource table, this command can only support the following File format. +// TextFileFormat only default to one column "value" +// OrcFileFormat can not handle difference between user-specified schema and +// inferred schema yet. TODO, once this issue is resolved , we can add Orc back. +// Hive type is already considered as hive serde table, so the logic will not +// come in here. +case _: JsonFileFormat | _: CSVFileFormat | _: ParquetFileFormat => +case s => + throw new AnalysisException( +s"Datasource table $table with type $s, which does not support ALTER ADD COLUMNS.") --- End diff -- The same here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16626 Please also add a test case: ALTER TABLE does not affect any view that references the table being altered. Also includes the views that have an "*" in their SELECT list. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17317: [SPARK-19329][SQL][BRANCH-2.1]Reading from or writing to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17317 Thanks! Merging to 2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17317: [SPARK-19329][SQL][BRANCH-2.1]Reading from or writing to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17317 @windpiger Could you please close it? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-197...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/17319 [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] Backport Three Cache-related PRs to Spark 2.1 ### What changes were proposed in this pull request? Backport a few cache related PRs: --- [[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](https://github.com/apache/spark/pull/16493) Consider the plans inside subquery expressions while looking up cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan. --- [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](https://github.com/apache/spark/pull/17064) Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any), that contain the given data source path. However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678. --- [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](https://github.com/apache/spark/pull/17097) When un-cache a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable` This PR also includes some refactors: - use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating - rename invalidateCache to recacheByPlan, which is more obvious about what it does. ### How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark backport-17097 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17319.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17319 commit 3f1895f315ef357ec9f9201748d760293deb4f88 Author: Xiao Li Date: 2017-03-16T07:36:35Z fix. commit 11a8f31d5954c14eb8e546d001688f93357da676 Author: Xiao Li Date: 2017-03-16T17:35:21Z Merge remote-tracking branch 'upstream/branch-2.1' into backport-17097 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17191 We hit this issue multiple times. Although it looks not right to support it for the users of major enterprise RDBMS, two popular open source RDBMS PostgreSQL and MySQL support it. This is the design decision we need to make. Should we keep the previous decision? @rxin @marmbrus @hvanhovell @cloud-fan Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17287: [SPARK-19945][SQL]add test suite for SessionCatalog with...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17287 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17287: [SPARK-19945][SQL]add test suite for SessionCatalog with...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17287 Thanks! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BAC...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17319 cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17320#discussion_r106550372 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -634,7 +661,12 @@ case class StructToJson( override def inputTypes: Seq[AbstractDataType] = StructType :: Nil } -object StructToJson { +object JsonExprUtils { + + def validateSchemaLiteral(exp: Expression): StructType = exp match { +case Literal(s, StringType) => CatalystSqlParser.parseTableSchema(s.toString) +case e => throw new AnalysisException(s"Must be a string literal, but: $e") --- End diff -- How about? > Expected a string literal instead of $e --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17320#discussion_r106550986 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -202,12 +202,12 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { val df1 = Seq(Tuple1(Tuple1(1))).toDF("a") checkAnswer( df1.selectExpr("to_json(a)"), - Row("""{"_1":1}""") :: Nil) + Row( """{"_1":1}""") :: Nil) val df2 = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0".toDF("a") checkAnswer( df2.selectExpr("to_json(a, map('timestampFormat', 'dd/MM/ HH:mm'))"), - Row("""{"_1":"26/08/2015 18:00"}""") :: Nil) + Row( """{"_1":"26/08/2015 18:00"}""") :: Nil) --- End diff -- revert it back --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17320: [SPARK-19967][SQL] Add from_json in FunctionRegis...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17320#discussion_r106550987 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -202,12 +202,12 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { val df1 = Seq(Tuple1(Tuple1(1))).toDF("a") checkAnswer( df1.selectExpr("to_json(a)"), - Row("""{"_1":1}""") :: Nil) + Row( """{"_1":1}""") :: Nil) --- End diff -- revert it back --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17320 All the valid examples are using a single column. Could you also add a test case to verify the schema having multiple columns? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16330 The code changes are now very specific to R. Let me know if you still need me. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106562261 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -167,8 +167,8 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { : (Seq[(LogicalPlan, InnerLike)], Seq[Expression]) = plan match { case Join(left, right, joinType: InnerLike, cond) => val (plans, conditions) = flattenJoin(left, joinType) - (plans ++ Seq((right, joinType)), conditions ++ cond.toSeq) - + (plans ++ Seq((right, joinType)), conditions ++ --- End diff -- Nit: remove this heading space. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15363 It sounds like all the comments have been addressed. LGTM except one minor comment. cc @sameeragarwal @cloud-fan @hvanhovell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17322: [SPARK-19987][SQL] Pass all filters into FileIndex
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17322 LGTM pending Jenkins --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17315#discussion_r106566183 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -391,9 +288,9 @@ class JacksonParser( case token => // We cannot parse this token based on the given data type. So, we throw a - // SparkSQLJsonProcessingException and this exception will be caught by + // SparkSQLRuntimeException and this exception will be caught by --- End diff -- `RuntimeException `? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17322: [SPARK-19987][SQL] Pass all filters into FileIndex
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17322 Thanks! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17315#discussion_r106570171 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -65,7 +65,7 @@ private[sql] class JSONOptions( val allowBackslashEscapingAnyCharacter = parameters.get("allowBackslashEscapingAnyCharacter").map(_.toBoolean).getOrElse(false) val compressionCodec = parameters.get("compression").map(CompressionCodecs.getCodecClassName) - private val parseMode = parameters.getOrElse("mode", "PERMISSIVE") + val parseMode = parameters.getOrElse("mode", "PERMISSIVE") --- End diff -- How about creating an enum, like what we are doing for `SaveMode`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106570569 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -27,41 +27,67 @@ import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.apache.spark.sql.catalyst.plans.PlanTest import org.apache.spark.sql.catalyst.plans.logical.{Range, SubqueryAlias, View} +class InMemorySessionCatalogSuite extends SessionCatalogSuite { + protected val utils = new CatalogTestUtils { +override val tableInputFormat: String = "com.fruit.eyephone.CameraInputFormat" +override val tableOutputFormat: String = "com.fruit.eyephone.CameraOutputFormat" +override val defaultProvider: String = "parquet" +override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog + } +} + /** - * Tests for [[SessionCatalog]] that assume that [[InMemoryCatalog]] is correctly implemented. + * Tests for [[SessionCatalog]] * * Note: many of the methods here are very similar to the ones in [[ExternalCatalogSuite]]. * This is because [[SessionCatalog]] and [[ExternalCatalog]] share many similar method * signatures but do not extend a common parent. This is largely by design but * unfortunately leads to very similar test code in two places. */ -class SessionCatalogSuite extends PlanTest { - private val utils = new CatalogTestUtils { -override val tableInputFormat: String = "com.fruit.eyephone.CameraInputFormat" -override val tableOutputFormat: String = "com.fruit.eyephone.CameraOutputFormat" -override val defaultProvider: String = "parquet" -override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog - } +abstract class SessionCatalogSuite extends PlanTest { + protected val utils: CatalogTestUtils + + protected val isHiveExternalCatalog = false import utils._ + private def withBasicCatalog(f: SessionCatalog => Unit): Unit = { +val catalog = new SessionCatalog(newBasicCatalog()) +catalog.createDatabase(newDb("default"), ignoreIfExists = true) +try { + f(catalog) +} finally { + catalog.reset() +} + } + + private def withEmptyCatalog(f: SessionCatalog => Unit): Unit = { +val catalog = new SessionCatalog(newEmptyCatalog()) +catalog.createDatabase(newDb("default"), ignoreIfExists = true) +try { + f(catalog) +} finally { + catalog.reset() +} + } // -- // Databases // -- test("basic create and list databases") { -val catalog = new SessionCatalog(newEmptyCatalog()) -catalog.createDatabase(newDb("default"), ignoreIfExists = true) -assert(catalog.databaseExists("default")) -assert(!catalog.databaseExists("testing")) -assert(!catalog.databaseExists("testing2")) -catalog.createDatabase(newDb("testing"), ignoreIfExists = false) -assert(catalog.databaseExists("testing")) -assert(catalog.listDatabases().toSet == Set("default", "testing")) -catalog.createDatabase(newDb("testing2"), ignoreIfExists = false) -assert(catalog.listDatabases().toSet == Set("default", "testing", "testing2")) -assert(catalog.databaseExists("testing2")) -assert(!catalog.databaseExists("does_not_exist")) +withEmptyCatalog { catalog => + catalog.createDatabase(newDb("default"), ignoreIfExists = true) --- End diff -- This is a basic test case to test `create and list databases`. Thus, it will be good to create a database without the existing one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17319: [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-197...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/17319 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17320 LGTM cc @ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17191: [SPARK-14471][SQL] Aliases in SELECT could be use...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17191#discussion_r106717092 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2598,4 +2598,26 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { } assert(!jobStarted.get(), "Command should not trigger a Spark job.") } + + test("SPARK-14471 When groupByAliasesEnabled=true, aliases in SELECT could exist in GROUP BY") { +withSQLConf(SQLConf.GROUP_BY_ALIASES_ENABLED.key -> "true") { --- End diff -- Since you test the mixed case, you also need to enable `GROUP_BY_ORDINAL` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17320 Thanks! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17191 What is the behaviors of MySQL and Postgres when we use digits as alias? ```SQL SELECT k1 AS `2`, k2 AS a, SUM(v) FROM t GROUP BY 2, k2 ``` You might need to replace backticks by the other symbols for quoting the column names --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17240: [SPARK-19915][SQL] Improve join reorder: simplify...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17240#discussion_r106772853 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -36,27 +36,24 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi if (!conf.cboEnabled || !conf.joinReorderEnabled) { plan } else { - val result = plan transform { -case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) => - reorder(p, p.outputSet) -case j @ Join(_, _, _: InnerLike, _) => - reorder(j, j.outputSet) + val result = plan transformDown { +case j @ Join(_, _, _: InnerLike, _) => reorder(j) } // After reordering is finished, convert OrderedJoin back to Join - result transform { + result transformDown { case oj: OrderedJoin => oj.join } } } - def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { + def reorder(plan: LogicalPlan): LogicalPlan = { --- End diff -- private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17138#discussion_r106773013 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -0,0 +1,297 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike} +import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule + + +/** + * Cost-based join reorder. + * We may have several join reorder algorithms in the future. This class is the entry of these + * algorithms, and chooses which one to use. + */ +case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper { + def apply(plan: LogicalPlan): LogicalPlan = { +if (!conf.cboEnabled || !conf.joinReorderEnabled) { + plan +} else { + val result = plan transform { +case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) => + reorder(p, p.outputSet) +case j @ Join(_, _, _: InnerLike, _) => + reorder(j, j.outputSet) + } + // After reordering is finished, convert OrderedJoin back to Join + result transform { +case oj: OrderedJoin => oj.join + } +} + } + + def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { +val (items, conditions) = extractInnerJoins(plan) +val result = + // Do reordering if the number of items is appropriate and join conditions exist. + // We also need to check if costs of all items can be evaluated. + if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && conditions.nonEmpty && + items.forall(_.stats(conf).rowCount.isDefined)) { +JoinReorderDP.search(conf, items, conditions, output).getOrElse(plan) + } else { +plan + } +// Set consecutive join nodes ordered. +replaceWithOrderedJoin(result) + } + + /** + * Extract consecutive inner joinable items and join conditions. --- End diff -- How about `Extracts the join conditions and sub-plans of consecutive inner joins` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17138#discussion_r106773023 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -0,0 +1,297 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike} +import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule + + +/** + * Cost-based join reorder. + * We may have several join reorder algorithms in the future. This class is the entry of these + * algorithms, and chooses which one to use. + */ +case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper { + def apply(plan: LogicalPlan): LogicalPlan = { +if (!conf.cboEnabled || !conf.joinReorderEnabled) { + plan +} else { + val result = plan transform { +case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) => + reorder(p, p.outputSet) +case j @ Join(_, _, _: InnerLike, _) => + reorder(j, j.outputSet) + } + // After reordering is finished, convert OrderedJoin back to Join + result transform { +case oj: OrderedJoin => oj.join + } +} + } + + def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { +val (items, conditions) = extractInnerJoins(plan) --- End diff -- Nit: `items` -> `subplans`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17138#discussion_r106773079 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -0,0 +1,297 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike} +import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule + + +/** + * Cost-based join reorder. + * We may have several join reorder algorithms in the future. This class is the entry of these + * algorithms, and chooses which one to use. + */ +case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper { + def apply(plan: LogicalPlan): LogicalPlan = { +if (!conf.cboEnabled || !conf.joinReorderEnabled) { + plan +} else { + val result = plan transform { +case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) => + reorder(p, p.outputSet) +case j @ Join(_, _, _: InnerLike, _) => + reorder(j, j.outputSet) + } + // After reordering is finished, convert OrderedJoin back to Join + result transform { +case oj: OrderedJoin => oj.join + } +} + } + + def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { +val (items, conditions) = extractInnerJoins(plan) +val result = + // Do reordering if the number of items is appropriate and join conditions exist. + // We also need to check if costs of all items can be evaluated. + if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && conditions.nonEmpty && + items.forall(_.stats(conf).rowCount.isDefined)) { +JoinReorderDP.search(conf, items, conditions, output).getOrElse(plan) + } else { +plan + } +// Set consecutive join nodes ordered. +replaceWithOrderedJoin(result) + } + + /** + * Extract consecutive inner joinable items and join conditions. + * This method works for bushy trees and left/right deep trees. + */ + private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], Set[Expression]) = { +plan match { + case Join(left, right, _: InnerLike, cond) => +val (leftPlans, leftConditions) = extractInnerJoins(left) +val (rightPlans, rightConditions) = extractInnerJoins(right) +(leftPlans ++ rightPlans, cond.toSet.flatMap(splitConjunctivePredicates) ++ + leftConditions ++ rightConditions) + case Project(projectList, join) if projectList.forall(_.isInstanceOf[Attribute]) => +extractInnerJoins(join) + case _ => +(Seq(plan), Set()) +} + } + + private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = plan match { +case j @ Join(left, right, _: InnerLike, cond) => --- End diff -- Nit: `cond` -> `_` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17138#discussion_r106774778 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -0,0 +1,297 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeSet, Expression, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike} +import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule + + +/** + * Cost-based join reorder. + * We may have several join reorder algorithms in the future. This class is the entry of these + * algorithms, and chooses which one to use. + */ +case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper { + def apply(plan: LogicalPlan): LogicalPlan = { +if (!conf.cboEnabled || !conf.joinReorderEnabled) { + plan +} else { + val result = plan transform { +case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) => + reorder(p, p.outputSet) +case j @ Join(_, _, _: InnerLike, _) => + reorder(j, j.outputSet) + } + // After reordering is finished, convert OrderedJoin back to Join + result transform { +case oj: OrderedJoin => oj.join + } +} + } + + def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = { +val (items, conditions) = extractInnerJoins(plan) +val result = + // Do reordering if the number of items is appropriate and join conditions exist. + // We also need to check if costs of all items can be evaluated. + if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && conditions.nonEmpty && + items.forall(_.stats(conf).rowCount.isDefined)) { +JoinReorderDP.search(conf, items, conditions, output).getOrElse(plan) + } else { +plan + } +// Set consecutive join nodes ordered. +replaceWithOrderedJoin(result) + } + + /** + * Extract consecutive inner joinable items and join conditions. + * This method works for bushy trees and left/right deep trees. + */ + private def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], Set[Expression]) = { +plan match { + case Join(left, right, _: InnerLike, cond) => +val (leftPlans, leftConditions) = extractInnerJoins(left) +val (rightPlans, rightConditions) = extractInnerJoins(right) +(leftPlans ++ rightPlans, cond.toSet.flatMap(splitConjunctivePredicates) ++ + leftConditions ++ rightConditions) + case Project(projectList, join) if projectList.forall(_.isInstanceOf[Attribute]) => +extractInnerJoins(join) + case _ => +(Seq(plan), Set()) +} + } + + private def replaceWithOrderedJoin(plan: LogicalPlan): LogicalPlan = plan match { +case j @ Join(left, right, _: InnerLike, cond) => + val replacedLeft = replaceWithOrderedJoin(left) + val replacedRight = replaceWithOrderedJoin(right) + OrderedJoin(j.copy(left = replacedLeft, right = replacedRight)) +case p @ Project(_, join) => + p.copy(child = replaceWithOrderedJoin(join)) +case _ => + plan + } + + /** This is a wrapper class for a join node that has been ordered. */ + private case class OrderedJoin(join: Join) extends BinaryNode { +override def left: LogicalPlan = join.left +override def right: LogicalPlan = join.right +override def output: Se