[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/14132 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72182414 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- Oh, I see. That's the reason why I can not find the clear logic. Thank you so much, @yhuai ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72181762 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- You probably want to look at older versions of Hive. For example, 0.10. After https://issues.apache.org/jira/browse/HIVE-3784, Hive does not really use map join hint. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72179412 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- cc @yhuai, any idea how does hive handle sql hints? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72178111 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- For your case, what I understand is the DFS should be used. I've search Hive codebase until now, but I couldn't find the clear logic until now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72176938 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- yea, for this example, DFS is right and BFS is wrong. But how about the case I mentioned above? I think we don't need to traverse the plan tree recursively, just find the sub-tree of relation part, flatten the join node if any, and then iterate these relations, wrap it with broadcast hint if it's a table and table name matches the hint argument. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72114534 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- Since `JOIN` is binary operation, **closest** does not mean BFS. Please see the following example. We don't think that the hint will match the fourth `t1`. If that term is ambiguous, I will change that into the first one in DFS. ``` scala> sql("select * from t1 JOIN t1 JOIN t1 JOIN t1").explain == Physical Plan == CartesianProduct :- CartesianProduct : :- CartesianProduct : : :- *Range (0, 1, splits=8) : : +- *Range (0, 1, splits=8) : +- *Range (0, 1, splits=8) +- *Range (0, 1, splits=8) scala> sql("select /*+ MAPJOIN(t1) */ * from t1 JOIN t1 JOIN t1 JOIN t1").explain == Physical Plan == CartesianProduct :- CartesianProduct : :- BroadcastNestedLoopJoin BuildLeft, Inner, true : : :- BroadcastExchange IdentityBroadcastMode : : : +- *Range (0, 1, splits=8) : : +- *Range (0, 1, splits=8) : +- *Range (0, 1, splits=8) +- *Range (0, 1, splits=8) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72052381 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- and I think BFS is the one that matches the semantic "closest". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r7205 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- `SELECT /*+ MAPJOIN(a) */ FROM (SELECT * FROM a) x JOIN a` If we do DFS, we will add the hint to the table `a` insode subquery, not the table `a` after JOIN. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72040705 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -347,6 +347,16 @@ querySpecification windows?) ; +hint +: '/*+' hintStatement '*/' --- End diff -- Yes. Currently, it does. Should we expand this, too? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72040281 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -347,6 +347,16 @@ querySpecification windows?) ; +hint +: '/*+' hintStatement '*/' --- End diff -- I might have missed something, but do we only support one hint? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72040264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- I'll check again this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72038785 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- I think DFS is right for this match. Do you think BFS? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72037553 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (table <- parameters) { +var stop = false +resolvedChild = resolvedChild.transformDown { --- End diff -- Does this really work? `transformDown` is DFS I think --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72036822 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -372,6 +372,10 @@ trait CheckAnalysis extends PredicateHelper { |in operator ${operator.simpleString} """.stripMargin) + case Hint(_, _, _) => +throw new IllegalStateException( + "logical hint operator should have been removed by the optimizer") --- End diff -- Oh! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72036883 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => --- End diff -- Sure. I see. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72036607 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => --- End diff -- we can make it a constant value(use set instead of seq) in this rule. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72036166 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -372,6 +372,10 @@ trait CheckAnalysis extends PredicateHelper { |in operator ${operator.simpleString} """.stripMargin) + case Hint(_, _, _) => +throw new IllegalStateException( + "logical hint operator should have been removed by the optimizer") --- End diff -- by analyzer --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72012385 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- Yep. I will wrap the whole query plan. `Project` is not always there. ``` SELECT /*+ MAPJOIN(t) */ * from t, u where 1=1 group by 1 order by 1 'Sort [1 ASC], true +- 'Aggregate [1], [*] +- 'Filter (1 = 1) +- 'Hint MAPJOIN, [t] +- 'Join Inner :- 'UnresolvedRelation `t` +- 'UnresolvedRelation `u` ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011582 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- First of all, I'll take a look again. I need some boot-up time. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011413 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- For the second one, I will revise again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- I see. Since this is `SELECT hint`, which one do you mean? - `Hint(Project)` - `Project(Hint)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72011145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- then should we wrap the whole query plan with `Hint` instead of just the relation? https://github.com/apache/spark/pull/14132#discussion-diff-71820260R343 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72010521 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- Hi, @cloud-fan . Thank you again. Yes. That's true. So, at my first commit of this PR, I actually implemented to attach just the concrete `BroadcastHint` to the relations in `AstBuilder.scala`. But, there were some advices to - Move out the logic from the parser module (for preventing future updates on parser module) - Generalize the hint syntax like the current code. Since a general `Hint` is attached on a logical plan not a table relation, we can substitute any other plans (filter/project/aggregator), too. The following is my previous comment about this situation. The child of hint is `JOIN`. > Yep. I grasp your point. First, the given relation is just a logical plan, not a table. In the following case, the relation is Join. > SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0, parquet_t1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r72008134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,49 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + * + * This rule substitutes `UnresolvedRelation`s in `Substitute` batch before `ResolveRelations` + * rule is applied. Here are two reasons. + * - To support `MetastoreRelation` in Hive module. + * - To reduce the effect of `Hint` on the other rules. + * + * After this rule, it is guaranteed that there exists no unknown `Hint` in the plan. + * All new `Hint`s should be transformed into concrete Hint classes `BroadcastHint` here. + */ + object SubstituteHints extends Rule[LogicalPlan] { --- End diff -- how does hive handle sql hints? This rule looks a little complex to me, according to the fact that hints can only be applied to table relation. It will be great if we can wrap `UnresolvedRelation` with `Hint` in the parser, and then it will be very easy to resolve hints at analyzer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71826514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- I remember the reason. Originally, I designed `Hint` is removed by some rules of `Analyzer` or `Optimizer`. But, now, it's changed. Only rules of `Analyzer` handles that and removes all unknown hints. `CheckAnalysis` is the correct position. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71825927 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- Ur, it's added by previous advice comments. No problem. I'll move that into `CheckAnalysis`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71825810 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -339,8 +339,11 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { case SqlBaseParser.SELECT => // Regular select +// Add hints. +val withHint = relation.optionalMap(ctx.hint)(withHints) --- End diff -- Yep. I grasp your point. First, the given `relation` is just a logical plan, not a table. In the following case, the `relation` is `Join`. ``` SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0, parquet_t1 ``` This means just a context. Second, I agree with your point. Generally, `hint` is about tables like join order or access path. We can provides some hints on filter conditions, I'm not sure about the cases so far. But, even for that cases, we can adapt that in `Analyzer` easily. In other words, we can identify `Hint("FILTERHINT")` in the same SPJ. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71823645 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- so why do we have this case? `SQLBuilder` will be applied on analyzed plan right? And I think it's better to push this check in `CheckAnalysis` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71823613 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -339,8 +339,11 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { case SqlBaseParser.SELECT => // Regular select +// Add hints. +val withHint = relation.optionalMap(ctx.hint)(withHints) --- End diff -- for now the hint can only affect the relations(broadcast hint), so it's ok to just put the `Hint` on top of relation. But generally hint can affect anything in the select clause, e.g. maybe the filter condition. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71823152 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- Also, `SparkStrategies` ensures that here https://github.com/apache/spark/pull/14132/files#diff-7253a38df7e111ecf6b1ef71feba383bR347 , too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71823079 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- After applying the rules of `Analyzer`, there will be no more `Hint` case class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71822992 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -339,8 +339,11 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { case SqlBaseParser.SELECT => // Regular select +// Add hints. +val withHint = relation.optionalMap(ctx.hint)(withHints) --- End diff -- Ur, could you give some example what you concern? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71822903 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -347,6 +347,16 @@ querySpecification windows?) ; +hint +: '/*+' hintStatement '*/' +; + +hintStatement +: hintName=identifier +| hintName=identifier '(' parameters+=identifier parameters+=identifier ')' --- End diff -- Thank you for review, @cloud-fan . The first goal of this PR provides a general syntax for hints, not only broadcast hints. The `(` and `)` syntax is for `INDEX(t idx_emp)` style. You can see the testcase for this in the testcase, too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71820467 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +196,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => --- End diff -- will an analyzed plan contain `Hint`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71820260 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -339,8 +339,11 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging { case SqlBaseParser.SELECT => // Regular select +// Add hints. +val withHint = relation.optionalMap(ctx.hint)(withHints) --- End diff -- according to the grammar: `SELECT /*+ hint... */ ...`, it seems that hint should be applied to the whole SELECT clause, not only its relations? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71819403 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -347,6 +347,16 @@ querySpecification windows?) ; +hint +: '/*+' hintStatement '*/' +; + +hintStatement +: hintName=identifier +| hintName=identifier '(' parameters+=identifier parameters+=identifier ')' --- End diff -- do we need this rule for broadcast hint? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71756794 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -162,6 +162,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { toSQL(p.right), p.condition.map(" ON " + _.sql).getOrElse("")) +case h @ Hint(_, _, s @ SubqueryAlias(alias, p @ Project(_, _: SQLTable))) => --- End diff -- Sure! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71756744 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +443,46 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode +if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- No problem, if the decision is made. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71756590 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- Yep. No problem. For both ways, I'm ready. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71736274 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -162,6 +162,9 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { toSQL(p.right), p.condition.map(" ON " + _.sql).getOrElse("")) +case h @ Hint(_, _, s @ SubqueryAlias(alias, p @ Project(_, _: SQLTable))) => --- End diff -- The new way looks cleaner, but this looks like a special handling. Can you write a code comment to explain when it is being used? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71735446 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +443,46 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode +if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- As mentioned before, how about using `isBroadcastHint` here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71735281 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- Actually, `isBroadcastHint` can be reused below for `UnaryNode` ```SQL case j: Join if j.children.exists(isBroadcastHint) => val parameter = j.children.filter(isBroadcastHint).map(_.asInstanceOf[Hint]).flatMap(_.parameters) val newChildren = j.children.map(x => if (isBroadcastHint(x)) x.asInstanceOf[Hint].child else x) Hint("BROADCAST", parameter = parameter, j.copy(left = newChildren(0), right = newChildren(1))) ``` However, I am not sure which way is better. Leave it here and let the Committers decide it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71694743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- For the functional equivalence, you can see the logic. Also, the above code is passed `LogicalPlanToSQLSuite` with the existing tests, too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71694195 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- Here is the candidate. Since we cannot take advantage of `case` statement, I added a function for that. ``` private def isBroadcastHint(logicalPlan: LogicalPlan): Boolean = logicalPlan.isInstanceOf[Hint] && logicalPlan.asInstanceOf[Hint].name.equals("BRAODCAST") ... case j: Join if j.children.exists(isBroadcastHint) => val newChildren = j.children.map(x => if (isBroadcastHint(x)) x.asInstanceOf[Hint].child else x) Hint("BROADCAST", j.children.filter(isBroadcastHint).map(_.asInstanceOf[Hint]).flatMap(_.parameters), j.copy(left = newChildren(0), right = newChildren(1))) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71683008 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- Hmm. For this one, I found more reasonable solution. Instead of adding new dummy `Project`, we can utilize the existing dummy `Project` on the `SQLTable`. Basically, the root cause of this problem of `groupingSetToSQL` is that it removes the project by doing this `toSQL(project.child)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71671510 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- To do this, we need to add the extra dummy `Project` for the following condition. ```scala case a @ Aggregate(_, _, e @ Expand(_, _, p: Project)) if isGroupingSet(a, e, p) => groupingSetToSQL(a, e, p) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71627841 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- You mean before committing here? No problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71627448 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- Can you post your code when you merge them in a single case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71627192 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- Ur, in this case, I thought the combined one looks complex, doesn't it? I'm not sure that also could reduce the number of lines. Anyway, it's your advice. I'll try to find some way. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71626761 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode +if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => + val hint = u.child.asInstanceOf[Hint] + hint.copy(child = u.withNewChildren(Seq(hint.child))) + +// Other binary or higher operations are ignored. --- End diff -- We still need to explain why they can be ignored. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71626714 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) --- End diff -- The above three `Join` cases can be combined into a single one. The code might looks cleaner --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71626636 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +193,12 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => + toSQL(child) + +case BroadcastHint(child) => + toSQL(child) --- End diff -- Yep. Right. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71626212 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -193,6 +193,12 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { case OneRowRelation => "" +case Hint(_, _, child) => + toSQL(child) + +case BroadcastHint(child) => + toSQL(child) --- End diff -- This is dead code. Right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71625598 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- Okay. I'll try that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71625297 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- Just did a quick check. Except `groupingSetToSQL`, the other nodes are not converting the `Project` operator to SQL. That is the reason why we have to add a special handling for `BroadcastHint`. Then, maybe we can add an extra dummy `Project` in `groupingSetToSQL`. If so, we can avoid adding the special handling for `BroadcastHint` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71623500 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- That is like a temporary patch, instead of a permanent solution. We need a clean way to resolve it. Let me find what is the root cause --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71465508 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,35 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. --- End diff -- Also, in the PR description, too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71464803 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,35 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. --- End diff -- Oh, I missed you comment here. It's too far from the bottom now. :) I'll add more `prerequisite, dependency assumptions` here now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71464346 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- "SELECT" occurs in the followings. But, I didn't added meaning-logic based on the testcases. - aggregateToSQL - generateToSQL => This has only "SELECT 1" - groupingSetToSQL - projectToSQL - windowToSQL I think the test coverage enough due to it generates the above cases. If you suggests more testcases, I welcome. I like robustness both for this PR and for the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71463733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- It's the result of Window test query. For the Windows query, there were nested Projects. ``` test("broadcast hint with window") { checkSQL( """ |SELECT /*+ MAPJOIN(parquet_t1) */ | x.key, MAX(y.key) OVER (PARTITION BY x.key % 5 ORDER BY x.key) |FROM parquet_t1 x JOIN parquet_t1 y ON x.key = y.key """.stripMargin, "broadcast_hint_window") } ``` I had the same feeling why some "SELECT" doesn't happen. After https://issues.apache.org/jira/browse/SPARK-16576 , I think this kind of weirdness will be reduced. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71463337 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- Sure. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71457670 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- uh, yeah! please add two more spaces before `if` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71457606 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- Could you please do more investigation on this? The current solution looks not clean to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71446840 --- Diff: sql/hive/src/test/resources/sqlgen/broadcast_hint_nested_query_1.sql --- @@ -0,0 +1,7 @@ +-- This file is automatically generated by LogicalPlanToSQLSuite. +SELECT /*+ MAPJOIN(parquet_t0) */ * +FROM (SELECT id tid FROM parquet_t0) T +JOIN (SELECT key uid FROM parquet_t1) U +ON tid=uid + +SELECT `gen_attr` AS `tid`, `gen_attr` AS `uid` FROM (SELECT `gen_attr`, `gen_attr` FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` AS `gen_attr` FROM (SELECT `id` AS `gen_attr` FROM `default`.`parquet_t0`) AS gen_subquery_0) AS T INNER JOIN (SELECT `gen_attr` AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_1) AS U ON (`gen_attr` = `gen_attr`)) AS gen_subquery_2 --- End diff -- Yep. I'm trying to recover them in https://github.com/apache/spark/pull/14257 , too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71444098 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => + val hint = u.child.asInstanceOf[Hint] + hint.copy(child = u.withNewChildren(Seq(hint.child))) + +// Other binary operations are ignored. --- End diff -- Yep. I worried you say like that. :) I hesitated to use `higher` or `multinary`. May I enumerate all of them? At first, I did but remove that comment since it will be outdated. Let me think about this more. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71443911 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- ? This one is switching the `UnaryNode` and its child `Hint`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71443752 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- Yep. At first, I thought we need to change `projectToSQL`, but the current logic requires this. For `windowToSQL`, it seems not to be attached `BroadcastHint` directly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71443423 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala --- @@ -447,4 +447,42 @@ class PlanParserSuite extends PlanTest { assertEqual("select a, b from db.c where x !> 1", table("db", "c").where('x <= 1).select('a, 'b)) } + + test("select hint syntax") { +// Hive compatibility: Missing parameter raises ParseException. +val m = intercept[ParseException] { + parsePlan("SELECT /*+ HINT() */ * FROM t") +}.getMessage +assert(m.contains("no viable alternative at input")) + +// Hive compatibility: No database. +val m2 = intercept[ParseException] { + parsePlan("SELECT /*+ MAPJOIN(default.t) */ * from default.t;") +}.getMessage +assert(m2.contains("no viable alternative at input")) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) --- End diff -- My bad. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71443387 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala --- @@ -447,4 +447,42 @@ class PlanParserSuite extends PlanTest { assertEqual("select a, b from db.c where x !> 1", table("db", "c").where('x <= 1).select('a, 'b)) } + + test("select hint syntax") { +// Hive compatibility: Missing parameter raises ParseException. +val m = intercept[ParseException] { + parsePlan("SELECT /*+ HINT() */ * FROM t") +}.getMessage +assert(m.contains("no viable alternative at input")) + +// Hive compatibility: No database. +val m2 = intercept[ParseException] { + parsePlan("SELECT /*+ MAPJOIN(default.t) */ * from default.t;") +}.getMessage +assert(m2.contains("no viable alternative at input")) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ BROADCASTJOIN(u) */ * FROM t"), + Hint("BROADCASTJOIN", Seq("u"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ MAPJOIN(u) */ * FROM t"), + Hint("MAPJOIN", Seq("u"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t"), + Hint("STREAMTABLE", Seq("a", "b", "c"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ INDEX(t emp_job_ix) */ * FROM t"), + Hint("INDEX", Seq("t", "emp_job_ix"), table("t")).select(star())) + } --- End diff -- Yep. I added like this. ``` comparePlans( parsePlan("SELECT /*+ MAPJOIN(`default.t`) */ * from `default.t`"), Hint("MAPJOIN", Seq("default.t"), table("default.t")).select(star())) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71441985 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,35 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. --- End diff -- It sounds like you remove the comment you put before. Second, we also need to explain the restriction and the future work. Third, we also need to document the issue you hit in the PR description. It can help us understand the issues when we enhance it later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71441464 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => + val hint = u.child.asInstanceOf[Hint] + hint.copy(child = u.withNewChildren(Seq(hint.child))) + +// Other binary operations are ignored. --- End diff -- This is not clear. Which operations are ignored? and Why they are ignored? You know, `Union` is not a binary operation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71441209 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +449,44 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) + +// Other UnaryNodes are bypassed. +case u: UnaryNode + if u.child.isInstanceOf[Hint] && u.child.asInstanceOf[Hint].name.equals("BROADCAST") => --- End diff -- Yu can put another case for Hint. Then, you do not need this `if` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71441085 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -356,8 +372,14 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +val broadcastHint = project match { + case p @ Project(projectList, Hint("BROADCAST", tables, child)) => +if (tables.nonEmpty) s"/*+ MAPJOIN(${tables.mkString(", ")}) */" else "" + case _ => "" +} build( "SELECT", + broadcastHint, --- End diff -- This is in `groupingSetToSQL`. This looks weird. How about the others? e.g., `windowToSQL`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71433270 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala --- @@ -447,4 +447,42 @@ class PlanParserSuite extends PlanTest { assertEqual("select a, b from db.c where x !> 1", table("db", "c").where('x <= 1).select('a, 'b)) } + + test("select hint syntax") { +// Hive compatibility: Missing parameter raises ParseException. +val m = intercept[ParseException] { + parsePlan("SELECT /*+ HINT() */ * FROM t") +}.getMessage +assert(m.contains("no viable alternative at input")) + +// Hive compatibility: No database. +val m2 = intercept[ParseException] { + parsePlan("SELECT /*+ MAPJOIN(default.t) */ * from default.t;") +}.getMessage +assert(m2.contains("no viable alternative at input")) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) --- End diff -- A duplicate case, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71433044 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala --- @@ -447,4 +447,42 @@ class PlanParserSuite extends PlanTest { assertEqual("select a, b from db.c where x !> 1", table("db", "c").where('x <= 1).select('a, 'b)) } + + test("select hint syntax") { +// Hive compatibility: Missing parameter raises ParseException. +val m = intercept[ParseException] { + parsePlan("SELECT /*+ HINT() */ * FROM t") +}.getMessage +assert(m.contains("no viable alternative at input")) + +// Hive compatibility: No database. +val m2 = intercept[ParseException] { + parsePlan("SELECT /*+ MAPJOIN(default.t) */ * from default.t;") +}.getMessage +assert(m2.contains("no viable alternative at input")) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ HINT */ * FROM t"), + Hint("HINT", Seq.empty, table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ BROADCASTJOIN(u) */ * FROM t"), + Hint("BROADCASTJOIN", Seq("u"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ MAPJOIN(u) */ * FROM t"), + Hint("MAPJOIN", Seq("u"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t"), + Hint("STREAMTABLE", Seq("a", "b", "c"), table("t")).select(star())) + +comparePlans( + parsePlan("SELECT /*+ INDEX(t emp_job_ix) */ * FROM t"), + Hint("INDEX", Seq("t", "emp_job_ix"), table("t")).select(star())) + } --- End diff -- This should work. Please add this test case. ```parsePlan("SELECT /*+ MAPJOIN(`default.t`) */ * from default.t")``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71432748 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala --- @@ -447,4 +447,42 @@ class PlanParserSuite extends PlanTest { assertEqual("select a, b from db.c where x !> 1", table("db", "c").where('x <= 1).select('a, 'b)) } + + test("select hint syntax") { +// Hive compatibility: Missing parameter raises ParseException. +val m = intercept[ParseException] { + parsePlan("SELECT /*+ HINT() */ * FROM t") +}.getMessage +assert(m.contains("no viable alternative at input")) + +// Hive compatibility: No database. +val m2 = intercept[ParseException] { + parsePlan("SELECT /*+ MAPJOIN(default.t) */ * from default.t;") --- End diff -- remove the semicolon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71430308 --- Diff: sql/hive/src/test/resources/sqlgen/broadcast_hint_nested_query_1.sql --- @@ -0,0 +1,7 @@ +-- This file is automatically generated by LogicalPlanToSQLSuite. +SELECT /*+ MAPJOIN(parquet_t0) */ * +FROM (SELECT id tid FROM parquet_t0) T +JOIN (SELECT key uid FROM parquet_t1) U +ON tid=uid + +SELECT `gen_attr` AS `tid`, `gen_attr` AS `uid` FROM (SELECT `gen_attr`, `gen_attr` FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` AS `gen_attr` FROM (SELECT `id` AS `gen_attr` FROM `default`.`parquet_t0`) AS gen_subquery_0) AS T INNER JOIN (SELECT `gen_attr` AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_1) AS U ON (`gen_attr` = `gen_attr`)) AS gen_subquery_2 --- End diff -- oh, I see. The compare removes the id. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71429222 --- Diff: sql/hive/src/test/resources/sqlgen/broadcast_hint_nested_query_1.sql --- @@ -0,0 +1,7 @@ +-- This file is automatically generated by LogicalPlanToSQLSuite. +SELECT /*+ MAPJOIN(parquet_t0) */ * +FROM (SELECT id tid FROM parquet_t0) T +JOIN (SELECT key uid FROM parquet_t1) U +ON tid=uid + +SELECT `gen_attr` AS `tid`, `gen_attr` AS `uid` FROM (SELECT `gen_attr`, `gen_attr` FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` AS `gen_attr` FROM (SELECT `id` AS `gen_attr` FROM `default`.`parquet_t0`) AS gen_subquery_0) AS T INNER JOIN (SELECT `gen_attr` AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_1) AS U ON (`gen_attr` = `gen_attr`)) AS gen_subquery_2 --- End diff -- This is wrong, right? If we remove the hint, the output should be like ``` SELECT `gen_attr_86` AS `tid`, `gen_attr_87` AS `uid` FROM (SELECT `gen_attr_86`, `gen_attr_87` FROM (SELECT `gen_attr_13` AS `gen_attr_86` FROM (SELECT `id` AS `gen_attr_13` FROM `default`.`parquet_t0`) AS gen_subquery_0) AS T INNER JOIN (SELECT `gen_attr_88` AS `gen_attr_87` FROM (SELECT `key` AS `gen_attr_88`, `value` AS `gen_attr_89` FROM `default`.`parquet_t1`) AS gen_subquery_1) AS U ON ( `gen_attr_86` = `gen_attr_87` )) AS gen_subquery_2 ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71425663 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/BroadcastHintSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.catalyst.plans.logical.BroadcastHint +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class BroadcastHintSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { + test("broadcast hint on Hive table") { +withTable("hive_t", "hive_u") { + spark.sql("CREATE TABLE hive_t(a int)") + spark.sql("CREATE TABLE hive_u(b int)") + + val hive_t = spark.table("hive_t").queryExecution.analyzed + val hive_u = spark.table("hive_u").queryExecution.analyzed + + val plan = spark.sql("SELECT /*+ MAPJOIN(hive_t) */ * FROM hive_t, hive_u") +.queryExecution.analyzed + assert(plan.children(0).children(0).isInstanceOf[BroadcastHint]) --- End diff -- Sure! No problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71425497 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -933,4 +933,124 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkSQL("select * from orc_t", "select_orc_table") } } + + test("broadcast hint on single table") { +checkSQL("SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0", + "broadcast_hint_single_table_1") + +checkSQL("SELECT /*+ MAPJOIN(parquet_t0, parquet_t0) */ * FROM parquet_t0", + "broadcast_hint_single_table_2") + +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 as a", + "broadcast_hint_single_table_3") + +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * +|FROM (SELECT id tid FROM parquet_t0) T +|JOIN (SELECT id uid FROM parquet_t0) U +|ON tid=uid + """.stripMargin, + "broadcast_hint_single_4") + } + + test("broadcast hint on multiple tables") { +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0, parquet_t1", + "broadcast_hint_multiple_table_1") +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t1) */ * FROM parquet_t0, parquet_t1", + "broadcast_hint_multiple_table_2") + } + + test("broadcast hint on nested query 1") { +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * +|FROM (SELECT id tid FROM parquet_t0) T +|JOIN (SELECT key uid FROM parquet_t1) U +|ON tid=uid + """.stripMargin, + "broadcast_hint_nested_query_1") + } + + test("broadcast hint on nested query 2") { +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ tid +|FROM (SELECT id tid FROM parquet_t0) T +|JOIN (SELECT key uid FROM parquet_t1) U +|ON tid=uid + """.stripMargin, + "broadcast_hint_nested_query_2") + } + + test("multiple broadcast hints on multiple tables") { +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t0, parquet_t1) */ * FROM parquet_t0, parquet_t1", + "multiple_broadcast_hints") + } + + test("broadcast hint with filter") { +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 WHERE id < 10", + "broadcast_hint_with_filter") + } + + test("broadcast hint with filter/limit") { +checkSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 WHERE id < 10 LIMIT 10", + "broadcast_hint_with_filter_limit") + } + + test("broadcast hint with generator") { +checkSQL( + "SELECT * FROM (SELECT /*+ MAPJOIN(parquet_t0) */ EXPLODE(ARRAY(1,2,3)) FROM parquet_t0) T", + "broadcast_hint_generator") + } + + test("broadcast hint with groupby/having/orderby") { +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * +|FROM parquet_t0 +|WHERE id > 0 +|GROUP BY id +|HAVING count(*) > 0 +|ORDER BY id + """.stripMargin, + "broadcast_hint_groupby_having_orderby") + } + + test("broadcast hint with window") { +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| x.key, MAX(y.key) OVER (PARTITION BY x.key % 5 ORDER BY x.key) +|FROM parquet_t1 x JOIN parquet_t1 y ON x.key = y.key + """.stripMargin, + "broadcast_hint_window") + } + + test("broadcast hint with rollup") { +checkSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| count(*) as cnt, key%5, grouping_id() +|FROM parquet_t1 +|GROUP BY key % 5 WITH ROLLUP""".stripMargin, --- End diff -- format --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71424857 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/BroadcastHintSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.catalyst.plans.logical.BroadcastHint +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class BroadcastHintSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { + test("broadcast hint on Hive table") { +withTable("hive_t", "hive_u") { + spark.sql("CREATE TABLE hive_t(a int)") + spark.sql("CREATE TABLE hive_u(b int)") + + val hive_t = spark.table("hive_t").queryExecution.analyzed + val hive_u = spark.table("hive_u").queryExecution.analyzed + + val plan = spark.sql("SELECT /*+ MAPJOIN(hive_t) */ * FROM hive_t, hive_u") +.queryExecution.analyzed + assert(plan.children(0).children(0).isInstanceOf[BroadcastHint]) --- End diff -- This is fragile. You can use `collectFirst` instead. Also check the content of `BroadcastHint`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71265020 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") +val tid = if (names.length > 1) { + TableIdentifier(names(1), Some(names(0))) +} else { + TableIdentifier(param, None) +} +try { + catalog.lookupRelation(tid) + + var stop = false + resolvedChild = resolvedChild.transformDown { +case r @ BroadcastHint(SubqueryAlias(t, _)) + if !stop && resolver(t, tid.identifier) => + stop = true + r +case r @ SubqueryAlias(t, _) if !stop && resolver(t, tid.identifier) => + stop = true + BroadcastHint(r) --- End diff -- I think we have to remove it; otherwise, the result will be wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71261455 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") +val tid = if (names.length > 1) { + TableIdentifier(names(1), Some(names(0))) +} else { + TableIdentifier(param, None) +} +try { + catalog.lookupRelation(tid) + + var stop = false + resolvedChild = resolvedChild.transformDown { +case r @ BroadcastHint(SubqueryAlias(t, _)) + if !stop && resolver(t, tid.identifier) => + stop = true + r +case r @ SubqueryAlias(t, _) if !stop && resolver(t, tid.identifier) => + stop = true + BroadcastHint(r) --- End diff -- Sure, If you want to remove this, I can simply this. It's a little bit legacy I tried to follow your advice as much as possible. (As I mentioned before, I decide to block in Parser layer after I found that Hive does.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71261042 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case ll @ LocalLimit(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = ll.copy(child = hintChild)) +case f @ Filter(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = f.copy(child = hintChild)) +case a @ Aggregate(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = a.copy(child = hintChild)) +case s @ Sort(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case g @ Generate(_, _, _, _, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = g.copy(child = hintChild)) +// Set operation is not allowed to be across. UNION/INTERCEPT/EXCEPT --- End diff -- Okay. But, that was the reason not to allowed there. Hmm, maybe it looks different. Sorry. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71260947 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") +val tid = if (names.length > 1) { + TableIdentifier(names(1), Some(names(0))) +} else { + TableIdentifier(param, None) +} +try { + catalog.lookupRelation(tid) + + var stop = false + resolvedChild = resolvedChild.transformDown { +case r @ BroadcastHint(SubqueryAlias(t, _)) + if !stop && resolver(t, tid.identifier) => + stop = true + r +case r @ SubqueryAlias(t, _) if !stop && resolver(t, tid.identifier) => + stop = true + BroadcastHint(r) --- End diff -- If we do not support `db.table`, why we still compare the whole identifier? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71260158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case ll @ LocalLimit(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = ll.copy(child = hintChild)) +case f @ Filter(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = f.copy(child = hintChild)) +case a @ Aggregate(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = a.copy(child = hintChild)) +case s @ Sort(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case g @ Generate(_, _, _, _, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = g.copy(child = hintChild)) +// Set operation is not allowed to be across. UNION/INTERCEPT/EXCEPT --- End diff -- Your reply is different from the comment in the code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71259834 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -87,6 +87,7 @@ class Analyzer( EliminateUnions), Batch("Resolution", fixedPoint, ResolveRelations :: + SubstituteHints :: --- End diff -- In each batch, the order of rules should not matter. That means, the rule `SubstituteHints` impacts the other rules. Please fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71259759 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- Sorry, I do not think this is beyond of the scope. Instead, this is a bug. This identifier stores a table identifier. If we do not use the same parsing solution, the result will be wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71259501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. --- End diff -- If you read what we did in SQLBuilder, you might know that is not the normal way we did. @rxin gave the same comment below: https://github.com/apache/spark/pull/14132#issuecomment-233503432. Keeping a white list is hard to maintain. You know, I still can find more missing cases here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71227811 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -988,9 +988,88 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { ) } + test("broadcast hint with window") { +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| x.key, MAX(y.key) OVER (PARTITION BY x.key % 5 ORDER BY x.key) +|FROM parquet_t1 x JOIN parquet_t1 y ON x.key = y.key + """.stripMargin, + """ +|SELECT `gen_attr` AS `key`, `gen_attr` AS `max(key) OVER (PARTITION BY (key % CAST(5 AS BIGINT)) ORDER BY key ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` +|FROM (SELECT `gen_attr`, `gen_attr` +| FROM (SELECT gen_subquery_2.`gen_attr`, gen_subquery_2.`gen_attr`, gen_subquery_2.`gen_attr`, +| max(`gen_attr`) OVER (PARTITION BY `gen_attr` ORDER BY `gen_attr` ASC +| RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `gen_attr` +|FROM (SELECT /*+ MAPJOIN(parquet_t1) */ `gen_attr`, `gen_attr`, +| (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr` +| FROM ((SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_0) +| AS x +| INNER JOIN (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_1 +| ON (`gen_attr` = `gen_attr`)) +| AS gen_subquery_2) +| AS gen_subquery_3) +| AS x + """.stripMargin +) + } + + test("broadcast hint with rollup") { +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| count(*) as cnt, key%5, grouping_id() +|FROM parquet_t1 +|GROUP BY key % 5 WITH ROLLUP""".stripMargin, + """ +|SELECT `gen_attr` AS `cnt`, `gen_attr` AS `(key % CAST(5 AS BIGINT))`, `gen_attr` AS `grouping_id()` +|FROM (SELECT /*+ MAPJOIN(parquet_t1) */ count(1) AS `gen_attr`, +| (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr`, grouping_id() AS `gen_attr` +| FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +|FROM `default`.`parquet_t1`) +|AS gen_subquery_0 +| GROUP BY (`gen_attr` % CAST(5 AS BIGINT)) +| GROUPING SETS(((`gen_attr` % CAST(5 AS BIGINT))), ())) +| AS gen_subquery_1 + """.stripMargin) + + } + + test("broadcast hint with grouping sets") { +checkGeneratedSQL( + s""" + |SELECT /*+ MAPJOIN(parquet_t1) */ + | count(*) AS cnt, key % 5 AS k1, key - 5 AS k2,grouping_id() AS k3 + |FROM (SELECT key, key % 2, key - 5 FROM parquet_t1) t GROUP BY key % 5, key - 5 + |GROUPING SETS (key % 5, key - 5) + """.stripMargin, + """ +|SELECT `gen_attr` AS `cnt`, `gen_attr` AS `k1`, `gen_attr` AS `k2`, `gen_attr` AS `k3` +|FROM (SELECT count(1) AS `gen_attr`, (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr`, +| (`gen_attr` - CAST(5 AS BIGINT)) AS `gen_attr`, grouping_id() AS `gen_attr` +| FROM (SELECT /*+ MAPJOIN(parquet_t1) */ `gen_attr`, +| (`gen_attr` % CAST(2 AS BIGINT)) AS `gen_attr`, +| (`gen_attr` - CAST(5 AS BIGINT)) AS `gen_attr` +|FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_0) +|AS t +| GROUP BY (`gen_attr` % CAST(5 AS BIGINT)), (`gen_attr` - CAST(5 AS BIGINT)) +| GROUPING SETS(((`gen_attr` % CAST(5 AS BIGINT))), ((`gen_attr` - CAST(5 AS BIGINT) +| AS gen_subquery_1 + """.stripMargin) + } + + // This will be removed in favor of SPARK-16590. private def checkGeneratedSQL(sqlString: String, expectedString: String): Unit = { -val generatedSQL = new SQLBuilder(sql(sqlString)).toSQL +val df = sql(sqlString) +logError("\n" + df.queryExecution.analyzed.treeString) --- End diff -- After https://github.com/apache/spark/pull/14235, I need to change
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71227405 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -988,9 +988,88 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { ) } + test("broadcast hint with window") { +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| x.key, MAX(y.key) OVER (PARTITION BY x.key % 5 ORDER BY x.key) +|FROM parquet_t1 x JOIN parquet_t1 y ON x.key = y.key + """.stripMargin, + """ +|SELECT `gen_attr` AS `key`, `gen_attr` AS `max(key) OVER (PARTITION BY (key % CAST(5 AS BIGINT)) ORDER BY key ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` +|FROM (SELECT `gen_attr`, `gen_attr` +| FROM (SELECT gen_subquery_2.`gen_attr`, gen_subquery_2.`gen_attr`, gen_subquery_2.`gen_attr`, +| max(`gen_attr`) OVER (PARTITION BY `gen_attr` ORDER BY `gen_attr` ASC +| RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `gen_attr` +|FROM (SELECT /*+ MAPJOIN(parquet_t1) */ `gen_attr`, `gen_attr`, +| (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr` +| FROM ((SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_0) +| AS x +| INNER JOIN (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_1 +| ON (`gen_attr` = `gen_attr`)) +| AS gen_subquery_2) +| AS gen_subquery_3) +| AS x + """.stripMargin +) + } + + test("broadcast hint with rollup") { +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t1) */ +| count(*) as cnt, key%5, grouping_id() +|FROM parquet_t1 +|GROUP BY key % 5 WITH ROLLUP""".stripMargin, + """ +|SELECT `gen_attr` AS `cnt`, `gen_attr` AS `(key % CAST(5 AS BIGINT))`, `gen_attr` AS `grouping_id()` +|FROM (SELECT /*+ MAPJOIN(parquet_t1) */ count(1) AS `gen_attr`, +| (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr`, grouping_id() AS `gen_attr` +| FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +|FROM `default`.`parquet_t1`) +|AS gen_subquery_0 +| GROUP BY (`gen_attr` % CAST(5 AS BIGINT)) +| GROUPING SETS(((`gen_attr` % CAST(5 AS BIGINT))), ())) +| AS gen_subquery_1 + """.stripMargin) + + } + + test("broadcast hint with grouping sets") { +checkGeneratedSQL( + s""" + |SELECT /*+ MAPJOIN(parquet_t1) */ + | count(*) AS cnt, key % 5 AS k1, key - 5 AS k2,grouping_id() AS k3 + |FROM (SELECT key, key % 2, key - 5 FROM parquet_t1) t GROUP BY key % 5, key - 5 + |GROUPING SETS (key % 5, key - 5) + """.stripMargin, + """ +|SELECT `gen_attr` AS `cnt`, `gen_attr` AS `k1`, `gen_attr` AS `k2`, `gen_attr` AS `k3` +|FROM (SELECT count(1) AS `gen_attr`, (`gen_attr` % CAST(5 AS BIGINT)) AS `gen_attr`, +| (`gen_attr` - CAST(5 AS BIGINT)) AS `gen_attr`, grouping_id() AS `gen_attr` +| FROM (SELECT /*+ MAPJOIN(parquet_t1) */ `gen_attr`, +| (`gen_attr` % CAST(2 AS BIGINT)) AS `gen_attr`, +| (`gen_attr` - CAST(5 AS BIGINT)) AS `gen_attr` +|FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` +| FROM `default`.`parquet_t1`) +| AS gen_subquery_0) +|AS t +| GROUP BY (`gen_attr` % CAST(5 AS BIGINT)), (`gen_attr` - CAST(5 AS BIGINT)) +| GROUPING SETS(((`gen_attr` % CAST(5 AS BIGINT))), ((`gen_attr` - CAST(5 AS BIGINT) +| AS gen_subquery_1 + """.stripMargin) + } + + // This will be removed in favor of SPARK-16590. private def checkGeneratedSQL(sqlString: String, expectedString: String): Unit = { -val generatedSQL = new SQLBuilder(sql(sqlString)).toSQL +val df = sql(sqlString) +logError("\n" + df.queryExecution.analyzed.treeString) --- End diff -- Oops. Please ignore this since we will be under review for a while.
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71203247 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -755,4 +755,243 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkHiveQl("select * from orc_t") } } + + test("broadcast hint on single table") { +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0, parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 as a", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM ((SELECT `id` AS `gen_attr` +| FROM `default`.`parquet_t0`) +| AS gen_subquery_0) +|AS a) +| AS a + """.stripMargin) + +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * --- End diff -- It's beyond the scope of this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71203038 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") +val tid = if (names.length > 1) { + TableIdentifier(names(1), Some(names(0))) +} else { + TableIdentifier(param, None) +} +try { + catalog.lookupRelation(tid) + + var stop = false + resolvedChild = resolvedChild.transformDown { +case r @ BroadcastHint(SubqueryAlias(t, _)) + if !stop && resolver(t, tid.identifier) => + stop = true + r +case r @ SubqueryAlias(t, _) if !stop && resolver(t, tid.identifier) => + stop = true + BroadcastHint(r) --- End diff -- Here, too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71202999 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => --- End diff -- It seems `database.table` issue which I mentioned before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71202822 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. +case j @ Join(h @ Hint("BROADCAST", _, hintChild), _, _, _) => + h.copy(child = j.copy(left = hintChild)) +case j @ Join(_, h @ Hint("BROADCAST", _, hintChild), _, _) => + h.copy(child = j.copy(right = hintChild)) +case s @ SubqueryAlias(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case ll @ LocalLimit(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = ll.copy(child = hintChild)) +case f @ Filter(_, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = f.copy(child = hintChild)) +case a @ Aggregate(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = a.copy(child = hintChild)) +case s @ Sort(_, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = s.copy(child = hintChild)) +case g @ Generate(_, _, _, _, _, h @ Hint("BROADCAST", _, hintChild)) => + h.copy(child = g.copy(child = hintChild)) +// Set operation is not allowed to be across. UNION/INTERCEPT/EXCEPT --- End diff -- It's not a restriction, SET operation UNION/INTERCETP/EXCEPT has `Project` inside it. So, I skip those. As I mentioned before, theoretically, the Hint can not be restored into the original `Project`. In addition to that, since `Canonicalizer` adds a number of extra things, we can not move upward enough, too. We need to stop of nearest `Project`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71202124 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -425,6 +452,49 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { } } +/** + * Merge and move upward to the nearest Project. + * A broadcast hint comment is scattered into multiple nodes inside the plan, and the + * information of BroadcastHint resides its current position inside the plan. In order to + * reconstruct broadcast hint comment, we need to pack the information of BroadcastHint into + * Hint("BROADCAST", _, _) and collect them up by moving upward to the nearest Project node. + */ +object NormalizeBroadcastHint extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// Capture the broadcasted information and store it in Hint. +case BroadcastHint(child @ SubqueryAlias(_, Project(_, SQLTable(database, table, _, _ => + Hint("BROADCAST", Seq(table), child) + +// Nearest Project is found. +case p @ Project(_, Hint(_, _, _)) => p + +// Merge BROADCAST hints up to the nearest Project. +case Hint("BROADCAST", params1, h @ Hint("BROADCAST", params2, _)) => + h.copy(parameters = params1 ++ params2) +case j @ Join(h1 @ Hint("BROADCAST", p1, left), h2 @ Hint("BROADCAST", p2, right), _, _) => + h1.copy(parameters = p1 ++ p2, child = j.copy(left = left, right = right)) + +// Bubble up BROADCAST hints to the nearest Project. --- End diff -- Thanks. I added `GlobalLimit`, too. It seems my mistake during removing the operator not allowed in SQLBuilder. When I develop Optimizers, I frequently asked to use white-list approach. I hope to keep this style if you are okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71201300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1774,6 +1775,51 @@ class Analyzer( } /** + * Substitute Hints. + * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. + */ + object SubstituteHints extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case logical: LogicalPlan => logical transformDown { +case h @ Hint(name, parameters, child) +if Seq("BROADCAST", "BROADCASTJOIN", "MAPJOIN").contains(name.toUpperCase) => + var resolvedChild = child + + for (param <- parameters) { +val names = param.split("\\.") --- End diff -- I took a look at #14244 . It's about `TableIdentifier`. This PR is designed to use `identifier` only instead of `tableIdentifier`. ``` tableIdentifier : (db=identifier '.')? table=identifier ; ``` It's beyond of the scope. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71200265 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -87,6 +87,7 @@ class Analyzer( EliminateUnions), Batch("Resolution", fixedPoint, ResolveRelations :: + SubstituteHints :: --- End diff -- Yep. Here is the error when we move it from the batch `Resolution` into the batch `Nondeterministic`. ``` unresolved operator 'Project [*]; org.apache.spark.sql.AnalysisException: unresolved operator 'Project [*]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14132: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14132#discussion_r71100213 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -755,4 +755,243 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkHiveQl("select * from orc_t") } } + + test("broadcast hint on single table") { +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0, parquet_t0) */ * FROM parquet_t0", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM (SELECT `id` AS `gen_attr` +|FROM `default`.`parquet_t0`) +|AS gen_subquery_0) +| AS parquet_t0 +|""".stripMargin) + +checkGeneratedSQL( + "SELECT /*+ MAPJOIN(parquet_t0) */ * FROM parquet_t0 as a", + """ +|SELECT `gen_attr` AS `id` +|FROM (SELECT /*+ MAPJOIN(parquet_t0) */ `gen_attr` +| FROM ((SELECT `id` AS `gen_attr` +| FROM `default`.`parquet_t0`) +| AS gen_subquery_0) +|AS a) +| AS a + """.stripMargin) + +checkGeneratedSQL( + """ +|SELECT /*+ MAPJOIN(parquet_t0) */ * --- End diff -- This should be `T. parquet_t0`, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org