[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265658#comment-15265658 ] Apache Spark commented on SPARK-14781: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12820 > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265000#comment-15265000 ] Frederick Reiss commented on SPARK-14781: - Yeah, Distinct will impact performance for the uncorrelated case if the subquery returns more than a few million rows. That problem won't occur in the particular case of TPC-DS query 45 (the subquery there returns at most 500k rows at a 100TB scale factor), but you never know. And of course a Distinct after the join, as one would need to cover EXISTS, would see potentially billions of rows. I just figured I'd mention that possibility as an expedient that doesn't require any additional operators. I'd be up to adding a "LeftSemiPlus" mode to the various join operators if you'd prefer for implementation to start with that step. The new behavior is almost the same as the existing LeftSemi mode: one additional output column in the schema, plus code to emit rows with a null value when nothing on the inner matches an outer tuple. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264868#comment-15264868 ] Davies Liu commented on SPARK-14781: PR 12720 is pretty close, we should wait for that. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264866#comment-15264866 ] Davies Liu commented on SPARK-14781: Distinct is slow, it's better to not use that. The LeftSemiPlus in my mind is something very close to LeftSemi, but 1) emit all the row from left exact once 2) each row has an additional column, which is the result of join condition (it's nullable) For any IN/EXISTS predicates, we do a LeftSemiPlus join on it's child, then replace the predicate with the additional attribute. (because LeftSemiPlus is not efficient as LeftSemi or LeftAnti, we may only do this when the predicate is not a top level conjunction) When we create the logical Join with LeftSemiPlus, we could create this additional attribute, and pass it around in optimizer and planner, because it will be used by other operators. We should support LeftSemiPlus in all the 4 join implementations. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264789#comment-15264789 ] Frederick Reiss commented on SPARK-14781: - [~davies]: I think I have a minimally-invasive plan for covering Q45. *Existing code:* {{RewritePredicateSubquery.apply}} applies to conjunctions of predicates in the WHERE clause. When this rule finds an IN predicate with a subquery, the rule rewrites the IN predicate into a join. *Proposed change:* Modify {{RewritePredicateSubquery}} so that it also detects disjunctions (ORs) where exactly one child of the disjunction is an IN predicate with a non-correlated subquery. Rewrite each such disjunction into a left outer join, followed by a Filter. The inner (right) operand of the left outer join should be the subquery with an additional Distinct operator above it. The Filter will apply the remaining predicates from the disjunction to any tuples that did not join with the subquery. *Notes:* The Distinct here is needed because the in-list could contain duplicates. The Distinct could be eliminated if there was a join operator that combined the behavior of LeftOuter and LeftSemijoin. I suppose that's what SemiPlus will do? This approach could be extended to cover correlated IN/EXISTS subqueries. The rewrite would need to add unique IDs to the outer query's tuples before the join + filter, then remove duplicates after the join + filter. I'm *not* planning to do this extension in the first pass. The approach could also be extended to cover multiple subqueries inside a disjunction by chaining together multiple outer joins. I'm *not* planning to do this extension in the first pass. *Questions:* * Do you foresee any problems with this approach? * There is a second version of the IN/EXISTS subquery rewrite logic in PR #12720, but that code hasn't been merged yet. Would you prefer a diff against the current head; or a diff against the logic in PR 12720? > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261474#comment-15261474 ] Frederick Reiss commented on SPARK-14781: - Sure, I'd be happy to put something together to cover Q45. Tomorrow is a mess, but I'll have time on Friday and Monday. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261230#comment-15261230 ] Davies Liu commented on SPARK-14781: [~freiss] SemiPlus is not introduced yet. Even the subquery in Q45 is not correlated, it also could be rewritten as JOIN (without joining condition), will be executed as BroadcastNestedLoopJoin. Do you have some cycles to work on this? > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261032#comment-15261032 ] Frederick Reiss commented on SPARK-14781: - [~davies] where is the definition of the SemiPlus operator? I don't see that anywhere in my copy of the main trunk. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261028#comment-15261028 ] Frederick Reiss commented on SPARK-14781: - I'm not so sure about Q45. Here's the template for Q45: {noformat} [_LIMITA] select [_LIMITB] ca_zip, [GBOBC], sum(ws_sales_price) from web_sales, customer, customer_address, date_dim, item where ws_bill_customer_sk = c_customer_sk and c_current_addr_sk = ca_address_sk and ws_item_sk = i_item_sk and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', '85392', '85460', '80348', '81792') or i_item_id in (select i_item_id from item where i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29) ) ) and ws_sold_date_sk = d_date_sk and d_qoy = [QOY] and d_year = [YEAR] group by ca_zip, [GBOBC] order by ca_zip, [GBOBC] [_LIMITC]; {noformat} This query does contain a subquery inside a disjunction ({{...or i_item_id in (select...}}), but that subquery is not correlated. What is needed there is for that subquery to be added to the list of noncorrelated subqueries evaluated in {{SparkPlan.waitForSubqueries()}} and a placeholder for those query results inserted into the plan. Q10 and Q35 have correlated EXISTS subqueries inside disjunctions. > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14781) Support subquery in nested predicates
[ https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259909#comment-15259909 ] Roy Cecil commented on SPARK-14781: --- This affects Q10, Q35 and Q45. If this is fixed we can then run 96 out of 99 queries out of the box with Minor Query Rewrites as allowed by TPC-DS . > Support subquery in nested predicates > - > > Key: SPARK-14781 > URL: https://issues.apache.org/jira/browse/SPARK-14781 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > Right now, we does not support nested IN/EXISTS subquery, for example > EXISTS( x1) OR EXISTS( x2) > In order to do that, we could use an internal-only join type SemiPlus, which > will output every row from left, plus additional column as the result of join > condition. Then we could replace the EXISTS() or IN() by the result column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org