[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-05-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265658#comment-15265658
 ] 

Apache Spark commented on SPARK-14781:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12820

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-29 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265000#comment-15265000
 ] 

Frederick Reiss commented on SPARK-14781:
-

Yeah, Distinct will impact performance for the uncorrelated case if the 
subquery returns more than a few million rows. That problem won't occur in the 
particular case of TPC-DS query 45 (the subquery there returns at most 500k 
rows at a 100TB scale factor), but you never know. And of course a Distinct 
after the join, as one would need to cover EXISTS, would see potentially 
billions of rows. I just figured I'd mention that possibility as an expedient 
that doesn't require any additional operators.

I'd be up to adding a "LeftSemiPlus" mode to the various join operators if 
you'd prefer for implementation to start with that step. The new behavior is 
almost the same as the existing LeftSemi mode: one additional output column in 
the schema, plus code to emit rows with a null value when nothing on the inner 
matches an outer tuple.

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-29 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264868#comment-15264868
 ] 

Davies Liu commented on SPARK-14781:


PR 12720 is pretty close, we should wait for that.

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-29 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264866#comment-15264866
 ] 

Davies Liu commented on SPARK-14781:


Distinct is slow, it's better to not use that.

The LeftSemiPlus in my mind is something very close to LeftSemi, but 
1) emit all the row from left exact once
2) each row has an additional column, which is the result of join condition 
(it's nullable)

For any IN/EXISTS predicates, we do a LeftSemiPlus join on it's child, then 
replace the predicate with the additional attribute. (because LeftSemiPlus is 
not efficient as LeftSemi or LeftAnti, we may only do this when the predicate 
is not a top level conjunction)

When we create the logical Join with LeftSemiPlus, we could create this 
additional attribute, and pass it around in optimizer and planner, because it 
will be used by other operators.

We should support LeftSemiPlus in all the 4 join implementations.

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-29 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264789#comment-15264789
 ] 

Frederick Reiss commented on SPARK-14781:
-

[~davies]: I think I have a minimally-invasive plan for covering Q45.

*Existing code:* {{RewritePredicateSubquery.apply}} applies to conjunctions of 
predicates in the WHERE clause. When this rule finds an IN predicate with a 
subquery, the rule rewrites the IN predicate into a join.

*Proposed change:* Modify {{RewritePredicateSubquery}} so that it also detects 
disjunctions (ORs) where exactly one child of the disjunction is an IN 
predicate with a non-correlated subquery. Rewrite each such disjunction into a 
left outer join, followed by a Filter. The inner (right) operand of the left 
outer join should be the subquery with an additional Distinct operator above 
it. The Filter will apply the remaining predicates from the disjunction to any 
tuples that did not join with the subquery.

*Notes:*
The Distinct here is needed because the in-list could contain duplicates. The 
Distinct could be eliminated if there was a join operator that combined the 
behavior of LeftOuter and LeftSemijoin. I suppose that's what SemiPlus will do?

This approach could be extended to cover correlated IN/EXISTS subqueries. The 
rewrite would need to add unique IDs to the outer query's tuples before the 
join + filter, then remove duplicates after the join + filter. I'm *not* 
planning to do this extension in the first pass.

The approach could also be extended to cover multiple subqueries inside a 
disjunction by chaining together multiple outer joins. I'm *not* planning to do 
this extension in the first pass.

*Questions:*
* Do you foresee any problems with this approach?
* There is a second version of the IN/EXISTS subquery rewrite logic in PR 
#12720, but that code hasn't been merged yet. Would you prefer a diff against 
the current head; or a diff against the logic in PR 12720?

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-27 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261474#comment-15261474
 ] 

Frederick Reiss commented on SPARK-14781:
-

Sure, I'd be happy to put something together to cover Q45. Tomorrow is a mess, 
but I'll have time on Friday and Monday. 

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-27 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261230#comment-15261230
 ] 

Davies Liu commented on SPARK-14781:


[~freiss] SemiPlus is not introduced yet. Even the subquery in Q45 is not 
correlated, it also could be rewritten as JOIN (without joining condition), 
will be executed as BroadcastNestedLoopJoin. Do you have some cycles to work on 
this?

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-27 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261032#comment-15261032
 ] 

Frederick Reiss commented on SPARK-14781:
-

[~davies] where is the definition of the SemiPlus operator? I don't see that 
anywhere in my copy of the main trunk.

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-27 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261028#comment-15261028
 ] 

Frederick Reiss commented on SPARK-14781:
-

I'm not so sure about Q45. Here's the template for Q45:

{noformat}
[_LIMITA] select [_LIMITB] ca_zip, [GBOBC], sum(ws_sales_price)
 from web_sales, customer, customer_address, date_dim, item
 where ws_bill_customer_sk = c_customer_sk
and c_current_addr_sk = ca_address_sk 
and ws_item_sk = i_item_sk 
and ( substr(ca_zip,1,5) in ('85669', '86197','88274','83405','86475', 
'85392', '85460', '80348', '81792')
  or 
  i_item_id in (select i_item_id
 from item
 where i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19, 
23, 29)
 )
)
and ws_sold_date_sk = d_date_sk
and d_qoy = [QOY] and d_year = [YEAR]
 group by ca_zip, [GBOBC]
 order by ca_zip, [GBOBC]
 [_LIMITC];
{noformat}
This query does contain a subquery inside a disjunction ({{...or i_item_id in 
(select...}}), but that subquery is not correlated. What is needed there is for 
that subquery to be added to the list of noncorrelated subqueries evaluated in 
{{SparkPlan.waitForSubqueries()}} and a placeholder for those query results 
inserted into the plan.

Q10 and Q35 have correlated EXISTS subqueries inside disjunctions.

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14781) Support subquery in nested predicates

2016-04-27 Thread Roy Cecil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259909#comment-15259909
 ] 

Roy Cecil commented on SPARK-14781:
---

This affects Q10, Q35 and Q45. If this is fixed we can then run 96 out of 99 
queries out of the box with Minor Query Rewrites as allowed by TPC-DS . 

> Support subquery in nested predicates
> -
>
> Key: SPARK-14781
> URL: https://issues.apache.org/jira/browse/SPARK-14781
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, we does not support nested IN/EXISTS subquery, for example 
> EXISTS( x1) OR EXISTS( x2)
> In order to do that, we could use an internal-only join type SemiPlus, which 
> will output every row from left, plus additional column as the result of join 
> condition. Then we could replace the EXISTS() or IN() by the result column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org