[ 
https://issues.apache.org/jira/browse/PHOENIX-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092985#comment-14092985
 ] 

James Taylor commented on PHOENIX-852:
--------------------------------------

I can see that the correlated subquery case will be different than the join 
case, but let's ignore that one for now (as it's an issue that needs to be 
solved when we support correlated subqueries regardless of the approach here).

So the IN query will do what you want, but I agree, you'd need to coordinate 
the LHS and RHS iterators carefully. For example, you'd need to not do a next 
on the RHS iterator until the LHS iterator returns a *different* value for the 
join key. You might be able to order the RHS by the join key and work out this 
coordination as you iterate, though doing this in a parallel manner would get 
tricky. Or maybe you could create the hash cache like today and then use it on 
the client side.
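
As a rough illustration of the coordination described above, here is a minimal single-threaded sketch. It assumes both sides are already sorted by the join key and that the RHS (parent) keys are unique. The class `CoordinatedJoinSketch`, the `Row` type, and the `join` method are hypothetical names for this example, not Phoenix APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CoordinatedJoinSketch {
    /** Hypothetical row shape: a join key plus an opaque payload. */
    public static final class Row {
        public final int key;
        public final String value;
        public Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Joins child rows (LHS) to parent rows (RHS).
     * Assumption: both iterators are sorted ascending by key,
     * and each key appears at most once on the RHS.
     */
    public static List<String> join(Iterator<Row> lhs, Iterator<Row> rhs) {
        List<String> out = new ArrayList<>();
        Row parent = null; // RHS row currently held
        while (lhs.hasNext()) {
            Row child = lhs.next();
            // Advance the RHS only while it lags behind the LHS join key.
            // If the LHS key has not changed, the RHS is left where it is.
            while ((parent == null || parent.key < child.key) && rhs.hasNext()) {
                parent = rhs.next();
            }
            if (parent != null && parent.key == child.key) {
                out.add(child.value + "->" + parent.value);
            }
        }
        return out;
    }
}
```

A parallel version would have to partition both sides on the same key boundaries, which is where this gets tricky.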

If you think it's easier/better to push the hash cache to all region servers 
like we do for the standard join case, extract the relevant keys on the 
server-side, and add a skip scan filter to the RHS scan on the server-side, 
that's fine too. You'll be pushing more information than you need to the region 
servers, though, since you could slice this up and only send what's required 
for each region (that's what we do when we process an IN). 
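
To make the "slice this up" idea concrete, here is a small sketch of partitioning a sorted set of join keys by region boundaries so that each region would receive only the keys in its own range. It simplifies row keys to `String` (HBase keys are byte arrays), and `RegionKeySlicer` and `slice` are hypothetical names for this example, not Phoenix or HBase APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

public class RegionKeySlicer {
    /**
     * Splits joinKeys into one slice per region.
     * Assumption: regionStartKeys is sorted ascending, each region covers
     * [startKey, nextStartKey), and the last region is unbounded above.
     */
    public static List<SortedSet<String>> slice(NavigableSet<String> joinKeys,
                                                List<String> regionStartKeys) {
        List<SortedSet<String>> perRegion = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String start = regionStartKeys.get(i);
            SortedSet<String> view = (i + 1 < regionStartKeys.size())
                ? joinKeys.subSet(start, regionStartKeys.get(i + 1)) // start inclusive, end exclusive
                : joinKeys.tailSet(start);                           // last region: no upper bound
            perRegion.add(new TreeSet<>(view)); // copy so each slice is independent
        }
        return perRegion;
    }
}
```

Each slice could then seed the skip scan filter for its region, instead of every region server receiving the full key set.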

So it seems like there are a lot of different options. I suspect it'll become 
clearer as you get into it which is the best. Let me know how I can help.

> Optimize child/parent foreign key joins
> ---------------------------------------
>
>                 Key: PHOENIX-852
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-852
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Maryann Xue
>
> Oftentimes a join will occur from a child to a parent. Our current algorithm 
> would do a full scan of one side or the other. We can do much better than 
> that if the HashCache contains the PK (or even part of the PK) from the table 
> being joined to. In these cases, we should drive the second scan through a 
> skip scan on the server side.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
