Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/7379#issuecomment-122333357
  
    No problem.
    
    ### Supporting N-Ary Predicates
    In order to make the range join work, the predicates need to define a single interval for each side of the join. For instance, the clause ```a.low < b.high && b.low < a.high``` implies that there are two intervals: [a.low, a.high] and [b.low, b.high]. An open interval, for instance ```a.low < b.high```, would also work.
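    As a minimal, Spark-free sketch of that condition (the `Interval` class and `overlaps` function are illustrative names, not Spark APIs): two rows can only match when their intervals overlap.

    ```scala
    // Illustrative sketch, not Spark code: the binary range predicate
    // a.low < b.high && b.low < a.high is exactly an interval-overlap test.
    case class Interval(low: Double, high: Double)

    // Strict '<' mirrors the clause above; use '<=' for closed bounds.
    def overlaps(a: Interval, b: Interval): Boolean =
      a.low < b.high && b.low < a.high
    ```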
    
    When we use more than two clauses, we can potentially have multiple intervals. Your example, ```a.key1 < b.key1 and a.key2 > b.key2 and a.key3 >= b.key3```, would yield the following intervals: [a.key1, a.key2], [a.key1, a.key3], [b.key2, b.key1] and [b.key2, b.key3]. Creating a working index that can deal with the (partially) uncorrelated intervals will be quite a challenge (I haven't really looked into this yet). We could of course join on one pair of intervals and use filtering to take care of the rest.
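    The join-on-one-pair-and-filter idea could look roughly like this sketch (plain Scala, not Spark; the case classes and key names are hypothetical):

    ```scala
    // Illustrative sketch, not Spark code: join on a single interval pair
    // and apply the remaining range clauses as a residual filter.
    case class A(key1: Int, key2: Int, key3: Int)
    case class B(key1: Int, key2: Int, key3: Int)

    def joinThenFilter(as: Seq[A], bs: Seq[B]): Seq[(A, B)] =
      for {
        a <- as
        b <- bs
        if a.key1 < b.key1                      // the interval pair we join on
        if a.key2 > b.key2 && a.key3 >= b.key3  // residual clauses as a filter
      } yield (a, b)
    ```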
    
    I think the Unary and Binary cases are the most common. Let's start there, 
and see if there is demand for N-ary designs.
    
    ### Generalization
    If you consider the fact that we are joining intervals (ranges, if you will), range partitioning will not work, because it assumes both intervals fall entirely within the same partition, whereas they can span multiple partitions. When dealing with larger tables we would have to use a special interval-aware partitioning: create partitions for a number of fully covering, non-overlapping intervals, and multicast each row to every partition its interval overlaps. The subsequent step would be using an index or doing a cartesian/BNL join.
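    The multicast step could be sketched like this (plain Scala, not Spark; fixed-width buckets and the `multicast` helper are assumptions for illustration only):

    ```scala
    // Illustrative sketch, not Spark code: fixed-width, non-overlapping buckets
    // fully cover the domain, and a row is multicast (duplicated) into every
    // bucket its interval touches.
    val bucketWidth = 10.0

    // Indices of all buckets covered by [low, high].
    def buckets(low: Double, high: Double): Seq[Int] =
      math.floor(low / bucketWidth).toInt to math.floor(high / bucketWidth).toInt

    // Emit (bucketId, row) once per covered bucket; a shuffle on the bucket id
    // would then co-locate all potentially overlapping rows.
    def multicast[T](row: T, low: Double, high: Double): Seq[(Int, T)] =
      buckets(low, high).map(bucket => (bucket, row))
    ```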
    
    Doing a cartesian join in a single partition performs horribly. I also thought it wouldn't be a problem, but it completely killed the performance of an analysis I was doing for a client (account balances at specific dates).
    
    I do see opportunities for code reuse, but this would mean generalizing HashedRelation and the broadcast join family.

