Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/7379#issuecomment-122333357 No problem. ### Supporting N-Ary Predicates. In order to make the range join work we need the predicates to define a single interval for each side of the join. For instance the clause: ```a.low < b.high && b.low < a.high``` implies that there are two intervals: [a.low, a.high] & [b.low, b.high]. An open interval, for instance ```a.low < b.high```, would also work. When we use more than two clauses, we can potentially have multiple intervals, in your example for instance ```a.key < b.key and a.key2 > b.key2 and a.key3>=b.key3``` would yield the following intervals: [a.key1, a.key2], [a.key1, a.key3], [b.key2, b.key1] & [b.key2, b.key3]. Creating a working index, that can deal with the (partially) uncorrelated intervals, will be quite a challenge (I haven't really looked into this yet). We could offcourse pick join on one pair of intervals and use filtering to take of the rest. I think the Unary and Binary cases are the most common. Let's start there, and see if there is demand for N-ary designs. ### Generalization If you consider the fact that we are joining intervals (Ranges if you will), range partitioning will not work because this assumes both intervals will be entirely in the same partition (they can span multiple partitions). When dealing with larger tables we would have to use a special interval-aware partitioning, this would create partitions for a number of fully covering non-overlapping intervals, and would multicast the rows to each interval it belongs to. The subsequent step would be using an index or doing a cartesian/BNL join. Doing a Cartesian Join in a single partition performs horrible. I thought it wouldn't be a problem either, but this completely killed the performance of an analysis I was doing for a client (account balances at specific dates). I do see opportunities for code re-use. But this would be by generalizing HashedRelation and the BroadCast join family.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org