Igniters, I'm going to start working on distributed SQL joins soon and want to put my thoughts on this matter here.
To provide collocated and non-collocated joins we need to know affinity key for each table. For example we have a table `Organization(id int)` where `id` is affinity key and `Person(orgId int)` where `orgId` is affinity key. This way when we see join `Person p, Organization o on p.orgId = o.id` we know that it does not make sense to try to find joins between different nodes because if the affinity fields are equal then they must be on the same node. Also this allows to handle transitive collocated joins like `Person pe, Product pr join on pe.orgId = pr.orgId` where `orgId` is affinity key for `Product` but is not a primary key. This will be a backward incompatible configuration change. This configuration approach is consistent with what MemSQL does. Obviously we can have multiple tables joined on different keys in the same query. Lets say `+` is a join on collocated key and `-` is a join on non-collocated key. Suppose we have the following join in query `a + b + c - d + e - f` A. We can run it the following way with shuffle: run collocated `(a + b + c) = m` , `(d + e) = n` resulting into `m - n - f` Then there are two possibilities (because we have only 3 non-collocated entities) either they joined on the same key or not. In the the first case we can shuffle them in a single step `(m - n - f) = z` to achieve collocation on joined fields, in the second we have to do shuffle `(m - n) = k` and then shuffle `(k - f) = z` which will be our resulting entity which is known to be collocated on joined fields of `k` and `f`. B. Another approach is to request data from remote nodes as needed for a join. It means that we are running locally `(a + b + c)` and when we fetch a row from there, we request a joining part for this row from `(d + e)` part. And if it is known that we join on affinity key from `(d + e)` then we know on which node this part exists exactly. Otherwise we have to broadcast this request. After that the same must happen with `f` part. Of course batching here is applicable as well, no need to request data for each node separately. If we analyze these two approaches from the performance standpoint then we can see that the best one is B with known affinity key of remote side. It has the same number of messages as shuffle but reduces traffic because on request we need to send only keys (while on shuffle we need to send the whole local table part to be joined on a third node) and in response we will receive only joined row parts (while in shuffle we need to send the whole remote table part to be joined on a third node). Obviously B without known affinity key requires broadcast and is not practically useful. But it seems that such a case will be quite uncommon since it is a join of two partitioned tables which are in turn collocated with two different partitioned tables. Since implementing approach A is much more complex (it will require more complex query planning and generation as well as more complex coordination between nodes) and B with known affinity keys is simpler and more effective my preference is to implement B with known affinity keys, and forbid case when we don't know remote affinity key. Feel free to comment. Sergi