Distributive SQL Joins

Sergi Vladykin Thu, 06 Aug 2015 23:30:59 -0700

Igniters,

I'm going to start working on distributed SQL joins soon and want to put my
thoughts on this matter here.



To provide collocated and non-collocated joins we need to know affinity key
for each table.
For example we have a table `Organization(id int)` where `id` is affinity
key and `Person(orgId int)`
where `orgId` is affinity key. This way when we see join `Person p,
Organization o on p.orgId = o.id`
we know that it does not make sense to try to find joins between different
nodes because
if the affinity fields are equal then they must be on the same node. Also
this allows to handle
transitive collocated joins like `Person pe, Product pr join on pe.orgId =
pr.orgId` where `orgId`
is affinity key for `Product` but is not a primary key. This will be a
backward incompatible configuration change.
This configuration approach is consistent with what MemSQL does.

Obviously we can have multiple tables joined on different keys in the same
query.
Lets say `+` is a join on collocated key and `-` is a join on
non-collocated key.
Suppose we have the following join in query
`a + b + c - d + e - f`

A. We can run it the following way with shuffle:
run collocated `(a + b + c) = m` , `(d + e) = n` resulting into
`m - n - f`
Then there are two possibilities (because we have only 3 non-collocated
entities)
either they joined on the same key or not. In the the first case we can
shuffle them
in a single step `(m - n - f) = z` to achieve collocation on joined fields,
in the second we have
to do shuffle `(m - n) = k` and then shuffle `(k - f) = z` which will be
our
resulting entity which is known to be collocated on joined fields of `k`
and `f`.

B. Another approach is to request data from remote nodes as needed for a
join.
It means that we are running locally `(a + b + c)` and when we fetch a row
from there, we request a joining part for this row from `(d + e)` part.
And if it is known that we join on affinity key from `(d + e)` then we know
on which node this part exists exactly. Otherwise we have to broadcast this
request.
After that the same must happen with `f` part. Of course batching here is
applicable
as well, no need to request data for each node separately.

If we analyze these two approaches from the performance standpoint then
we can see that the best one is B with known affinity key of remote side.
It has the same number of messages as shuffle but reduces traffic because
on request we need to send only keys (while on shuffle we need to send the
whole local table part to be joined on a third node) and in response we
will receive only
joined row parts (while in shuffle we need to send the whole remote
table part to be joined on a third node).

Obviously B without known affinity key requires broadcast and is not
practically
useful. But it seems that such a case will be quite uncommon since it is a
join of two partitioned tables which are in turn collocated with two
different partitioned tables.

Since implementing approach A is much more complex (it will require more
complex
query planning and generation as well as more complex coordination between
nodes)
and B with known affinity keys is simpler and more effective
my preference is to implement B with known affinity keys, and forbid
case when we don't know remote affinity key.

Feel free to comment.

Sergi

Distributive SQL Joins

Reply via email to