Sergio,

That's a good idea, and it is necessary work. However, for now it would be
better to apply some of your approaches to the distributed planning part,
which has a global view of relations and running tasks.

Actually, I have also been considering histogram-based hash partitioning of
the intermediate data exchanged between execution blocks. I believe that
Tajo can handle most skew problems if the intermediate data are evenly
distributed across the cluster nodes.

Probably, the most straightforward solution is as follows. In the current
implementation, a running subquery collects statistics from its running
query units (tasks). The statistics for each task include the hash keys on
the specified columns and the number of bytes for each key. Using these
statistics, Tajo could build evenly distributed sets of hash keys, which
could then be used to create join tasks with evenly balanced loads; this
even-load task creation is not implemented yet and is planned as future
work. Like one of your approaches, we could store the statistics of the
joined tables in the Tajo catalog so that they could also be reused for
later query processing.
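To make the idea concrete, here is a minimal sketch of how the collected per-key byte counts could be grouped into evenly loaded key sets using a simple greedy (largest-first) heuristic. The class and method names are hypothetical, not existing Tajo code:

```java
import java.util.*;

// Hypothetical sketch: assign hash keys to N join tasks so that the total
// byte count per task is roughly balanced (longest-processing-time heuristic).
public class KeyPartitioner {
  /**
   * @param bytesPerKey collected statistics: hash key -> total bytes for that key
   * @param numTasks    number of join tasks to create
   * @return one set of hash keys per task, with roughly equal byte totals
   */
  public static List<Set<String>> buildPartitions(Map<String, Long> bytesPerKey,
                                                  int numTasks) {
    // Sort keys by byte count, largest first, so the big keys are placed early.
    List<Map.Entry<String, Long>> sorted = new ArrayList<>(bytesPerKey.entrySet());
    sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

    List<Set<String>> partitions = new ArrayList<>();
    long[] loads = new long[numTasks];
    for (int i = 0; i < numTasks; i++) {
      partitions.add(new HashSet<>());
    }

    // Always append the next key to the currently lightest partition.
    for (Map.Entry<String, Long> e : sorted) {
      int lightest = 0;
      for (int i = 1; i < numTasks; i++) {
        if (loads[i] < loads[lightest]) {
          lightest = i;
        }
      }
      partitions.get(lightest).add(e.getKey());
      loads[lightest] += e.getValue();
    }
    return partitions;
  }
}
```

Of course this cannot split a single extremely heavy key across tasks, but for moderately skewed distributions it should keep the per-task loads close to even.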

Best regards,
Hyunsik



On Tue, Jun 18, 2013 at 1:17 PM, Sergio Esteves <[email protected]> wrote:

> Hi,
>
> I have been thinking about the partition phase in hybrid hash join and
> would like to discuss it with the rest of the community.
>
> During the partitioning of data into buckets, a very important issue is
> how to deal with skewed buckets (many tuples sharing the same join key),
> which may cause hash table overflow, i.e., partitions that do not fit in
> memory.
>
> It is hard to divide the hash space into buckets of different sizes
> without knowing the join key distribution beforehand. Therefore, one
> possible way is to partition the overflow buckets one level further (this
> process may recurse to deeper levels). Another way that is sometimes
> followed is to create very small partitions of the data and then combine
> them according to their sizes (which can also fail if the number of
> duplicates inside a bucket is too high).
>
> I have been discussing this with my GSoC mentor, and we think the best
> and most efficient solution would be to have a histogram of the
> distribution of the join keys, so that we can partition the hash space
> more evenly across the buckets. Such a histogram can be maintained for
> every table that can participate in a join (i.e., that has a foreign key
> or can be referenced as a foreign key by another table). Hence, it is
> necessary to update the histograms every time data is updated or deleted
> in the respective tables. Another option we discussed was to build the
> histogram the first time the hybrid hash join physical operator is
> invoked for a given join query (i.e., during query execution). The next
> time the same query, or a similar query containing the same inner
> relation, is submitted, the histogram would already be there, and thus
> the hybrid hash join could be executed more efficiently. This latter
> approach has the drawback of ignoring changes between consecutive join
> calls. On the other hand, maintaining a histogram for every table and
> updating it every time changes occur can be costly and introduce some
> overhead.
>
> What do you guys think?
>
> Thanks.
>
