Hi, I have been thinking about the partitioning phase of hybrid hash join and would like to discuss it with the rest of the community.
When partitioning data into buckets, a very important issue is how to deal with skewed buckets (many tuples sharing the same join key), which can cause hash table overflow, i.e., partitions that do not fit in memory. It is hard to divide the hash space into buckets of different sizes without knowing the join key distribution beforehand.

One possible way is to partition the overflow buckets one level further (this process may recurse to deeper levels). Another approach sometimes followed is to create many very small partitions and then combine them according to their sizes (which can also fail if the number of duplicates inside a single bucket is too high).

I have been discussing this with my GSoC mentor, and we think the best and most efficient solution would be to keep a histogram of the distribution of the join keys, so that we can partition the hash space more evenly across the buckets. Such a histogram could be maintained for every table that can participate in a join (i.e., any table that has a foreign key or that can be referenced by a foreign key of another table). This means the histograms would have to be updated every time data is inserted into, updated in, or deleted from the respective tables.

The other option we were discussing was to build the histogram the first time the hybrid hash join physical operator is called for a given join query (i.e., during query execution). The second time the same query is submitted, or a similar query involving the same inner relation, the histogram would already be there and the hybrid hash join could be executed more efficiently. This latter approach has the drawback of ignoring changes to the data between consecutive join calls. On the other hand, maintaining a histogram for every table and updating it on every change can be costly and introduce overhead.

What do you guys think? Thanks.
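To make the histogram idea more concrete, here is a toy sketch (in Python, purely illustrative; the function names are my own and not from any existing codebase) of how a frequency histogram over the join keys could drive bucket assignment. Instead of hashing keys blindly, heavy keys are placed greedily into the currently lightest partition, which keeps partition sizes roughly balanced under skew:

```python
from collections import Counter

def build_histogram(keys):
    """Count occurrences of each join key (could come from a sample or
    from statistics maintained on the table)."""
    return Counter(keys)

def assign_partitions(hist, num_partitions):
    """Greedily assign keys to partitions, heaviest key first, always
    choosing the partition with the smallest current tuple count.
    Returns the key->partition map and the per-partition loads."""
    loads = [0] * num_partitions
    assignment = {}
    for key, freq in hist.most_common():  # descending frequency
        p = min(range(num_partitions), key=lambda i: loads[i])
        assignment[key] = p
        loads[p] += freq
    return assignment, loads

# Example with a skewed key distribution: 'a' is a heavy hitter.
keys = ['a'] * 100 + ['b'] * 50 + ['c'] * 50 + list('defgh') * 10
hist = build_histogram(keys)
assignment, loads = assign_partitions(hist, 4)
```

Note that even with a perfect histogram, a single key whose duplicate count exceeds memory still cannot be split across partitions (all its tuples must meet in one hash table), so recursive partitioning or a block nested-loop fallback would still be needed for that extreme case.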
