Hi, it's a good idea, but I think that histograms can also be used to decide the optimal physical execution plan. In other words, the physical planner can select the optimal plan based on the data distribution.
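As a rough sketch of that idea (in Python, with entirely hypothetical names; this is not Tajo's planner API), the planner could inspect a per-key frequency histogram and the physical sort order to pick a join algorithm:

```python
from collections import Counter

def choose_join_algorithm(key_counts, sorted_on_key, skew_threshold=0.5):
    """Pick a join strategy from a key-frequency histogram (hypothetical sketch).

    key_counts:     Counter mapping join-key value -> tuple count
    sorted_on_key:  whether the relation is physically sorted on the join key
    skew_threshold: fraction of all tuples held by the most frequent key,
                    above which the distribution is treated as skewed
    """
    if sorted_on_key:
        # Exploit the existing order; no hash table needed.
        return "sort-merge join"
    total = sum(key_counts.values())
    heaviest = max(key_counts.values()) / total
    if heaviest <= skew_threshold:
        # Keys are well balanced, so buckets will hash evenly.
        return "hash join"
    # A few heavy keys dominate; they need their own partitions.
    return "skew-aware hash join"
```

For example, a relation with counts `Counter({'a': 90, 'b': 5, 'c': 5})` would be flagged as skewed, while a uniform distribution falls through to a plain hash join. The threshold and the single-heaviest-key test are placeholders; a real planner would use richer statistics.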
For example, if a relation is sorted on a join key, the physical planner might decide to use a sort-merge join. If the join-key distribution is well balanced, a hash join can be used.

Building a histogram requires a scan of the whole relation, which imposes a high overhead on the system. To reduce this overhead, Tajo can piggyback histogram construction onto query processing. Given a query, the global planner first checks whether a histogram exists for the relevant relation. If it does not, the global planner builds a query plan that also constructs a histogram of the relation. Once the histogram is available, the global planner and physical planner can choose a better query plan based on the data distribution. Histograms should be maintained by the Tajo master, because they can also be used to build distributed plans.

I think that changes to a small portion of a relation are not common, especially for large-scale data storage. (As you know, for large data, a bulk load of the whole data might be faster than updates to a small portion of it.)

If you want to work on physical query optimization, I recommend that you proceed with that work separately from the GSoC project. If you do, I am prepared to give you any advice.

Thanks,
Jihoon

2013/6/18 Sergio Esteves <[email protected]>

> Hi,
>
> I have been thinking about the partition phase in hybrid hash join and would like to discuss it with the rest of the community.
>
> During the partitioning of data into buckets, a very important issue is how to deal with skewed buckets (many tuples sharing the same join key), which may cause hash-table overflow, i.e., partitions that do not fit in memory.
>
> It is hard to divide the hash space into buckets of different sizes without knowing the join-key distribution beforehand. Therefore, one possible way is to partition the overflow buckets one level further (this process may continue to deeper levels).
> Another way that is sometimes followed is to create very small partitions of the data and then combine them according to their sizes (which can also fail if the number of duplicates inside a bucket is too high).
>
> I have been discussing this with my GSoC mentor, and we think the best and most efficient solution would be to have a histogram of the distribution of the join keys, so that we can partition the hash space more evenly across the buckets. Such a histogram can be maintained for every table that can be used in a join (i.e., that has a foreign key or can be referenced as a foreign key by another table). Hence, it is necessary to update the histograms every time data is updated or deleted in the respective tables. Another approach we discussed was to build the histogram the first time the hybrid hash join physical operator is called for a certain join query (i.e., during query execution). The second time the same query is submitted, or a similar query containing the same inner relation, the histogram would already be there, so the hybrid hash join could be executed more efficiently. This latter approach has the drawback of ignoring changes between consecutive join calls. On the other hand, maintaining a histogram for every table and updating it every time changes occur can be costly and introduce some overhead.
>
> What do you guys think?
>
> Thanks.

--
Jihoon Son
Database & Information Systems Group, Prof. Yon Dohn Chung Lab.
Dept. of Computer Science & Engineering, Korea University
1, 5-ga, Anam-dong, Seongbuk-gu, Seoul, 136-713, Republic of Korea
Tel : +82-2-3290-3580
E-mail : [email protected]
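To make the histogram-based partitioning idea from the thread concrete, here is a minimal Python sketch (hypothetical names, not Tajo code): an equi-depth histogram over the join keys yields cut points that split the key space into buckets holding roughly equal numbers of tuples.

```python
from bisect import bisect_right

def equi_depth_boundaries(keys, num_buckets):
    """Cut points of an equi-depth histogram: split the sorted key list
    into num_buckets buckets of roughly equal tuple count."""
    keys = sorted(keys)
    n = len(keys)
    return [keys[(i * n) // num_buckets] for i in range(1, num_buckets)]

def bucket_of(key, boundaries):
    """Index of the bucket a key falls into, given the cut points."""
    return bisect_right(boundaries, key)
```

For 100 uniform keys and 4 buckets, each bucket receives exactly 25 tuples. Note the failure mode Sergio mentions: duplicates of a single key cannot be split by any boundary, so a key holding more tuples than fit in memory still overflows its bucket and needs separate handling (e.g., partitioning that bucket a level further).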
