On Sat, Oct 1, 2022 at 12:45 AM Zhihong Yu <z...@yugabyte.com> wrote:
> > > On Fri, Sep 30, 2022 at 9:20 PM Zhihong Yu <z...@yugabyte.com> wrote: > >> >> >> On Fri, Sep 30, 2022 at 8:40 PM Zhihong Yu <z...@yugabyte.com> wrote: >> >>> >>> >>> On Fri, Sep 30, 2022 at 3:44 PM Zheng Li <zhengl...@gmail.com> wrote: >>> >>>> Hello, >>>> >>>> A bloom filter provides early filtering of rows that cannot be joined >>>> before they would reach the join operator, the optimization is also >>>> called a semi join filter (SJF) pushdown. Such a filter can be created >>>> when one child of the join operator must materialize its derived table >>>> before the other child is evaluated. >>>> >>>> For example, a bloom filter can be created using the the join keys for >>>> the build side/inner side of a hash join or the outer side of a merge >>>> join, the bloom filter can then be used to pre-filter rows on the >>>> other side of the join operator during the scan of the base relation. >>>> The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is good >>>> discussion on using such optimization for hash join without going into >>>> the pushdown of the filter where its performance gain could be further >>>> increased. >>>> >>>> We worked on prototyping bloom filter pushdown for both hash join and >>>> merge join. Attached is a patch set for bloom filter pushdown for >>>> merge join. We also plan to send the patch for hash join once we have >>>> it rebased. >>>> >>>> Here is a summary of the patch set: >>>> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early >>>> during the table scan instead of later on. >>>> -The bloom filter is pushed down along the execution tree to >>>> the target SeqScan nodes. >>>> -Experiments show that this optimization can speed up Merge >>>> Join by up to 36%. >>>> >>>> 2. The planner makes the decision to use the bloom filter based on the >>>> estimated filtering rate and the expected performance gain. >>>> -The planner accomplishes this by estimating four numbers per >>>> variable - the total number of rows of the relation, the number of >>>> distinct values for a given variable, and the minimum and maximum >>>> value of the variable (when applicable). Using these numbers, the >>>> planner estimates a filtering rate of a potential filter. >>>> -Because actually creating and implementing the filter adds >>>> more operations, there is a minimum threshold of filtering where the >>>> filter would actually be useful. Based on testing, we query to see if >>>> the estimated filtering rate is higher than 35%, and that informs our >>>> decision to use a filter or not. >>>> >>>> 3. If using a bloom filter, the planner also adjusts the expected cost >>>> of Merge Join based on expected performance gain. >>>> >>>> 4. Capability to build the bloom filter in parallel in case of >>>> parallel SeqScan. This is done efficiently by populating a local bloom >>>> filter for each parallel worker and then taking a bitwise OR over all >>>> the local bloom filters to form a shared bloom filter at the end of >>>> the parallel SeqScan. >>>> >>>> 5. The optimization is GUC controlled, with settings of >>>> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter. >>>> >>>> We found in experiments that there is a significant improvement >>>> when using the bloom filter during Merge Join. One experiment involved >>>> joining two large tables while varying the theoretical filtering rate >>>> (TFR) between the two tables, the TFR is defined as the percentage >>>> that the two datasets are disjoint. Both tables in the merge join were >>>> the same size. We tested changing the TFR to see the change in >>>> filtering optimization. >>>> >>>> For example, let’s imagine t0 has 10 million rows, which contain the >>>> numbers 1 through 10 million randomly shuffled. Also, t1 has the >>>> numbers 4 million through 14 million randomly shuffled. Then the TFR >>>> for a join of these two tables is 40%, since 40% of the tables are >>>> disjoint from the other table (1 through 4 million for t0, 10 million >>>> through 14 million for t4). >>>> >>>> Here is the performance test result joining two tables: >>>> TFR: theoretical filtering rate >>>> EFR: estimated filtering rate >>>> AFR: actual filtering rate >>>> HJ: hash join >>>> MJ Default: default merge join >>>> MJ Filter: merge join with bloom filter optimization enabled >>>> MJ Filter Forced: merge join with bloom filter optimization forced >>>> >>>> TFR EFR AFR HJ MJ Default MJ Filter MJ Filter Forced >>>> >>>> ------------------------------------------------------------------------------------- >>>> 10 33.46 7.41 6529 22638 21949 23160 >>>> 20 37.27 14.85 6483 22290 21928 21930 >>>> 30 41.32 22.25 6395 22374 20718 20794 >>>> 40 45.67 29.7 6272 21969 19449 19410 >>>> 50 50.41 37.1 6210 21412 18222 18224 >>>> 60 55.64 44.51 6052 21108 17060 17018 >>>> 70 61.59 51.98 5947 21020 15682 15737 >>>> 80 68.64 59.36 5761 20812 14411 14437 >>>> 90 77.83 66.86 5701 20585 13171 13200 >>>> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two >>>> Tables of 10M Rows. >>>> >>>> Attached you can find figures of the same performance test and a SQL >>>> script >>>> to reproduce the performance test. >>>> >>>> The first thing to notice is that Hash Join generally is the most >>>> efficient join strategy. This is because Hash Join is better at >>>> dealing with small tables, and our size of 10 million is still small >>>> enough where Hash Join outperforms the other join strategies. Future >>>> experiments can investigate using much larger tables. >>>> >>>> However, comparing just within the different Merge Join variants, we >>>> see that using the bloom filter greatly improves performance. >>>> Intuitively, all of these execution times follow linear paths. >>>> Comparing forced filtering versus default, we can see that the default >>>> Merge Join outperforms Merge Join with filtering at low filter rates, >>>> but after about 20% TFR, the Merge Join with filtering outperforms >>>> default Merge Join. This makes intuitive sense, as there are some >>>> fixed costs associated with building and checking with the bloom >>>> filter. In the worst case, at only 10% TFR, the bloom filter makes >>>> Merge Join less than 5% slower. However, in the best case, at 90% TFR, >>>> the bloom filter improves Merge Join by 36%. >>>> >>>> Based on the results of the above experiments, we came up with a >>>> linear equation for the performance ratio for using the filter >>>> pushdown from the actual filtering rate. Based on the numbers >>>> presented in the figure, this is the equation: >>>> >>>> T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863) >>>> >>>> For example, this means that with an estimated filtering rate of 0.4, >>>> the execution time of merge join is estimated to be improved by 16.3%. >>>> Note that the estimated filtering rate is used in the equation, not >>>> the theoretical filtering rate or the actual filtering rate because it >>>> is what we have during planning. In practice the estimated filtering >>>> rate isn’t usually accurate. In fact, the estimated filtering rate can >>>> differ from the theoretical filtering rate by as much as 17% in our >>>> experiments. One way to mitigate the power loss of bloom filter caused >>>> by inaccurate estimated filtering rate is to adaptively turn it off at >>>> execution time, this is yet to be implemented. >>>> >>>> Here is a list of tasks we plan to work on in order to improve this >>>> patch: >>>> 1. More regression testing to guarantee correctness. >>>> 2. More performance testing involving larger tables and complicated >>>> query plans. >>>> 3. Improve the cost model. >>>> 4. Explore runtime tuning such as making the bloom filter checking >>>> adaptive. >>>> 5. Currently, only the best single join key is used for building the >>>> Bloom filter. However, if there are several keys and we know that >>>> their distributions are somewhat disjoint, we could leverage this fact >>>> and use multiple keys for the bloom filter. >>>> 6. Currently, Bloom filter pushdown is only implemented for SeqScan >>>> nodes. However, it would be possible to allow push down to other types >>>> of scan nodes. >>>> 7. Explore if the Bloom filter could be pushed down through a foreign >>>> scan when the foreign server is capable of handling it – which could >>>> be made true for postgres_fdw. >>>> 8. Better explain command on the usage of bloom filters. >>>> >>>> This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback >>>> is appreciated. >>>> >>>> With Regards, >>>> Zheng Li >>>> Amazon RDS/Aurora for PostgreSQL >>>> >>>> [1] >>>> https://www.postgresql.org/message-id/flat/c902844d-837f-5f63-ced3-9f7fd222f175%402ndquadrant.com >>> >>> >>> Hi, >>> In the header of patch 1: >>> >>> In this prototype, the cost model is based on an assumption that there >>> is a linear relationship between the performance gain from using a semijoin >>> filter and the estimated filtering rate: >>> % improvement to Merge Join cost = 0.83 * estimated filtering rate - >>> 0.137. >>> >>> How were the coefficients (0.83 and 0.137) determined ? >>> I guess they were based on the results of running certain workload. >>> >>> Cheers >>> >> Hi, >> For patch 1: >> >> +bool enable_mergejoin_semijoin_filter; >> +bool force_mergejoin_semijoin_filter; >> >> How would (enable_mergejoin_semijoin_filter = off, >> force_mergejoin_semijoin_filter = on) be interpreted ? >> Have you considered using one GUC which has three values: off, enabled, >> forced ? >> >> + mergeclauses_for_sjf = >> get_actual_clauses(path->path_mergeclauses); >> + mergeclauses_for_sjf = >> get_switched_clauses(path->path_mergeclauses, >> + >> path->jpath.outerjoinpath->parent->relids); >> >> mergeclauses_for_sjf is assigned twice and I don't >> see mergeclauses_for_sjf being reference in the call >> to get_switched_clauses(). >> Is this intentional ? >> >> + /* want at least 1000 rows_filtered to avoid any nasty edge >> cases */ >> + if (force_mergejoin_semijoin_filter || (filteringRate >= 0.35 >> && rows_filtered > 1000)) >> >> The above condition is narrower compared to the enclosing condition. >> Since there is no else block for the second if block, please merge the >> two if statements. >> >> + int best_filter_clause; >> >> Normally I would think `clause` is represented by List*. But >> best_filter_clause is an int. Please use another variable name so that >> there is less chance of confusion. >> >> For evaluate_semijoin_filtering_rate(): >> >> + double best_sj_selectivity = 1.01; >> >> How was 1.01 determined ? >> >> + debug_sj1("SJPD: start evaluate_semijoin_filtering_rate"); >> >> There are debug statements in the methods. >> It would be better to remove them in the next patch set. >> >> Cheers >> > Hi, > Still patch 1. > > + if (!outer_arg_md->is_or_maps_to_base_column > + && !inner_arg_md->is_or_maps_to_constant) > + { > + debug_sj2("SJPD: outer equijoin arg does not map %s", > + "to a base column nor a constant; semijoin is not > valid"); > > Looks like there is a typo: inner_arg_md->is_or_maps_to_constant should > be outer_arg_md->is_or_maps_to_constant > > + if (outer_arg_md->est_col_width > MAX_SEMIJOIN_SINGLE_KEY_WIDTH) > + { > + debug_sj2("SJPD: outer equijoin column's width %s", > + "was excessive; condition rejected"); > > How is the value of MAX_SEMIJOIN_SINGLE_KEY_WIDTH determined ? > > For verify_valid_pushdown(): > > + Assert(path); > + Assert(target_var_no > 0); > + > + if (path == NULL) > + { > + return false; > > I don't understand the first assertion. Does it mean path would always be > non-NULL ? Then the if statement should be dropped. > > + if (path->parent->relid == target_var_no) > + { > + /* > + * Found source of target var! We know that the > pushdown > + * is valid now. > + */ > + return true; > + } > + return false; > > The above can be simplified as: return path->parent->relid == > target_var_no; > > + * True if the given con_exprs, ref_exprs and operators will exactlty > > Typo: exactlty -> exactly > > + if (!bms_equal(all_vars, matched_vars)) > + return false; > + return true; > > The above can be simplified as: return bms_equal(all_vars, matched_vars); > > Cheers > Hi, Still in patch 1 :-) + if (best_path->use_semijoinfilter) + { + if (best_path->best_mergeclause != -1) Since there is no else block, the two conditions can be combined. + ListCell *clause_cell = list_nth_cell(mergeclauses, best_path->best_mergeclause); As shown in the above code, best_mergeclause is the position of the best merge clause in mergeclauses. I think best_mergeclause_pos (or similar name) is more appropriate for the fieldname. For depth_of_semijoin_target(): + * Parameters: + * node: plan node to be considered for semijoin push down. The name of the parameter is pn - please align the comment with code. For T_SubqueryScan case in depth_of_semijoin_target(): + Assert(rte->subquery->targetList); ... + if (rel && rel->subroot + && rte && rte->subquery && rte->subquery->targetList) It seems the condition can be simplified since rte->subquery->targetList has passed the assertion. For is_table_scan_node_source_of_relids_or_var(), the else block can be simplified to returning scan_node_varno == target_var->varno directly. For get_appendrel_occluded_references(): + * Given a virtual column from an Union ALL subquery, + * return the expression it immediately occludes that satisfy Since the index is returned from the func, it would be better to clarify the comment by saying `return the last index of expression ...` + /* Subquery without append and partitioned tables */ append and partitioned tables -> append or partitioned tables More reviews for subsequent patches to follow.