Hello Mike Percy, Kudu Jenkins, Grant Henke, Todd Lipcon, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12197 to look at the new patch set (#5). Change subject: generic_iterators: basic MergeIterator dominance ...................................................................... generic_iterators: basic MergeIterator dominance This patch adds a dominance algorithm to the MergeIterator. The idea is: rather than considering every sub-iterator for merging, we can omit a sub-iterator if its current block's last primary key is lexicographically greater than another sub-iterator's current block's first primary key. In this state the omitted sub-iterator is said to be "fully dominated" and the two sub-iterators are said to have a "dominance relation". We're already using a similar algorithm in merge compaction (see MergeCompactionInput in compaction.cc). I toyed with the idea of reusing the code but found it to be too unwieldy to be workable. Additionally, because dominance relations come and go as blocks are pulled, this patch also makes all MergeIterState lists intrusive. I ran both TestMerge and TestMergeOverlap with --num-lists=100, once with dominance and once without. The table below shows the average number of comparisons and running time (across 10 iterations) for all four cases. Basic-overlap | basic-nonoverlap | dom-overlap | dom-nonoverlap --------------+------------------+---------------+--------------- avg comps: 86,289,036 | 43,316,333 | 81,976,599 | 18,364,238 avg rt: 1.31s +- 0.6% | 0.57s +- 1.2% | 1.20s +- 0.7% | 0.28s +- 9.4% The results show a mild improvement in running time when the input is fully overlapping and a more significant one (about 2x) when non-overlapping. Real tablets are likely to fall in between the two ends of the spectrum, edging closer to non-overlapping the more they are compacted. Note: the higher variances in the non-overlapped runs are due to more significant differences between the merge input. More optimizations are on the way: - Better memory consumption by establishing dominance using rowset bounds. - Faster merge by copying block-by-block when there's only one non-dominated sub-iterator. Change-Id: If59d831240af15bfa7ef5709ec3d105d13b28322 --- M src/kudu/common/generic_iterators-test.cc M src/kudu/common/generic_iterators.cc 2 files changed, 258 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/12197/5 -- To view, visit http://gerrit.cloudera.org:8080/12197 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If59d831240af15bfa7ef5709ec3d105d13b28322 Gerrit-Change-Number: 12197 Gerrit-PatchSet: 5 Gerrit-Owner: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Grant Henke <granthe...@apache.org> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Todd Lipcon <t...@apache.org>