[kudu-CR] generic iterators: basic MergeIterator dominance

Adar Dembo (Code Review) Fri, 11 Jan 2019 15:52:55 -0800

Hello Mike Percy, Kudu Jenkins, Grant Henke, Todd Lipcon,

I'd like you to reexamine a change. Please visit


    http://gerrit.cloudera.org:8080/12197

to look at the new patch set (#5).

Change subject: generic_iterators: basic MergeIterator dominance
......................................................................

generic_iterators: basic MergeIterator dominance

This patch adds a dominance algorithm to the MergeIterator. The idea is:
rather than considering every sub-iterator for merging, we can omit a
sub-iterator if its current block's last primary key is lexicographically
greater than another sub-iterator's current block's first primary key. In
this state the omitted sub-iterator is said to be "fully dominated" and the
two sub-iterators are said to have a "dominance relation".

We're already using a similar algorithm in merge compaction (see
MergeCompactionInput in compaction.cc). I toyed with the idea of reusing the
code but found it to be too unwieldy to be workable.

Additionally, because dominance relations come and go as blocks are pulled,
this patch also makes all MergeIterState lists intrusive.

I ran both TestMerge and TestMergeOverlap with --num-lists=100, once with
dominance and once without. The table below shows the average number of
comparisons and running time (across 10 iterations) for all four cases.

        Basic-overlap | basic-nonoverlap | dom-overlap   | dom-nonoverlap
        --------------+------------------+---------------+---------------
avg comps: 86,289,036 | 43,316,333       | 81,976,599    | 18,364,238
avg rt: 1.31s +- 0.6% | 0.57s +- 1.2%    | 1.20s +- 0.7% | 0.28s +- 9.4%

The results show a mild improvement in running time when the input is fully
overlapping and a more significant one (about 2x) when non-overlapping. Real
tablets are likely to fall in between the two ends of the spectrum, edging
closer to non-overlapping the more they are compacted. Note: the higher
variances in the non-overlapped runs are due to more significant differences
between the merge input.

More optimizations are on the way:
- Better memory consumption by establishing dominance using rowset bounds.
- Faster merge by copying block-by-block when there's only one non-dominated
  sub-iterator.

Change-Id: If59d831240af15bfa7ef5709ec3d105d13b28322
---
M src/kudu/common/generic_iterators-test.cc
M src/kudu/common/generic_iterators.cc
2 files changed, 258 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/12197/5
--
To view, visit http://gerrit.cloudera.org:8080/12197
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If59d831240af15bfa7ef5709ec3d105d13b28322
Gerrit-Change-Number: 12197
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

[kudu-CR] generic iterators: basic MergeIterator dominance

Reply via email to