liming30 opened a new pull request, #833:
URL: https://github.com/apache/incubator-paimon/pull/833
### Purpose
[Feature] Optimize SortMergeReader: use loser tree to reduce comparisons
(#741)
### Tests
- Unit Tests: `SortMergeReaderTestBase` already exists to verify the
correctness of `SortMergeReader`.
- Benchmark:I created a jmh benchmark to compare the performance of
different implementations,Here is the [benchmark
code](https://github.com/liming30/paimon-benchmark/blob/master/src/main/java/com/bytedance/MergeReaderBenchmark.java).
From the benchmark results, there is almost no difference between
`loser-tree` and `min-heap` when `RecordReader`/`data volume` is small. With
the increase of `RecordReader`/`data volume`, there can be about 10%
performance improvement.
```
Benchmark (readersNum) (recordNum) Mode Cnt
Score Error Units
MergeReaderBenchmark.group:min-heap 2 1000 avgt 10
0.172 ± 0.017 ms/op
MergeReaderBenchmark.group:loser-tree 2 1000 avgt 10
0.172 ± 0.015 ms/op
MergeReaderBenchmark.group:min-heap 2 10000 avgt 10
1.829 ± 0.247 ms/op
MergeReaderBenchmark.group:loser-tree 2 10000 avgt 10
1.633 ± 0.217 ms/op
MergeReaderBenchmark.group:min-heap 2 100000 avgt 10
17.664 ± 1.678 ms/op
MergeReaderBenchmark.group:loser-tree 2 100000 avgt 10
17.355 ± 1.238 ms/op
MergeReaderBenchmark.group:min-heap 5 1000 avgt 10
0.615 ± 0.095 ms/op
MergeReaderBenchmark.group:loser-tree 5 1000 avgt 10
0.616 ± 0.092 ms/op
MergeReaderBenchmark.group:min-heap 5 10000 avgt 10
6.207 ± 0.261 ms/op
MergeReaderBenchmark.group:loser-tree 5 10000 avgt 10
6.169 ± 0.221 ms/op
MergeReaderBenchmark.group:min-heap 5 100000 avgt 10
76.777 ± 6.675 ms/op
MergeReaderBenchmark.group:loser-tree 5 100000 avgt 10
61.951 ± 5.007 ms/op
MergeReaderBenchmark.group:min-heap 10 1000 avgt 10
1.719 ± 0.127 ms/op
MergeReaderBenchmark.group:loser-tree 10 1000 avgt 10
1.474 ± 0.109 ms/op
MergeReaderBenchmark.group:min-heap 10 10000 avgt 10
20.064 ± 2.266 ms/op
MergeReaderBenchmark.group:loser-tree 10 10000 avgt 10
17.543 ± 1.530 ms/op
MergeReaderBenchmark.group:min-heap 10 100000 avgt 10
186.422 ± 13.484 ms/op
MergeReaderBenchmark.group:loser-tree 10 100000 avgt 10
165.942 ± 11.544 ms/op
MergeReaderBenchmark.group:min-heap 20 1000 avgt 10
3.686 ± 0.084 ms/op
MergeReaderBenchmark.group:loser-tree 20 1000 avgt 10
3.333 ± 0.079 ms/op
MergeReaderBenchmark.group:min-heap 20 10000 avgt 10
40.124 ± 0.496 ms/op
MergeReaderBenchmark.group:loser-tree 20 10000 avgt 10
35.157 ± 0.239 ms/op
MergeReaderBenchmark.group:min-heap 20 100000 avgt 10
405.549 ± 10.104 ms/op
MergeReaderBenchmark.group:loser-tree 20 100000 avgt 10
361.460 ± 8.872 ms/op
MergeReaderBenchmark.group:min-heap 50 1000 avgt 10
11.097 ± 1.025 ms/op
MergeReaderBenchmark.group:loser-tree 50 1000 avgt 10
9.721 ± 0.975 ms/op
MergeReaderBenchmark.group:min-heap 50 10000 avgt 10
155.610 ± 18.069 ms/op
MergeReaderBenchmark.group:loser-tree 50 10000 avgt 10
131.680 ± 13.984 ms/op
MergeReaderBenchmark.group:min-heap 50 100000 avgt 10
1394.615 ± 97.327 ms/op
MergeReaderBenchmark.group:loser-tree 50 100000 avgt 10
1283.037 ± 84.270 ms/op
```
### API and Format
No
### Documentation
A variant implementation of loser tree is introduced to optimize the
comparison times of `SortMergeReader`.
Different from the traditional loser tree, since `RecordReader` and
`MergeFunction` may reuse objects, we can iterate data forward for
`RecordReader` only after all the same keys in the entire tree are processed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]