[ https://issues.apache.org/jira/browse/ACCUMULO-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041141#comment-14041141 ]
Jonathan Park commented on ACCUMULO-2827:
-----------------------------------------

Results of performance tests (I also re-ran [~kturner]'s cases for comparison):

pq => PriorityQueue
pb => PriorityBuffer

||files||rows per file||cols per row||rate w/o patch||rate w/ patch pq||pq speedup||rate w/ patch pb||pb speedup||
|10|1,000,000|1|449,355|418,128|0.93|433,477|0.965|
|10|100,000|10|598,155|678,698|1.135|667,465|1.116|
|10|10,000|100|641,190|739,809|1.154|729,501|1.138|
|20|500,000|1|405,915|400,614|0.987|405,571|0.999|
|20|50,000|10|551,997|659,932|1.196|643,250|1.165|
|1|10,000,000|1|506,719|483,178|0.954|517,362|1.021|

Not entirely sure why PriorityQueue is performing worse than PriorityBuffer in the worst cases (high interleaving). Might just be noise in the tests?

> HeapIterator optimization
> -------------------------
>
>                 Key: ACCUMULO-2827
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2827
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Jonathan Park
>            Assignee: Jonathan Park
>            Priority: Minor
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>         Attachments: ACCUMULO-2827-compaction-performance-test.patch, ACCUMULO-2827.0.patch.txt, ACCUMULO-2827.1.patch.txt, ACCUMULO-2827.2.patch.txt, accumulo-2827.raw_data, new_heapiter.png, old_heapiter.png, together.png
>
>
> We've been running a few performance tests of our iterator stack and noticed a decent amount of time spent in the HeapIterator, specifically related to add/removal into the heap.
> This may not be a general enough optimization, but we thought we'd see what people thought. Our assumption is that it's more probable that the current "top iterator" will supply the next value in the iteration than not. The current implementation takes the opposite assumption by always removing and then re-inserting the minimum iterator into the heap.
> With the binary heap implementation that we're using, this can get costly if our assumption is wrong, because we pay the log penalty of percolating the iterator up the heap on insertion and again percolating it down on removal.
> We believe our assumption is a fair one: since major compactions create a log distribution of file sizes, it's likely that we may see a long chain of consecutive entries coming from one iterator.
> Understandably, taking this assumption comes at an additional cost in the cases where we're wrong. We've therefore run a few benchmarking tests to measure both how much of a cost we pay and what kind of benefit we see. I've attached a potential patch (which includes a test harness) plus an image that captures the results of our tests. The x-axis represents the number of repeated keys before switching to another iterator; the y-axis represents iteration time. The sets of blue and red lines vary in the number of iterators present in the heap.

-- This message was sent by Atlassian JIRA (v6.2#6252)
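To make the optimization concrete, here is a minimal, hypothetical Java sketch of the peek-before-reinsert idea over plain integer iterators. This is not Accumulo's actual HeapIterator (which merges over SortedKeyValueIterators); the `MergeHeap` and `Source` names are invented for illustration. The key line is the comparison against `heap.peek()`: the heap is only touched (O(log n) add + poll) when the current top source actually loses its lead, so a long run of consecutive entries from one source costs only an O(1) peek per entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the ACCUMULO-2827 optimization: keep the current
// minimum source OUT of the heap, and only reinsert it when the heap's new
// minimum beats the source's next value.
class MergeHeap {
    // A sorted input with a peekable head element (hypothetical helper).
    static class Source {
        final Iterator<Integer> it;
        Integer head; // null once exhausted
        Source(Iterator<Integer> it) { this.it = it; advance(); }
        void advance() { head = it.hasNext() ? it.next() : null; }
    }

    private final PriorityQueue<Source> heap =
            new PriorityQueue<>(Comparator.comparing((Source s) -> s.head));
    private Source top; // current minimum, held outside the heap

    MergeHeap(List<Iterator<Integer>> sources) {
        for (Iterator<Integer> it : sources) {
            Source s = new Source(it);
            if (s.head != null) heap.add(s);
        }
        top = heap.poll();
    }

    boolean hasNext() { return top != null; }

    int next() {
        int value = top.head;
        top.advance();
        Source second = heap.peek();
        // Only pay the log-cost heap operations if the top source lost its
        // lead (or ran out); otherwise the next call stays O(1).
        if (top.head == null || (second != null && second.head < top.head)) {
            if (top.head != null) heap.add(top);
            top = heap.poll();
        }
        return value;
    }
}
```

Under the naive scheme, every `next()` pays a poll plus an add; under this sketch, the worst case (perfectly interleaved sources, as in the 1-column rows of the table above) adds a wasted peek on top of those, which is consistent with the small slowdowns measured there.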