[jira] [Commented] (CASSANDRA-15392) Pool Merge Iterators
[ https://issues.apache.org/jira/browse/CASSANDRA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022492#comment-17022492 ]

Blake Eggleston commented on CASSANDRA-15392:
---------------------------------------------

[~jrwest] sure, that would be great. Tag me for review if you'd like.

> Pool Merge Iterators
> --------------------
>
>                 Key: CASSANDRA-15392
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15392
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Local/Compaction
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Normal
>             Fix For: 4.0
>
> By pooling merge iterators, instead of creating new ones each time we need
> them, we can reduce garbage on the compaction and read paths under relevant
> workloads by ~4% in many cases.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15392) Pool Merge Iterators
[ https://issues.apache.org/jira/browse/CASSANDRA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022051#comment-17022051 ]

Jordan West commented on CASSANDRA-15392:
-----------------------------------------

{quote}
As this relates to SASI, this wouldn't be a regression right? I'd guess we're already wasting TrivialOneToOne iterators?
{quote}

Correct. This wouldn't break SASI, but it would waste TrivialOneToOnes most of the time and, rarely, a ManyToOne that is oversized for its needs. I haven't looked at FQL, but for SASI I think it's easy enough to just skip using MergeIterators when they aren't needed / skip the pool when they are -- happy to submit a patch if it's helpful.
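The "skip using MergeIterators when they aren't needed" idea can be illustrated with a toy sketch: with a single input there is nothing to merge, so the input is returned directly instead of being wrapped (and nothing is borrowed from any pool). {{Merge}} and {{kWayMerge}} are hypothetical names, and the heap-based merge below is only a stand-in for ManyToOne; Cassandra's real MergeIterator/Reducer machinery is considerably more involved.

```java
import java.util.AbstractMap;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class Merge
{
    static <T extends Comparable<T>> Iterator<T> get(List<Iterator<T>> inputs)
    {
        if (inputs.size() == 1)
            return inputs.get(0); // pass through: no wrapper allocated, no pool traffic
        return kWayMerge(inputs);
    }

    // Minimal heap-based k-way merge of sorted inputs, standing in for ManyToOne.
    private static <T extends Comparable<T>> Iterator<T> kWayMerge(List<Iterator<T>> inputs)
    {
        Comparator<Map.Entry<T, Iterator<T>>> byKey = (a, b) -> a.getKey().compareTo(b.getKey());
        PriorityQueue<Map.Entry<T, Iterator<T>>> heap = new PriorityQueue<>(byKey);
        for (Iterator<T> in : inputs)
            if (in.hasNext())
                heap.add(new AbstractMap.SimpleEntry<>(in.next(), in));
        return new Iterator<T>()
        {
            public boolean hasNext() { return !heap.isEmpty(); }
            public T next()
            {
                // Pop the smallest head, then refill the heap from its source iterator.
                Map.Entry<T, Iterator<T>> e = heap.poll();
                Iterator<T> src = e.getValue();
                if (src.hasNext())
                    heap.add(new AbstractMap.SimpleEntry<>(src.next(), src));
                return e.getKey();
            }
        };
    }
}
```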
[jira] [Commented] (CASSANDRA-15392) Pool Merge Iterators
[ https://issues.apache.org/jira/browse/CASSANDRA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021552#comment-17021552 ]

Benedict Elliott Smith commented on CASSANDRA-15392:
----------------------------------------------------

bq. In cases where there are that many sstables to merge the node is usually also under significant gc pressure, so it would not be a great time to abandon this optimization.

Perhaps the size limit should be larger, and an intrusive linked-list stack is certainly a good data structure to use here. But we probably want to impose _some_ coarse size limit? You make a very good point that three is much too few in this case. But otherwise we have the problem that infrequent storms of sstable creation slowly cause threads to accumulate "huge" pools - and we can have a lot of these threads, since read parallelism is gated by the number of threads. It's far from crazy to have a system with > 200 threads, and if each of them managed to hold on to 1k iterators at a time (noting we need iterators for the partitions, their rows, and any complex columns, so this seems readily achievable), we'd be looking at 50-100MiB+ of heap utilised by these pools.
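One reading of the "coarse size limit" suggestion is a per-thread pool that simply refuses to grow past a cap, so a burst of wide merges cannot leave hundreds of threads each pinning a large pool forever. This is only a sketch of that shape; the class name and the cap of 64 below are illustrative assumptions, not values from the patch.

```java
import java.util.ArrayDeque;

final class BoundedThreadLocalPool<T>
{
    private static final int MAX_POOLED = 64; // coarse per-thread cap; illustrative value

    private final ThreadLocal<ArrayDeque<T>> pools =
        ThreadLocal.withInitial(ArrayDeque::new);

    /** Returns a pooled instance, or null if the caller must allocate a fresh one. */
    T poll()
    {
        return pools.get().pollFirst();
    }

    /** Offers an instance back; silently drops it once the cap is reached. */
    void offer(T value)
    {
        ArrayDeque<T> pool = pools.get();
        if (pool.size() < MAX_POOLED)
            pool.addFirst(value);
        // else: let the GC reclaim it rather than grow the pool without bound
    }
}
```

Dropping excess returns bounds the worst case at roughly threads × MAX_POOLED retained objects, at the cost of occasionally re-allocating after a storm subsides.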
[jira] [Commented] (CASSANDRA-15392) Pool Merge Iterators
[ https://issues.apache.org/jira/browse/CASSANDRA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021455#comment-17021455 ]

Blake Eggleston commented on CASSANDRA-15392:
---------------------------------------------

Thanks for the feedback. I just pushed up some commits addressing most points.

Getting rid of linked list queues is a good idea, but I don't think TinyThreadLocalPool will be sufficient in some cases. If {{MergeIterator#get}} is called with more than {{DEFAULT_SIZE}} iterators, we'll nest pooled iterators. In cases where there _are_ that many sstables to merge, the node is usually also under significant gc pressure, so it would not be a great time to abandon this optimization. An intrusive linked stack is the best I could come up with. It's unbounded, doesn't do any allocations, and has very simple logic, which was important for preventing oneToOne performance from regressing too much.

@Jordan, short-circuiting the merge iterator in that case is a really good idea. However, FQL will need a bit of (probably straightforward) rework before {{MergeIterator#get}} can return a closeable iterator, since it relies on having separate input and output types. As this relates to SASI, this wouldn't be a regression, right? I'd guess we're already wasting TrivialOneToOne iterators?

I've also put together a very simple jmh benchmark. After a few iterations of additional optimization, the manyToOne path saw a 15% increase in throughput in most cases, and the oneToOne path saw a 2-6% reduction in most cases.
Numbers below:

{code:java}
Baseline:
Benchmark                                          Mode        Cnt         Score    Error  Units
MergeIteratorGetBench.manyToOne                    sample  59417991       418.721 ± 22.706  ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.00    sample                 101.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.50    sample                 148.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.90    sample                 171.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.95    sample                 176.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.99    sample                 188.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.999   sample                1456.000           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.9999  sample                9171.213           ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p1.00    sample            38862848.000           ns/op
MergeIteratorGetBench.one                          sample  53514991        83.452 ±  4.005  ns/op
MergeIteratorGetBench.one:one·p0.00                sample                  47.000           ns/op
MergeIteratorGetBench.one:one·p0.50                sample                  73.000           ns/op
MergeIteratorGetBench.one:one·p0.90                sample                  90.000           ns/op
MergeIteratorGetBench.one:one·p0.95                sample                  95.000           ns/op
MergeIteratorGetBench.one:one·p0.99                sample                 103.000           ns/op
MergeIteratorGetBench.one:one·p0.999               sample                 119.000           ns/op
MergeIteratorGetBench.one:one·p0.9999              sample                3830.003           ns/op
MergeIteratorGetBench.one:one·p1.00                sample            21987328.000           ns/op

Pooled:
Benchmark                                          Mode         Cnt        Score   Error  Units
MergeIteratorGetBench.manyToOne                    sample  100352304      133.534 ± 1.662  ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.00    sample                  86.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.50    sample                 128.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.90    sample                 143.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.95    sample                 148.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.99    sample                 157.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.999   sample                 172.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p0.9999  sample                3744.000          ns/op
MergeIteratorGetBench.manyToOne:manyToOne·p1.00    sample            13008896.000          ns/op
MergeIteratorGetBench.one                          sample   97007239       87.340 ± 3.063  ns/op
MergeIteratorGetBench.one:one·p0.00                sample                  51.000          ns/op
MergeIteratorGetBench.one:one·p0.50                sample                  78.000          ns/op
MergeIteratorGetBench.one:one·p0.90                sample                  92.000          ns/op
MergeIteratorGetBench.one:one·p0.95                sample                  97
{code}
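The intrusive linked stack described above (unbounded, allocation-free, very simple logic) can be sketched as follows. The names {{PooledNode}} and {{IntrusiveStack}} are illustrative, not the classes in the patch: the point is that each pooled object carries its own next link, so push and pop allocate nothing and the stack has no fixed capacity.

```java
// A pooled object embeds its own link; only meaningful while it sits in the pool.
class PooledNode
{
    PooledNode next;
}

final class IntrusiveStack
{
    private PooledNode head;

    // O(1), no allocation: the node itself is the list cell.
    void push(PooledNode node)
    {
        node.next = head;
        head = node;
    }

    // Returns the most recently pushed node, or null when the pool is empty.
    PooledNode pop()
    {
        PooledNode node = head;
        if (node != null)
        {
            head = node.next;
            node.next = null; // clear the link so a live node doesn't retain the pool
        }
        return node;
    }
}
```

LIFO order also helps locality: the node handed out is the one most recently returned, so its memory is most likely still cache-warm.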
[jira] [Commented] (CASSANDRA-15392) Pool Merge Iterators
[ https://issues.apache.org/jira/browse/CASSANDRA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010853#comment-17010853 ]

Benedict Elliott Smith commented on CASSANDRA-15392:
----------------------------------------------------

MergeIteratorPool:
* As Jordan suggests, storing the pool inside the {{MergeIterator}} makes sense - more so because it can avoid incurring the cost of {{ThreadLocal.get()}} on {{returnToPool}}, by stashing a link to an inner object (that guards put on the identity of the {{Thread}} owning the queue)
* Perhaps an {{ArrayList}} is a better collection type, as it will be generally garbage-free and occupy less space than a {{LinkedList}}? An {{ArrayDeque}} could maintain the current behaviour, or you could pop from the back
* Perhaps the queue would be better with a fixed size, permitting at most two or three items to be stored?
* As it happens, I've been working on very similar parts of the codebase, and I have implemented my own {{TinyThreadLocalPool}} in the patches I hope to land soon, which does the above but has lower fixed overhead and less indirection by permitting at most 3 items, stored in in-line fields (i.e. {{val0}}, {{val1}} and {{val2}}). It might be worthwhile introducing that class here for consideration/comparison, and perhaps some amalgamation of the two used in whichever patch lands first (which I fully expect to be this one)?

{{MergeIterator.OneToOne}}:
* While we're here optimising, it probably makes sense to merge {{OneToOne}} and {{TrivialOneToOne}}, simply permitting the reducer in {{OneToOne}} to be {{null}} and switching behaviour based on a null check. This should permit call-sites that are presently megamorphic to become bimorphic, and the branch should be near perfectly predicted.

{{MergeIterator.ManyToOne}}:
* Instead of using a separate {{candidates}} array for reusing a {{Candidate}}, could we avoid setting the last heap element to null by instead placing the now-defunct item there?
* Do we need {{active}} when it is implied by {{MergeIterator.iterators != null}} and {{Candidate.iter != null}}?
* Do we need to set primitives to {{-1}} or {{false}} on {{close}}, given we will throw {{NullPointerException}} if we haven't invoked {{reset}}?

Benchmarking:
* It's annoying work, but alongside the garbage benchmarks it would be nice to see some simple JMH throughput benchmarks introduced, to see what the overall impact of the change is. I don't mind helping out here.

nits:
* {{ManyToOne.DEFAULT_SIZE}} -> {{ManyToOne.POOLED_MERGE_LIMIT}}?
* +1 Jordan's nit about the naming of {{getSimple}}
* {{MergeIteratorPool.queues.initialValue}} should not declare {{throws Exception}}
* {{MergeIteratorPool.initialValue}} looks to be a duplicate
* {{TokenTree:380}} errant new line
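The three-slot pool shape described in the first bullet list above might look roughly like this. {{TinyPool}} is a hypothetical stand-in: it shows only the "at most 3 items held in in-line fields, no backing collection" idea, and omits the thread-local dimension that the real {{TinyThreadLocalPool}} would carry.

```java
final class TinyPool<T>
{
    private T val0, val1, val2; // in-line slots: no collection, no per-entry node allocation

    /** Takes an item from the first occupied slot, or returns null if the pool is empty. */
    T poll()
    {
        T v;
        if ((v = val0) != null) { val0 = null; return v; }
        if ((v = val1) != null) { val1 = null; return v; }
        if ((v = val2) != null) { val2 = null; return v; }
        return null;
    }

    /** Stores the item in the first free slot; returns false (caller drops it) when full. */
    boolean offer(T value)
    {
        if (val0 == null) { val0 = value; return true; }
        if (val1 == null) { val1 = value; return true; }
        if (val2 == null) { val2 = value; return true; }
        return false; // pool full: the instance is left for the GC
    }
}
```

The trade-off against the intrusive stack discussed earlier is the one raised in this thread: the fixed three-slot bound keeps heap retention trivially small, but forces allocation (or nesting) whenever a merge needs more pooled iterators than the slots can hold.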