[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891761#comment-16891761 ]
Aleksey Yeschenko commented on CASSANDRA-15202:
-----------------------------------------------

Cheers. Addressed most in a separate commit, with a few exceptions:

bq. Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?

Don't want to change the protocol in any way in this patch - just internal cleanup and efficiency. And make it trivially cherry-pickable for 3.0 without breaking compatibility in-between minors - for those who would want this improvement in their 3.0-based branches.

bq. {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?

Why? {{bitLength() / 8 + 1}} is taken verbatim from {{BigInteger#toByteArray()}}

bq. {{prefer_offheap_merkle_trees}} - why prefer?

Primarily to decouple from the actual partitioner setting, as we don't support off-heap representation for at least BOP.

If all else LGTY, will commit once I've beefed up test coverage a little.

> Deserialize merkle trees off-heap
> ---------------------------------
>
>                 Key: CASSANDRA-15202
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: Jeff Jirsa
>            Assignee: Aleksey Yeschenko
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of
> merkle trees on repair coordinators - by reducing the time frame over which
> they are referenced, and by more intelligently limiting depth of the trees
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the
> issue entirely. The coordinator still must hold all the trees on heap at once
> until it’s done diffing them with each other, which has a negative effect,
> and, by reducing depth, we lose precision and thus cause more overstreaming
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and
> move the trees entirely off-heap.
> This is a trivial endeavor, given that we
> are dealing with what should be full binary trees (though in practice aren’t
> quite, yet). This JIRA takes the first step in that direction - by moving just
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators
> sending merkle trees to themselves over loopback, costing us a ser/deser loop
> per tree.
> Please note that there is more room for improvement here, and depending on
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter
> the range, the tree is *always* perfectly full; this would allow us to get
> rid of child pointers in inner nodes, as child node addresses will be
> trivially calculable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and,
> really, only the hashes of the leaves
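
On the {{MAXIMUM_TOKEN_SIZE}} question in the comment: the two formulas only disagree when {{bitLength()}} is an exact multiple of 8, because {{BigInteger#toByteArray()}} always reserves room for a sign bit. The sketch below compares them for a value of 2^127. That value is an assumption made here for illustration, not something stated in the thread, and the class and method names are hypothetical.

```java
import java.math.BigInteger;

public class TokenSizeSketch {
    // Formula quoted in the comment: always leaves room for the sign bit.
    static int signedSize(BigInteger v) {
        return v.bitLength() / 8 + 1;
    }

    // Formula suggested in review: rounds the magnitude bits up to whole bytes.
    static int roundedSize(BigInteger v) {
        return (v.bitLength() + 7) / 8;
    }

    public static void main(String[] args) {
        // Assumption: the largest token is 2^127, so bitLength() is 128.
        BigInteger max = BigInteger.valueOf(2).pow(127);
        System.out.println(max.toByteArray().length); // 17
        System.out.println(signedSize(max));          // 17
        System.out.println(roundedSize(max));         // 16
    }
}
```

Under that assumption, a buffer sized with the rounded formula would be one byte short of what {{toByteArray()}} actually produces.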
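
The first bullet in the description (dropping child pointers once the tree is perfectly full) can be sketched as plain index arithmetic over a direct buffer. This is only an illustration under stated assumptions: a breadth-first node layout and a 32-byte fixed node size are choices made here, not taken from the patch, and every name below is hypothetical.

```java
import java.nio.ByteBuffer;

public class ImplicitTreeSketch {
    // Hypothetical fixed node size in bytes (e.g. one hash per node).
    static final int NODE_SIZE = 32;

    // Number of nodes in a perfect binary tree whose leaves sit at the given depth.
    static int nodeCount(int depth) {
        return (1 << (depth + 1)) - 1;
    }

    // Breadth-first layout: children are found by arithmetic, no stored pointers.
    static int leftChild(int index) {
        return 2 * index + 1;
    }

    static int rightChild(int index) {
        return 2 * index + 2;
    }

    // Byte offset of a node inside the buffer.
    static int offset(int index) {
        return index * NODE_SIZE;
    }

    public static void main(String[] args) {
        int depth = 3;
        // The node count is known up front, so the off-heap buffer can be pre-sized.
        ByteBuffer tree = ByteBuffer.allocateDirect(nodeCount(depth) * NODE_SIZE);
        int right = rightChild(0);
        tree.put(offset(right), (byte) 42);            // write the first byte of the root's right child
        System.out.println(tree.get(offset(right)));   // 42
    }
}
```

Because the node count depends only on depth, the same arithmetic also covers the second bullet: the buffer size is known before any node is written.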