[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1689#comment-1689 ]

Marcus Eriksson commented on CASSANDRA-15202:
---------------------------------------------

+1

> Deserialize merkle trees off-heap
> ---------------------------------
>
>                 Key: CASSANDRA-15202
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: Jeff Jirsa
>            Assignee: Aleksey Yeschenko
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of
> merkle trees on repair coordinators - by reducing the time frame over which
> they are referenced, and by more intelligently limiting depth of the trees
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the
> issue entirely. The coordinator still must hold all the trees on heap at once
> until it’s done diffing them with each other, which has a negative effect,
> and, by reducing depth, we lose precision and thus cause more overstreaming
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and
> move the trees entirely off-heap. This is a trivial endeavor, given that we
> are dealing with what should be full binary trees (though in practice aren’t
> quite, yet). This JIRA makes the first step towards there - by moving just
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators
> sending merkle trees to itself over loopback, costing us a ser/deser loop per
> tree.
> Please note that there is more room for improvement here, and depending on
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter
> the range, the tree is *always* perfectly full; this would allow us to get
> rid of child pointers in inner nodes, as child node addresses will be
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and,
> really, only the hashes of the leaves

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
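The first follow-up idea in the issue description (a perfectly full tree with fixed-size nodes needs no child pointers) can be illustrated with a rough sketch. This is not the patch's layout: the class name, the 32-byte {{NODE_SIZE}} and the breadth-first ordering are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: nodes of a perfectly full binary tree stored
// breadth-first in one flat buffer, so children are found by arithmetic.
final class ImplicitTreeLayout
{
    // Assumed fixed size of every node in bytes (illustrative only).
    static final int NODE_SIZE = 32;

    // A perfectly full tree of the given depth has 2^(depth+1) - 1 nodes.
    static int nodeCount(int depth) { return (1 << (depth + 1)) - 1; }

    // Breadth-first numbering: root is 0, children of i are 2i+1 and 2i+2.
    static int leftChild(int index)  { return 2 * index + 1; }
    static int rightChild(int index) { return 2 * index + 2; }

    // Byte offset of a node inside the buffer; no stored pointer needed.
    static int offsetOf(int index) { return index * NODE_SIZE; }

    // The last level starts at index 2^depth - 1.
    static boolean isLeaf(int index, int depth) { return index >= (1 << depth) - 1; }

    public static void main(String[] args)
    {
        int depth = 3;
        // Knowing the depth up front is enough to pre-size the buffer.
        ByteBuffer buffer = ByteBuffer.allocateDirect(nodeCount(depth) * NODE_SIZE);

        int node = rightChild(leftChild(0)); // root -> left -> right
        buffer.put(offsetOf(node), (byte) 1); // absolute write into that node's slot

        System.out.println("nodes=" + nodeCount(depth)
                           + " bytes=" + buffer.capacity()
                           + " node=" + node
                           + " offset=" + offsetOf(node));
    }
}
```

The same arithmetic is what makes the second idea plausible: once the depth is known, the buffer size follows directly, so the tree could be built in place.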
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898884#comment-16898884 ]

Aleksey Yeschenko commented on CASSANDRA-15202:
-----------------------------------------------

bq. One micro-nit: It would be nice to statically import the new Difference enum values, so they can be used without the qualifying {{Difference.}} prefix

Done

bq. we also need to release the trees after calculating the differences in DifferenceHolder

Done, thanks, good catch. This patch was originally 3.0-based, so when future-porting to 4.0 I missed that new code path.
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898705#comment-16898705 ]

Marcus Eriksson commented on CASSANDRA-15202:
---------------------------------------------

we also need to release the trees after calculating the differences in {{DifferenceHolder}}
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898182#comment-16898182 ]

Benedict commented on CASSANDRA-15202:
--------------------------------------

One micro-nit: It would be nice to statically import the new {{Difference}} enum values, so they can be used without the qualifying {{Difference.}} prefix
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895071#comment-16895071 ]

Marcus Eriksson commented on CASSANDRA-15202:
---------------------------------------------

* I think we need a null check [here|https://github.com/iamaleksey/cassandra/compare/51a29f21cf3f12dde0a33f4ff6d5e9ca547d6c18..15202-4.0#diff-e657fd15ed537a2bf54a672b6b84afecR210] now that {{find(..)}} can return {{null}} (like in [trunk|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/MerkleTree.java#L247])

nit:
* we don't use SHA-256 for hashes anymore (we use 2 instances of murmur3_128 with different seeds to get the same size), the comments on {{HASH_SIZE}} and {{byteArray}} in {{MerkleTree.java}} should be updated
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891761#comment-16891761 ]

Aleksey Yeschenko commented on CASSANDRA-15202:
-----------------------------------------------

Cheers. Addressed most in a separate commit, with a few exceptions:

bq. Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?

Don't want to change the protocol in any way in this patch - just internal cleanup and efficiency. And make it trivially cherry-pickable for 3.0 without breaking compatibility in-between minors - for those who would want this improvement in their 3.0-based branches.

bq. {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?

Why? {{bitLength() / 8 + 1}} is taken verbatim from {{BigInteger#toByteArray()}}

bq. {{prefer_offheap_merkle_trees}} - why prefer?

Primarily to decouple from the actual partitioner setting, as we don't support off-heap representation for at least BOP.

If all else LGTY, will commit once I've beefed up test coverage a little.
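The two byte-length formulas discussed in this comment can be compared directly. The sketch below assumes a maximum token of 2^127 (an assumption about {{RandomPartitioner}}, not something stated in this thread); the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical comparison of the two candidate formulas for the number of
// bytes needed to hold a BigInteger token.
final class TokenByteLength
{
    // Formula quoted in the comment; it leaves room for the sign bit of the
    // two's-complement encoding.
    static int withSignBit(BigInteger value) { return value.bitLength() / 8 + 1; }

    // Suggested alternative; it rounds the magnitude bits up to whole bytes
    // and does not reserve a sign bit.
    static int magnitudeOnly(BigInteger value) { return (value.bitLength() + 7) / 8; }

    public static void main(String[] args)
    {
        // Assumed maximum token value, for illustration only.
        BigInteger max = BigInteger.valueOf(2).pow(127);

        BigInteger[] samples = { BigInteger.valueOf(127), BigInteger.valueOf(255), max };
        for (BigInteger v : samples)
        {
            System.out.printf("bitLength=%d toByteArray=%d withSignBit=%d magnitudeOnly=%d%n",
                              v.bitLength(), v.toByteArray().length,
                              withSignBit(v), magnitudeOnly(v));
        }
    }
}
```

Whenever the bit length is a multiple of 8, the two formulas differ by one byte, and only the first matches the length of {{toByteArray()}}. That is relevant if the serialized form is produced by {{toByteArray()}}.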
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885370#comment-16885370 ]

Benedict commented on CASSANDRA-15202:
--------------------------------------

LGTM. Some minor suggestions:

* Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?
* {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?
* {{differenceHelper}}: replace int return values with explicit Enum?
* {{OffHeapNode.attachRef}}: use {{null}} referent? Should probably (perhaps another time) introduce a special variant for these pure leak-tracking use cases.
* {{OnHeapLeaf.deserialize}}: verify {{in.readByte() == HASH_SIZE}}?
* Extract shared method {{OffHeapInner.child}} from {{left}} and {{right}}?
* Extract shared method for off heap disabled warning?
* {{prefer_offheap_merkle_trees}} - why prefer?
* {{FBUtilities.xor}} and {{FBUtilities.xorOntoLeft}} - move to {{MerkleTree}}?
* Missing line break: {{MerkleTree.release}}, {{OffHeapInner.maxOffHeapSize}}?
* Rename test-only methods to {{unsafeMethodName}} and disappear to bottom of class?
* "TODO: reset computed flag on OnHeapInners" - simply ignore/remove TODO if renamed to {{unsafeInvalidateHelper}}?
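The suggestion to replace int return values with an explicit enum could look roughly like the sketch below. The enum constants, the {{combine}} helper and its boolean inputs are hypothetical; the thread does not show the real {{differenceHelper}} signature or the actual values of {{Difference}}.

```java
// Hypothetical sketch of returning an enum instead of magic int codes.
final class DifferenceSketch
{
    // Assumed set of outcomes; the real enum may differ.
    enum Difference { CONSISTENT, FULLY_INCONSISTENT, PARTIALLY_INCONSISTENT }

    // Illustrative helper: classify a range from whether each half differs.
    static Difference combine(boolean leftDiffers, boolean rightDiffers)
    {
        if (!leftDiffers && !rightDiffers)
            return Difference.CONSISTENT;
        if (leftDiffers && rightDiffers)
            return Difference.FULLY_INCONSISTENT;
        return Difference.PARTIALLY_INCONSISTENT;
    }

    // Unlike an int code, a switch over the enum names every case explicitly.
    static String describe(Difference difference)
    {
        switch (difference)
        {
            case CONSISTENT:             return "nothing to stream";
            case FULLY_INCONSISTENT:     return "stream the whole range";
            case PARTIALLY_INCONSISTENT: return "stream only the differing half";
            default:                     throw new AssertionError(difference);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(describe(combine(true, false)));
    }
}
```

With a static import of the enum constants, call sites could drop the {{Difference.}} qualifier, which is the follow-up nit raised later in the review.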
[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879413#comment-16879413 ]

Jeff Jirsa commented on CASSANDRA-15202:
----------------------------------------

Perf testing of this patch, using JMX toggling to enable/disable, resulted in the following GC graph:

!offheap-mts-gc.png!

This is repair of a 12 instance cluster with 100 tables running in a loop. Starting at 04/08@~1330, the old style repair was run. In the afternoon of 04/09, the property was changed to use the offheap merkle trees, and the result is pretty clear: ParNew collections drop from ~3s to ~300ms, and old gen collections nearly completely disappear.