[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-08-02 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1689#comment-1689
 ] 

Marcus Eriksson commented on CASSANDRA-15202:
-

+1

> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-08-02 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898884#comment-16898884
 ] 

Aleksey Yeschenko commented on CASSANDRA-15202:
---

bq. One micro-nit: It would be nice to statically import the new Difference 
enum values, so they can be used without the qualifying {{Difference.}} prefix

Done

bq. we also need to release the trees after calculating the differences in 
DifferenceHolder

Done, thanks, good catch. This patch was originally 3.0-based, so when 
future-porting to 4.0 I missed that new code path.


> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-08-02 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898705#comment-16898705
 ] 

Marcus Eriksson commented on CASSANDRA-15202:
-

we also need to release the trees after calculating the differences in 
{{DifferenceHolder}}

> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-08-01 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898182#comment-16898182
 ] 

Benedict commented on CASSANDRA-15202:
--

One micro-nit: It would be nice to statically import the new {{Difference}} 
enum values, so they can be used without the qualifying {[Difference.}} prefix


> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-07-29 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895071#comment-16895071
 ] 

Marcus Eriksson commented on CASSANDRA-15202:
-

* I think we need a null check 
[here|https://github.com/iamaleksey/cassandra/compare/51a29f21cf3f12dde0a33f4ff6d5e9ca547d6c18..15202-4.0#diff-e657fd15ed537a2bf54a672b6b84afecR210]
 now that {{find(..)}} can return {{null}} (like in 
[trunk|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/MerkleTree.java#L247])

nit:
* we don't use SHA-256 for hashes anymore (we use 2 instances of murmur3_128 
with different seeds to get the same size), the comments on {{HASH_SIZE}} and 
{{byteArray}} in {{MerkleTree.java}} should be updated

> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-07-24 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891761#comment-16891761
 ] 

Aleksey Yeschenko commented on CASSANDRA-15202:
---

Cheers. Addressed most in a separate commit, with a few exceptions:

bq. Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?

Don't want to change the protocol in any way in this patch - just internal 
cleanup and efficiency. And make it trivially cherry-pickable for 3.0 without 
breaking compatibility in-between minors - for those who would want this 
improvement in their 3.0-based branches.

bq. {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?

Why? {{bitLength() / 8 + 1}} is taken verbatim from {{BigInteger#toByteArray()}}

bq. {{prefer_offheap_merkle_trees}} - why prefer?

Primarily to decouple from the actual partitioner setting, as we don't support 
off-heap representation for at least BOP.

If all else LGTY, will commit once I've beefed up test coverage a little.

> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-07-15 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885370#comment-16885370
 ] 

Benedict commented on CASSANDRA-15202:
--

LGTM.  Some minor suggestions:


* Use {{ByteOrder.LITTLE_ENDIAN}} for off heap?
* {{RandomPartitioner.MAXIMUM_TOKEN_SIZE}}: use {{(bitLength + 7) / 8}}?
* {{differenceHelper}}: replace int return values with explicit Enum?
* {{OffHeapNode.attachRef}}: use {{null}} referent?  Should probably (perhaps 
another time) introduce a special variant for these pure leak-tracking use 
cases.
* {{OnHeapLeaf.deserialize}}: verify {{in.readByte() == HASH_SIZE}}?
* Extract shared method {{OffHeapInner.child}} from {{left}} and {{right}}?
* Extract shared method for off heap disabled warning?
* {{prefer_offheap_merkle_trees}} - why prefer?
* {{FBUtilities.xor}} and {{FBUtilities.xorOntoLeft}} - move to {{MerkleTree}}?
* Missing line break: {{MerkleTree.release}}, {{OffHeapInner.maxOffHeapSize}}?
* Rename test-only methods to {{unsafeMethodName}} and disappear to bottom of 
class?
* "TODO: reset computed flag on OnHeapInners" - simply ignore/remove TODO if 
renamed to {{unsafeInvalidateHelper}}?


> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15202) Deserialize merkle trees off-heap

2019-07-05 Thread Jeff Jirsa (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879413#comment-16879413
 ] 

Jeff Jirsa commented on CASSANDRA-15202:


Perf testing of this patch, using JMX toggling to enable/disable, resulted in 
the following GC graph:

 

!offheap-mts-gc.png!

 

This is repair of a 12 instance cluster with 100 tables running in a loop. 
Starting at 04/08@~1330, the old style repair was run. In the afternoon of 
04/09, the prop was changed to use the offheap merkle trees, and the result is 
pretty clear: parnew collections drop from ~3s to ~300ms, and olg gen 
collections nearly completely disappear. 

> Deserialize merkle trees off-heap
> -
>
> Key: CASSANDRA-15202
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15202
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Jeff Jirsa
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 4.0
>
> Attachments: offheap-mts-gc.png
>
>
> CASSANDRA-14096 made the first step to address the heavy on-heap footprint of 
> merkle trees on repair coordinators - by reducing the time frame over which 
> they are referenced, and by more intelligently limiting depth of the trees 
> based on available heap size.
> That alone improves GC profile and prevents OOMs, but doesn’t address the 
> issue entirely. The coordinator still must hold all the trees on heap at once 
> until it’s done diffing them with each other, which has a negative effect, 
> and, by reducing depth, we lose precision and thus cause more overstreaming 
> than before.
> One way to improve the situation further is to build on CASSANDRA-14096 and 
> move the trees entirely off-heap. This is a trivial endeavor, given that we 
> are dealing with what should be full binary trees (though in practice aren’t 
> quite, yet). This JIRA makes the first step towards there - by moving just 
> deserialisation off-heap, leaving construction on the replicas on-heap still.
> Additionally, the proposed patch fixes the issue of replica coordinators 
> sending merkle trees to itself over loopback, costing us a ser/deser loop per 
> tree.
> Please note that there is more room for improvement here, and depending on 
> 4.0 timeline those improvements may or may not land in time. To name a few:
> - with some minor modifications to init(), we can make sure that no matter 
> the range, the tree is *always* perfectly full; this would allow us to get 
> rid of child pointers in inner nodes, as child node addresses will be 
> trivially calculatable given fixed size of nodes
> - the trees can be easily constructed off-heap so long as you run init() to 
> pre-size the tree to find out how large a buffer you need
> - on-wire format doesn’t need to stream inner nodes, only leaves, and, 
> really, only the hashes of the leaves



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org