[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-05-06 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532963#comment-17532963
 ] 

Benjamin Lerer commented on CASSANDRA-15510:


It turned out that I did not merged properly the patch in trunk :-(
It is fixed now. Patch committed into 4.1 at 
596daeb7f08e14d69af90fd4f07b9e87f2816681 and merged into trunk.

 (Note for myself: do not merge things on Friday evening before going in 
holidays).

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.1, 4.0.5
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> 

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-04-22 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526543#comment-17526543
 ] 

Benjamin Lerer commented on CASSANDRA-15510:


CI runs for 
[4.0|https://app.circleci.com/pipelines/github/blerer/cassandra/284/workflows/76db84a8-a6a1-4364-85ce-72ed5f12081f]
 and 
[trunk|https://app.circleci.com/pipelines/github/blerer/cassandra/287/workflows/f9ad1572-460f-4b99-bfbd-ac3edaac61da]

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionali

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-04-21 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525645#comment-17525645
 ] 

Benjamin Lerer commented on CASSANDRA-15510:


Sorry, I was focussing on porting CASSANDRA-15511.
I will merge the changes and run CI. If I remember correctly I think that the 
patch broke some tests.
We also need to run CI on Jenkins as I do not think that CircleCi can run the 
burn tests.

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of th

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-04-21 Thread Branimir Lambov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525569#comment-17525569
 ] 

Branimir Lambov commented on CASSANDRA-15510:
-

The code LGTM, but I can't push the latest changes from [my 
branch|https://github.com/blambov/cassandra/tree/CASSANDRA-15510-4.0] to 
Benjamin's pull request, or open a pull request against his branch.

[~blerer], could you copy the two last commits over so that we can run CI and 
merge this?

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing e

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-04-21 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525567#comment-17525567
 ] 

Benedict Elliott Smith commented on CASSANDRA-15510:


I think this is ready to merge?

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an efficient 
> builder that reuses as much of the original tree as possible. 
> We also introduce dedicated {{transform}} methods (that forbid returning

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-04-06 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518317#comment-17518317
 ] 

Benedict Elliott Smith commented on CASSANDRA-15510:


LGTM, left some minor feedback on the PR

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an efficient 
> builder that reuses as much of the original tree as possible. 
> We also introduce dedicated {{transform}} methods (that forbid r

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-03-22 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510512#comment-17510512
 ] 

Benedict Elliott Smith commented on CASSANDRA-15510:


FYI I'm on holiday for a couple of weeks, so I'm unlikely to find time to 
counter-review before April.

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an efficient 
> builder that reuses as much of the original tree as possible. 

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2022-02-19 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494985#comment-17494985
 ] 

Benjamin Lerer commented on CASSANDRA-15510:


[PR|https://github.com/apache/cassandra/pull/1462] for the 4.0 branch. The JMH 
tests still need to be ported
CI result can be found 
[here|https://app.circleci.com/pipelines/github/blerer/cassandra/267/workflows/99bae171-a3ea-4065-8825-4648de507741]


> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2021-12-01 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451892#comment-17451892
 ] 

Michael Semb Wever commented on CASSANDRA-15510:


The discussion raised on dev@ ML to include this also in 4.0.x can be read 
[here|https://lists.apache.org/thread/f3dl7rfc2kv9f5r9pxzyz6zojsss81b9].

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0.x, 4.x
>
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an efficient 
> builder that reuses as much of the original tree as possible. 
> We also introduc

[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming

2020-01-16 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017490#comment-17017490
 ] 

Benedict Elliott Smith commented on CASSANDRA-15510:


[3.0|http://github.com/belliottsmith/cassandra/tree/15510]

> BTree: Improve Building, Inserting and Transforming
> ---
>
> Key: CASSANDRA-15510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15510
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Other
>Reporter: Benedict Elliott Smith
>Assignee: Benedict Elliott Smith
>Priority: Normal
>
> This work was originally undertaken as a follow-up to CASSANDRA-15367 to 
> ensure performance is strictly improved, but it may no longer be needed for 
> that purpose.  It’s still hugely impactful, however.  It remains to be 
> decided where this should land.
> The current {{BTree}} implementation is suboptimal in a number of ways, with 
> very little focus having been given to its performance besides its 
> memory-occupancy.  This patch aims to address that, specifically improving 
> the performance and allocations involved in: building, transforming and 
> inserting into a tree.
> To facilitate this work, the {{BTree}} definition is modified slightly, so 
> that we can perform some simple arithmetic on tree sizes.  Specifically, 
> trees of depth n are defined to have a maximum capacity of {{branchFactor^n - 
> 1}}, which translates into capping the number of leaf children at 
> {{branchFactor-1}}, as opposed to {{branchFactor}}.  Since {{branchFactor}} 
> is a power of 2, this permits fast tree size arithmetic, enabling some of 
> these changes.
> h2. Building
> The static build method has been modified to utilise dedicated 
> {{buildPerfect}} methods that build either perfectly dense or perfectly 
> sparse sub-trees.  These perfect trees all share their {{sizeMap}} with each 
> other, and can be built more efficiently than trees of arbitrary size.  The 
> specifics are described in detail in the comments, but this building block 
> can be used to construct trees of any size, using at most one child at each 
> level that is not either perfectly sparse or perfectly dense.  Bulk methods 
> are used where possible.
> For large trees this can produce up to 30x throughput improvement and 30% 
> allocation reduction vs 3.0 (TBC, and to be tested vs 4.0).
> {{FastBuilder}} is introduced for building a tree in-order (or in reverse) 
> without duplicate elements to resolve, without necessarily knowing the size 
> upfront.  This meets the needs of most use cases.  Data is built directly 
> into nodes, with up to one already-constructed node, and one partially 
> constructed node, on each level, being mutated to share their contents in the 
> event of insufficient data to populate the tree.  These builders are 
> thread-locally shared.  These leads to minimal copying, the same sharing of 
> {{sizeMap}} as above, zero wasted allocations, and results in minimal 
> difference in performance between utilising the less-ergonomic static build 
> and builder approach.
> For large trees this leads to ~4.5x throughput improvement, and 70% reduction 
> in allocations vs a normal Builder.  For small trees performance is 
> comparable, but allocations similarly reduced.
> h2. Inserting
> It turns out that we only ever insert another tree into a tree, so we exploit 
> this to implement an efficient union of two trees, operating on them directly 
> via stacks in the transformer, instead of via a collection interface.  A 
> builder-like object is introduced that shares functionality with 
> {{FastBuilder}}, and permits us to build the result of the union directly 
> into the final nodes, reusing as much of the original trees as possible.  
> Bulk methods are used where possible.
> The result is not _uniformly_ faster, but is _significantly_ faster on 
> average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean 
> improvement of 10x.  Worst reduction is 30%, and it may be that we can 
> isolate and alleviate that.  Allocations are also reduced significantly, with 
> a median of 30% and mean of 42% for the tested workloads.  As the trees get 
> larger the improvement drops, but remains uniformly lower.
> h2. Transforming
> Transformations garbage overhead is minimal, i.e. the main allocations are 
> those necessary to represent the new tree.  It is significantly faster and 
> particularly more efficient when removing elements, utilising the shared 
> functionality of the builder and transformer objects to define an efficient 
> builder that reuses as much of the original tree as possible. 
> We also introduce dedicated {{transform}} methods (that forbid returning 
> {{null}}), and {{BiFunction}} transformations to permit ef