[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532963#comment-17532963 ] Benjamin Lerer commented on CASSANDRA-15510: It turned out that I did not merged properly the patch in trunk :-( It is fixed now. Patch committed into 4.1 at 596daeb7f08e14d69af90fd4f07b9e87f2816681 and merged into trunk. (Note for myself: do not merge things on Friday evening before going in holidays). > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.1, 4.0.5 > > Time Spent: 10h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared >
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526543#comment-17526543 ] Benjamin Lerer commented on CASSANDRA-15510: CI runs for [4.0|https://app.circleci.com/pipelines/github/blerer/cassandra/284/workflows/76db84a8-a6a1-4364-85ce-72ed5f12081f] and [trunk|https://app.circleci.com/pipelines/github/blerer/cassandra/287/workflows/f9ad1572-460f-4b99-bfbd-ac3edaac61da] > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 10h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionali
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525645#comment-17525645 ] Benjamin Lerer commented on CASSANDRA-15510: Sorry, I was focussing on porting CASSANDRA-15511. I will merge the changes and run CI. If I remember correctly I think that the patch broke some tests. We also need to run CI on Jenkins as I do not think that CircleCi can run the burn tests. > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 10h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of th
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525569#comment-17525569 ] Branimir Lambov commented on CASSANDRA-15510: - The code LGTM, but I can't push the latest changes from [my branch|https://github.com/blambov/cassandra/tree/CASSANDRA-15510-4.0] to Benjamin's pull request, or open a pull request against his branch. [~blerer], could you copy the two last commits over so that we can run CI and merge this? > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 10h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing e
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525567#comment-17525567 ] Benedict Elliott Smith commented on CASSANDRA-15510: I think this is ready to merge? > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 10h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an efficient > builder that reuses as much of the original tree as possible. > We also introduce dedicated {{transform}} methods (that forbid returning
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518317#comment-17518317 ] Benedict Elliott Smith commented on CASSANDRA-15510: LGTM, left some minor feedback on the PR > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 9h > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an efficient > builder that reuses as much of the original tree as possible. > We also introduce dedicated {{transform}} methods (that forbid r
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510512#comment-17510512 ] Benedict Elliott Smith commented on CASSANDRA-15510: FYI I'm on holiday for a couple of weeks, so I'm unlikely to find time to counter-review before April. > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > Time Spent: 7h 20m > Remaining Estimate: 0h > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an efficient > builder that reuses as much of the original tree as possible.
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494985#comment-17494985 ] Benjamin Lerer commented on CASSANDRA-15510: [PR|https://github.com/apache/cassandra/pull/1462] for the 4.0 branch. The JMH tests still need to be ported CI result can be found [here|https://app.circleci.com/pipelines/github/blerer/cassandra/267/workflows/99bae171-a3ea-4065-8825-4648de507741] > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451892#comment-17451892 ] Michael Semb Wever commented on CASSANDRA-15510: The discussion raised on dev@ ML to include this also in 4.0.x can be read [here|https://lists.apache.org/thread/f3dl7rfc2kv9f5r9pxzyz6zojsss81b9]. > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.0.x, 4.x > > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an efficient > builder that reuses as much of the original tree as possible. > We also introduc
[jira] [Commented] (CASSANDRA-15510) BTree: Improve Building, Inserting and Transforming
[ https://issues.apache.org/jira/browse/CASSANDRA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017490#comment-17017490 ] Benedict Elliott Smith commented on CASSANDRA-15510: [3.0|http://github.com/belliottsmith/cassandra/tree/15510] > BTree: Improve Building, Inserting and Transforming > --- > > Key: CASSANDRA-15510 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15510 > Project: Cassandra > Issue Type: Improvement > Components: Local/Other >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > This work was originally undertaken as a follow-up to CASSANDRA-15367 to > ensure performance is strictly improved, but it may no longer be needed for > that purpose. It’s still hugely impactful, however. It remains to be > decided where this should land. > The current {{BTree}} implementation is suboptimal in a number of ways, with > very little focus having been given to its performance besides its > memory-occupancy. This patch aims to address that, specifically improving > the performance and allocations involved in: building, transforming and > inserting into a tree. > To facilitate this work, the {{BTree}} definition is modified slightly, so > that we can perform some simple arithmetic on tree sizes. Specifically, > trees of depth n are defined to have a maximum capacity of {{branchFactor^n - > 1}}, which translates into capping the number of leaf children at > {{branchFactor-1}}, as opposed to {{branchFactor}}. Since {{branchFactor}} > is a power of 2, this permits fast tree size arithmetic, enabling some of > these changes. > h2. Building > The static build method has been modified to utilise dedicated > {{buildPerfect}} methods that build either perfectly dense or perfectly > sparse sub-trees. These perfect trees all share their {{sizeMap}} with each > other, and can be built more efficiently than trees of arbitrary size. The > specifics are described in detail in the comments, but this building block > can be used to construct trees of any size, using at most one child at each > level that is not either perfectly sparse or perfectly dense. Bulk methods > are used where possible. > For large trees this can produce up to 30x throughput improvement and 30% > allocation reduction vs 3.0 (TBC, and to be tested vs 4.0). > {{FastBuilder}} is introduced for building a tree in-order (or in reverse) > without duplicate elements to resolve, without necessarily knowing the size > upfront. This meets the needs of most use cases. Data is built directly > into nodes, with up to one already-constructed node, and one partially > constructed node, on each level, being mutated to share their contents in the > event of insufficient data to populate the tree. These builders are > thread-locally shared. These leads to minimal copying, the same sharing of > {{sizeMap}} as above, zero wasted allocations, and results in minimal > difference in performance between utilising the less-ergonomic static build > and builder approach. > For large trees this leads to ~4.5x throughput improvement, and 70% reduction > in allocations vs a normal Builder. For small trees performance is > comparable, but allocations similarly reduced. > h2. Inserting > It turns out that we only ever insert another tree into a tree, so we exploit > this to implement an efficient union of two trees, operating on them directly > via stacks in the transformer, instead of via a collection interface. A > builder-like object is introduced that shares functionality with > {{FastBuilder}}, and permits us to build the result of the union directly > into the final nodes, reusing as much of the original trees as possible. > Bulk methods are used where possible. > The result is not _uniformly_ faster, but is _significantly_ faster on > average: median _improvement_ of 1.4x (that is, 2.4x total throughput), mean > improvement of 10x. Worst reduction is 30%, and it may be that we can > isolate and alleviate that. Allocations are also reduced significantly, with > a median of 30% and mean of 42% for the tested workloads. As the trees get > larger the improvement drops, but remains uniformly lower. > h2. Transforming > Transformations garbage overhead is minimal, i.e. the main allocations are > those necessary to represent the new tree. It is significantly faster and > particularly more efficient when removing elements, utilising the shared > functionality of the builder and transformer objects to define an efficient > builder that reuses as much of the original tree as possible. > We also introduce dedicated {{transform}} methods (that forbid returning > {{null}}), and {{BiFunction}} transformations to permit ef