[
https://issues.apache.org/jira/browse/CASSANDRA-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054593#comment-17054593
]
Chandrasekhar Thumuluru commented on CASSANDRA-15397:
-
[~benedict] — Sorry for the delayed update on this ticket.
I made necessary changes to IntervalList implementation based on your feedback.
The previous submission was moved to IntervalList2. I also refactored the other
code to use the IntervalList instead of IntervalTree. You suggested to use 4
long[] arrays but I couldn't do so since the code uses a generic that's
comparable. I'm not sure if assuming long will be a good idea. Instead I
changed the implementation to use two lists one to store the interval points
and the other to store the data. Based on my performance comparison, I see the
previous submission performs better. It could be partly because the previous
implementation packs all the relevant items together. I also added the
performance comparison for 100k, 250k, 500k, 750k and 1 million SSTables based
on 1000 searches.
Based on my analysis, the performance of Interval list with (250k and above)
huge number of SSTables falls behind Interval tree by a small margin but the
trade-off is less construction cost and less garbage creation during
construction. Please take a look and let me know what you think.
> IntervalTree performance comparison with Linear Walk and Binary Search based
> Elimination.
> --
>
> Key: CASSANDRA-15397
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15397
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/SSTable
>Reporter: Chandrasekhar Thumuluru
>Assignee: Chandrasekhar Thumuluru
>Priority: Low
> Labels: pull-request-available
> Attachments: 90p_100k_sstables_with_1000_searches.png,
> 90p_1million_sstables_with_1000_searches.png,
> 90p_250k_sstables_with_1000_searches.png,
> 90p_500k_sstables_with_1000_searches.png,
> 90p_750k_sstables_with_1000_searches.png,
> 95p_1_SSTable_with_5000_Searches.png,
> 95p_100k_sstables_with_1000_searches.png,
> 95p_15000_SSTable_with_5000_Searches.png,
> 95p_1million_sstables_with_1000_searches.png,
> 95p_2_SSTable_with_5000_Searches.png,
> 95p_25000_SSTable_with_5000_Searches.png,
> 95p_250k_sstables_with_1000_searches.png,
> 95p_3_SSTable_with_5000_Searches.png,
> 95p_5000_SSTable_with_5000_Searches.png,
> 95p_500k_sstables_with_1000_searches.png,
> 95p_750k_sstables_with_1000_searches.png,
> 99p_1_SSTable_with_5000_Searches.png,
> 99p_100k_sstables_with_1000_searches.png,
> 99p_15000_SSTable_with_5000_Searches.png,
> 99p_1million_sstables_with_1000_searches.png,
> 99p_2_SSTable_with_5000_Searches.png,
> 99p_25000_SSTable_with_5000_Searches.png,
> 99p_250k_sstables_with_1000_searches.png,
> 99p_3_SSTable_with_5000_Searches.png,
> 99p_5000_SSTable_with_5000_Searches.png,
> 99p_500k_sstables_with_1000_searches.png,
> 99p_750k_sstables_with_1000_searches.png, IntervalList.java,
> IntervalListWithElimination.java, IntervalTreeSimplified.java,
> Mean_1_SSTable_with_5000_Searches.png,
> Mean_100k_sstables_with_1000_searches.png,
> Mean_15000_SSTable_with_5000_Searches.png,
> Mean_1million_sstables_with_1000_searches.png,
> Mean_2_SSTable_with_5000_Searches.png,
> Mean_25000_SSTable_with_5000_Searches.png,
> Mean_250k_sstables_with_1000_searches.png,
> Mean_3_SSTable_with_5000_Searches.png,
> Mean_5000_SSTable_with_5000_Searches.png,
> Mean_500k_sstables_with_1000_searches.png,
> Mean_750k_sstables_with_1000_searches.png, TESTS-TestSuites.xml.lz4,
> replace_intervaltree_with_intervallist.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Cassandra uses IntervalTrees to identify the SSTables that overlap with
> search interval. In Cassandra, IntervalTrees are not mutated. They are
> recreated each time a mutation is required. This can be an issue during
> repairs. In fact we noticed such issues during repair.
> Since lists are cache friendly compared to linked lists and trees, I decided
> to compare the search performance with:
> * Linear Walk.
> * Elimination using Binary Search (idea is to eliminate intervals using start
> and end points of search interval).
> Based on the tests I ran, I noticed Binary Search based elimination almost
> always performs similar to IntervalTree or out performs IntervalTree based
> search. The cost of IntervalTree construction is also substantial and
> produces lot of garbage during repairs.
> I ran the tests using random intervals to build the tree/lists and another
> randomly generated search interval with 5000 iterations. I'm attaching all
> the relevant graphs. The x-axis in the graphs is the