[ 
https://issues.apache.org/jira/browse/CASSANDRA-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700035#comment-13700035
 ] 

Sylvain Lebresne commented on CASSANDRA-5677:
---------------------------------------------

So first, let's remark how inefficient is our current use of the IntervalTree. 
I wrote a small benchmark test (1 node, locally, nothing fancy) that does the 
following:
* Creates the following table: CREATE TABLE test (k int, v int, PRIMARY KEY (k, 
v))
* Inserts N (CQL3) rows for a given (fixed) partition key (so: INSERT INTO 
test(k, v) VALUES (0, <n>)).
* Deletes those N (CQL3) rows (DELETE FROM test WHERE k=0 AND v=<n>). This 
involves insert a range tombstone (because it's not a compact table).
* Queries all rows for that partition key (SELECT * FROM test WHERE k=0), thus 
getting no results. I also did the same query in revsed order to exercise that 
code path too.
I ran that 10 times (with a different partition key for each run) and timed all 
operation. For N=2K (so pretty small), on trunk the results on my machine are:
{noformat}
        |         Insertions |          Deletions |              Query |     
Reversed query
--------------------------------------------------------------------------------------------
 Run 0  |           3418.0ms |          36950.6ms |          26100.5ms |        
  26147.3ms
 Run 1  |           2295.7ms |          36073.0ms |          28388.8ms |        
  28127.0ms
 Run 2  |           1641.2ms |          36119.4ms |          26953.1ms |        
  26177.8ms
 Run 3  |           1647.0ms |          30383.9ms |          28118.1ms |        
  27737.7ms
 Run 4  |           1472.9ms |          35913.1ms |          28172.3ms |        
  28046.6ms
 Run 5  |            679.8ms |          30472.8ms |          28197.5ms |        
  27756.0ms
 Run 6  |           1417.5ms |          30428.8ms |          28022.0ms |        
  27826.3ms
 Run 7  |            657.7ms |          30366.9ms |          28047.5ms |        
  28081.4ms
 Run 8  |            662.8ms |          30369.6ms |          28123.5ms |        
  27768.7ms
 Run 9  |            667.2ms |          30459.5ms |          32821.0ms |        
  32430.0ms
 Avg    |           1456.0ms |          32753.8ms |          28294.4ms |        
  28009.9ms
 8 last |           1105.8ms |          31814.3ms |          28556.9ms |        
  28228.1ms
{noformat}
Even ignoring the 2 first run (to let the JVM warm up), both deletion and query 
take about 30 seconds each! That's obviously very broken.

Now, Fabien's patch does fix the brokenness. After rebase to trunk (for 
fairness since my tests are on trunk), and for N=10K (so 8x more that the 
previous test, the reason I've only use 2K on bare trunk is that it's too long 
with 10K :)) I get:
{noformat}
        |         Insertions |          Deletions |              Query |     
Reversed query
--------------------------------------------------------------------------------------------
 Run 0  |           3460.4ms |           2575.7ms |             69.7ms |        
     93.7ms
 Run 1  |           1223.7ms |           1772.9ms |             64.3ms |        
     57.4ms
 Run 2  |           1416.7ms |            744.3ms |             25.8ms |        
     27.9ms
 Run 3  |            673.0ms |            298.5ms |             39.3ms |        
     29.4ms
 Run 4  |            470.5ms |            666.8ms |             31.7ms |        
     25.4ms
 Run 5  |            303.0ms |            591.8ms |             34.9ms |        
     26.4ms
 Run 6  |            512.9ms |            293.0ms |             26.3ms |        
     28.1ms
 Run 7  |            437.2ms |            595.0ms |             39.0ms |        
     24.8ms
 Run 8  |            295.6ms |            494.2ms |             32.5ms |        
     23.7ms
 Run 9  |            533.8ms |            258.7ms |             32.7ms |        
     25.6ms
 Avg    |            932.7ms |            829.1ms |             39.6ms |        
     36.2ms
 8 last |            580.3ms |            492.8ms |             32.8ms |        
     26.4ms
{noformat}
So, it's sane again (the query is a lot faster than the writes because my test 
do the insert/deletes sequentially one at a time, I was mostly interested by 
read time anyway).  It's worth noting that it's not really that our current 
"centered" interval tree implementation is bad in itself, it's just that you 
can't add new interval once built which make it ill-suited for range tombstones 
(but it's fine for our other use case of storing sstables).


However, as hinted in my previous comment, we can do better and generally 
improve our handling of range tombstones by using the following properties:
# we don't care about overlapping range tombstone. If we have say the following 
range tombstones: [0, 10]@3, [5, 8]@1, [8, 15]@4 (which we currently all store 
as-is), then we'd be fine just storing: [0, 8]@3, [8, 15]@4. And in fact, 
storing the latter is more efficient (we have less ranges) and would simplify 
some things slightly (for the ColumnIndexer for instance, by knowing it can 
only have one "open" range tombstone at any time).
# During reads, we'll read range tombstone in sorted order, so we can use that 
fact to speed up their insertion to the DeletionInfo the same way we do it in 
ArrayBackedSortedColumns for columns.
# If we have a lot of range tombstones for a column family (which we can), the 
DeletionInfo can start to represent quite a lot of memory/objects, because each 
range tombstone is a separate object that has yet another DeletionTime object, 
plus the IntervalTree structure. We could do something along the lines of 
CASSANDRA-5019, but it's a lot easier in that case because the use we do of 
range tombstone is a lot more controlled.

So, I've pushed a patch at https://github.com/pcmanus/cassandra/commits/5677 
with what I have in mind. Instead of providing a generic IntervalTree 
implementation, it adds a specialized RangeTombstoneList (I take better name 
suggestions) structure just for range tombstones. That structure keeps range 
tombstones as a sorted list, and when adding a new range, it only adds the 
relevant part (it stores only [0, 8]@3, [8, 15]@4 if the 3 tombtones of my 
example above are added).  It also tries to be reasonably memory efficient 
(which makes the implementation slightly more verbose that could probably be, 
but it's well contained in the RangeTombstoneList class so I think it's worth 
it overall) and optimize for the "inserts tombstone in sorted order" case.  The 
result of the test above with that patch (N=10K to compare it to Fabien's 
patch):
{noformat}
        |         Insertions |          Deletions |              Query |     
Reversed query
--------------------------------------------------------------------------------------------
 Run 0  |           3567.9ms |           2766.4ms |             42.8ms |        
     42.4ms
 Run 1  |           1718.5ms |           1723.9ms |             62.8ms |        
     33.0ms
 Run 2  |           1288.7ms |            722.4ms |              6.1ms |        
     21.9ms
 Run 3  |            720.0ms |            363.6ms |             10.3ms |        
     27.4ms
 Run 4  |            602.3ms |            642.6ms |             14.0ms |        
     13.4ms
 Run 5  |            272.8ms |            610.8ms |              9.3ms |        
     12.3ms
 Run 6  |            492.2ms |            278.1ms |             12.5ms |        
     26.2ms
 Run 7  |            550.8ms |            621.5ms |              5.5ms |        
     14.1ms
 Run 8  |            278.5ms |            586.0ms |             10.3ms |        
     19.9ms
 Run 9  |            534.1ms |            282.8ms |             10.7ms |        
     26.0ms
 Avg    |           1002.6ms |            859.8ms |             18.4ms |        
     23.7ms
 8 last |            592.4ms |            513.5ms |              9.8ms |        
     20.2ms
{noformat}
Deletions are about as fast (maybe a few percent slower, but even that could be 
noise of the benchmark since it's not optimize for that part) but reads are 
more than 3x faster.  I will note that I did not optimize for reverse queries, 
i.e.  RangeTombstoneList always keep tombstone in comparator order, so reverse 
queries are hitting the worst possible case for that structure. It wouldn't be 
very hard to optimize for it the same way we do it in ArrayBackedSortedColumns 
but I'd rather keep that to a followup ticket because as can be seen above, 
even in the reverse case RangeTombstoneList is faster, so there is probably no 
big rush.

I'll note that my patch is against trunk. I'm not sure what to do for 1.2. 
Neither my patch nor Fabien's one are completely trivial, though at the same 
time the current performance is fairly bad if you have more than a few range 
tombstones.

                
> Performance improvements of RangeTombstones/IntervalTree
> --------------------------------------------------------
>
>                 Key: CASSANDRA-5677
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5677
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0
>            Reporter: Fabien Rousseau
>            Priority: Minor
>         Attachments: 5677-new-IntervalTree-implementation.patch
>
>
> Using massively range tombstones leads to bad response time (ie 100-500 
> ranges tombstones per row).
> After investigation, it seems that the culprit is how the DeletionInfo are 
> merged. Each time a RangeTombstone is added into the DeletionInfo, the whole 
> IntervalTree is rebuilt (thus, if you have 100 tombstones in one row, then 
> 100 instances of IntervalTree are created, the first one having one interval, 
> the second one 2 intervals,... the 100th one : 100 intervals...)
> It seems that once the IntervalTree is built, it is not possible to add a new 
> Interval. Idea is to change the implementation of the IntervalTree by another 
> one which support "insert interval".
> Attached is a proposed patch which :
>  - renames the IntervalTree implementation to IntervalTreeCentered (the 
> renaming is inspired from : http://en.wikipedia.org/wiki/Interval_tree)
>  - adds a new implementation IntervalTreeAvl (which is described here : 
> http://en.wikipedia.org/wiki/Interval_tree#Augmented_tree and here : 
> http://en.wikipedia.org/wiki/AVL_tree )
>  - adds a new interface IIntervalTree to abstract the implementation
>  - adds a new configuration option (interval_tree_provider) which allows to 
> choose between the two implementations (defaults to previous 
> IntervalTreeCentered)
>  - updates IntervalTreeTest unit tests to test both implementations
>  - creates a mini benchmark between the two implementations (tree creation, 
> point lookup, interval lookup)
>  - creates a mini benchmark between the two implementations when merging 
> DeletionInfo (which shows a big performance improvement when using 500 
> tombstones for a row)
> This patch applies for 1.2 branch...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to