[ 
https://issues.apache.org/jira/browse/CASSANDRA-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-6446:
----------------------------------------

    Attachment: 6446-write-path-v2.txt

On the write patch:
* The new 'iterate over range tombstones first' path doesn't handle columns from 
'cm' that would be deleted by its own range tombstones, while the other 
'iterate over all columns' path does. Truth is, we can remove it from the old 
path: while it's actually possible for an update to contain columns 
shadowed by its own tombstones, it's unlikely, and it's not a big deal if we don't 
call indexer.remove() on them since index reads always check for stale 
entries. So let's drop that from the 'iterate over all columns' path for 
symmetry, which allows using an InOrderTester.
* Since the whole point of this range tombstone iteration is to call the index 
updater's remove method, there is no point doing anything if it's a NullUpdater 
(this is not a problem of the patch itself, but let's tackle that too).
* How did you pick the value for MAX_COLUMNS_TO_APPLY_RANGE_TOMBSTONE? Did you 
perform some kind of test, or is it a guess? If the latter, some form of 
simple perf test to check that it's reasonable would be great, but short of 
that, my own guess would be to go with one less 0.
* Nit: for some reason I'm not a fan of the name 'addImmutable'. In particular, 
it doesn't make it entirely clear that it may or may not return a copy. I 
suppose something like 'maybeCopyAndAdd' would be a bit more explicit, but I'd 
prefer leaving the copy logic in AtomicSortedColumns.

Attaching a v2 patch that shows what I have in mind for that write patch.
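To illustrate the direction for the write path, here is a simplified sketch of the two-sided strategy (the class and threshold names below are illustrative stand-ins, not the actual v2 code):

```java
import java.util.*;

// Illustrative stand-in for a range tombstone; the real class tracks
// clustering bounds and timestamps, simplified here to a long interval.
class RangeTombstone {
    final long start, end;
    RangeTombstone(long start, long end) { this.start = start; this.end = end; }
    boolean covers(long column) { return start <= column && column <= end; }
}

public class WritePathSketch {
    // Below this many columns, brute-forcing each column against every
    // tombstone is cheap enough; above it, iterating the (usually few) new
    // tombstones wins. The actual constant
    // (MAX_COLUMNS_TO_APPLY_RANGE_TOMBSTONE) is what the review asks about.
    static final int MAX_COLUMNS_TO_ITERATE = 1000;

    // Returns the columns shadowed by the new tombstones (the ones we would
    // call indexer.remove() on).
    static List<Long> shadowedColumns(SortedSet<Long> columns, List<RangeTombstone> tombstones)
    {
        List<Long> shadowed = new ArrayList<>();
        if (columns.size() <= MAX_COLUMNS_TO_ITERATE)
        {
            // Few columns: check each one against every tombstone.
            for (long c : columns)
                for (RangeTombstone rt : tombstones)
                    if (rt.covers(c)) { shadowed.add(c); break; }
        }
        else
        {
            // Many columns: iterate tombstones and range-scan the column set.
            for (RangeTombstone rt : tombstones)
                shadowed.addAll(columns.subSet(rt.start, rt.end + 1));
        }
        return shadowed;
    }

    public static void main(String[] args)
    {
        SortedSet<Long> columns = new TreeSet<>(Arrays.asList(1L, 5L, 9L, 12L));
        List<RangeTombstone> rts = Arrays.asList(new RangeTombstone(4, 10));
        System.out.println(shadowedColumns(columns, rts)); // prints [5, 9]
    }
}
```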

On the read patch:
* In RangeTombstoneList, the new searchInternal seems like unnecessary code 
duplication. Why not make 2 calls to the existing searchInternal: one to find the 
range covering the start and one for the range covering the end? And since we know 
start is before finish, we can even make the search for the range covering finish 
start at the index found by the first search.
* For NamesQueryFilter, it sounds like it's more likely that the number of 
columns searched is low while the number of tombstones is large, so I'd prefer 
looking up the names (use rangeCovering()). Since the searched columns are sorted, 
we can also check whether a fetched range covers the next searched column, to 
avoid re-searching the same range unnecessarily.
* In SliceQueryFilter: it's potentially fairly inefficient to use a single range 
covering all slices (there might very well be big holes between the slices). 
Instead, we should probably return a new iterator that iterates over each slice 
(as a side note, slices are always sorted, so you could have used start() and 
finish() to get the total covered range).
* In QueryFilter, in {{deletionInfo.getRangeCount() > 1}}, is there a rationale 
behind the '1'? I mean, even if 'source' has only one range, maybe that range 
doesn't match the filter. And if the rationale is that it's faster to add a few 
tombstones blindly than to call getRangeTombstoneIterator, why 1 and not 10? 
Whatever the rationale is, it could use a comment to justify it for future readers.
* The patch doesn't respect the [code 
style|http://wiki.apache.org/cassandra/CodeStyle].
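The 'two calls to searchInternal' suggestion above can be sketched as follows (a simplified model with a plain sorted array of range starts, not the actual RangeTombstoneList code):

```java
import java.util.*;

// Illustrative sketch of reusing one binary search for both bounds; the real
// RangeTombstoneList stores parallel starts/ends/timestamps arrays.
public class TombstoneSearchSketch {
    // Returns the index of the last range whose start is <= point, or -1 if
    // none. 'from' lets the second search begin where the first one ended.
    static int searchFrom(long[] starts, int from, long point)
    {
        int lo = from, hi = starts.length - 1, found = -1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= point) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found;
    }

    // Indexes bounding the ranges that can intersect [start, finish].
    // Since start <= finish, the second search starts at the index found by
    // the first instead of rescanning from 0.
    static int[] coveringSpan(long[] starts, long start, long finish)
    {
        int startIdx = searchFrom(starts, 0, start);
        int endIdx = searchFrom(starts, Math.max(startIdx, 0), finish);
        return new int[]{ startIdx, endIdx };
    }

    public static void main(String[] args)
    {
        long[] starts = { 0, 10, 20, 30, 40 };
        // Only ranges 1..3 can intersect [12, 35].
        System.out.println(Arrays.toString(coveringSpan(starts, 12, 35))); // prints [1, 3]
    }
}
```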

Other than that, I'm not sure how comfortable I am targeting 2.0 for this.


> Faster range tombstones on wide partitions
> ------------------------------------------
>
>                 Key: CASSANDRA-6446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6446
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>             Fix For: 2.0.4
>
>         Attachments: 6446-write-path-v2.txt, 
> RangeTombstonesReadOptimization.diff, RangeTombstonesWriteOptimization.diff
>
>
> Having wide CQL rows (~1M in single partition) and after deleting some of 
> them, we found inefficiencies in handling of range tombstones on both write 
> and read paths.
> I attached 2 patches here, one for write path 
> (RangeTombstonesWriteOptimization.diff) and another on read 
> (RangeTombstonesReadOptimization.diff).
> On the write path, when you have some CQL row deletions by primary key, each 
> deletion is represented by a range tombstone. On put of this tombstone into the 
> memtable, the original code takes all columns of the partition from the memtable 
> and checks DeletionInfo.isDeleted with a brute-force loop to decide whether each 
> column should stay in the memtable or was deleted by the new tombstone. Needless 
> to say, the more columns you have in a partition, the slower deletions get, 
> heating your CPU with brute-force range tombstone checks. 
> The RangeTombstonesWriteOptimization.diff patch, for partitions with more than 
> 10000 columns, loops over the tombstones instead and checks the existence of 
> columns for each of them. It also copies the whole memtable range tombstone list 
> only if there are changes to be made there (the original code copies the range 
> tombstone list on every write).
> On the read path, the original code scans the whole range tombstone list of a 
> partition to match sstable columns to their range tombstones. The 
> RangeTombstonesReadOptimization.diff patch scans only the necessary range of 
> tombstones, according to the filter used for the read.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
