[ https://issues.apache.org/jira/browse/CASSANDRA-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-6446:
----------------------------------------

    Attachment: 6446-write-path-v2.txt

On the write patch:
* The new 'iterate over range tombstones first' path doesn't handle columns from 'cm' that would be deleted by its own range tombstones, while the other 'iterate over all columns' path does. Truth is, we can remove that handling from the old path: while it's actually possible that an update contains columns shadowed by its own tombstones, it's unlikely and not a big deal if we don't call indexer.remove() on them, since index reads always check for stale entries. So let's drop that from the 'iterate over all columns' path for symmetry, which allows us to use an InOrderTester.
* Since the whole point of this range tombstone iteration is to call the index updater's remove method, there is no point in doing anything if it's a NullUpdater (this is not a problem with the patch itself, but let's tackle it too).
* How did you pick the value for MAX_COLUMNS_TO_APPLY_RANGE_TOMBSTONE? Did you perform some kind of test, or is that a guess? If the latter, some form of simple perf test to check that it's reasonable would be great, but short of that, my own guess would be to go with one less 0.
* Nit: for some reason I'm not a fan of the name 'addImmutable'. In particular, it doesn't make it entirely clear that it may or may not return a copy. I suppose something like 'maybeCopyAndAdd' would be a bit more explicit, but I'd prefer leaving the copy logic in AtomicSortedColumns.

Attaching a v2 patch that shows what I have in mind for that write patch.

On the read patch:
* In RangeTombstoneList, the new searchInternal seems like unnecessary code duplication. Why not make 2 calls to the existing searchInternal: one to find the range covering the start and one for the range covering the end? And since we know start is before finish, we could even make the search for the range covering finish start at the index found by the first search.
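As a rough illustration of the two-search idea, here is a minimal standalone sketch. The names searchInternal and coveringSpan are illustrative only, and the sorted int[] of range starts is a stand-in for RangeTombstoneList's actual representation (which works on composite column names, not ints):

```java
import java.util.Arrays;

public class TombstoneSearch
{
    // starts[] holds the (sorted) start positions of each range tombstone.
    // Returns the index of the range that may cover 'key', searching only
    // from index 'from' onwards; -1 if key sorts before every range start.
    static int searchInternal(int[] starts, int from, int key)
    {
        int idx = Arrays.binarySearch(starts, from, starts.length, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the range
        // potentially covering 'key' is the one starting just before it.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Returns {firstIdx, lastIdx} of the ranges that may cover [start, finish],
    // reusing the plain single-point search twice instead of duplicating it.
    static int[] coveringSpan(int[] starts, int start, int finish)
    {
        int first = searchInternal(starts, 0, start);
        // finish >= start, so the second search can begin where the first ended.
        int last = searchInternal(starts, Math.max(first, 0), finish);
        return new int[]{ first, last };
    }

    public static void main(String[] args)
    {
        int[] starts = { 0, 10, 20, 30 };
        int[] span = coveringSpan(starts, 12, 25);
        System.out.println(span[0] + ".." + span[1]); // 1..2
    }
}
```

Restricting the second binary search to start at the first result keeps the combined cost close to a single search when the two bounds fall in nearby ranges.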
* For NamesQueryFilter, it sounds like it's more likely that the number of columns searched is low while the number of tombstones is large. So I'd prefer looking up the names (use rangeCovering()). Since searched columns are sorted, we can also check whether a fetched range covers the next searched column, to avoid re-searching the same range unnecessarily.
* In SliceQueryFilter: it's potentially fairly inefficient to use a range covering all slices (there might very well be big holes between the slices). Instead, we probably should return a new iterator that iterates over each slice (as a side note, slices are always sorted, so you could have used start() and finish() to get the total covered range).
* In QueryFilter, in {{deletionInfo.getRangeCount() > 1}}, is there a rationale behind the '1'? I mean, even if 'source' has only one range, maybe that range doesn't match the filter. And if the rationale is that it's faster to add a few tombstones blindly than to call getRangeTombstoneIterator, why 1 and not 10? Whatever the rationale is, it could use a comment to justify it for future readers.
* The patch doesn't respect the [code style|http://wiki.apache.org/cassandra/CodeStyle].

Other than that, I'm not sure how comfortable I am targeting 2.0 for this.

> Faster range tombstones on wide partitions
> ------------------------------------------
>
>                 Key: CASSANDRA-6446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6446
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>             Fix For: 2.0.4
>
>         Attachments: 6446-write-path-v2.txt, RangeTombstonesReadOptimization.diff, RangeTombstonesWriteOptimization.diff
>
> Having wide CQL rows (~1M in a single partition) and after deleting some of them, we found inefficiencies in the handling of range tombstones on both the write and read paths.
> I attached 2 patches here, one for the write path (RangeTombstonesWriteOptimization.diff) and another for the read path (RangeTombstonesReadOptimization.diff).
> On the write path, when you have some CQL row deletions by primary key, each deletion is represented by a range tombstone. On put of this tombstone to the memtable, the original code takes all of the partition's columns from the memtable and checks DeletionInfo.isDeleted in a brute-force loop to decide whether each column should stay in the memtable or was deleted by the new tombstone. Needless to say, the more columns you have in a partition, the slower deletions get, heating your CPU with brute-force range tombstone checks.
> The RangeTombstonesWriteOptimization.diff patch, for partitions with more than 10000 columns, loops over the tombstones instead and checks the existence of columns for each of them. It also copies the whole memtable range tombstone list only if there are changes to be made there (the original code copies the range tombstone list on every write).
> On the read path, the original code scans the whole range tombstone list of a partition to match sstable columns to their range tombstones. The RangeTombstonesReadOptimization.diff patch scans only the necessary range of tombstones, according to the filter used for the read.

--
This message was sent by Atlassian JIRA
(v6.1#6144)