[jira] [Updated] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
[ https://issues.apache.org/jira/browse/CASSANDRA-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8603: Attachment: cassandra-2.1-8603_v2.txt The equality in the system.log was in the system keyspace (system/local-7ad54392bcdd35a684174e047860b377) This is because of the code in RangeTombstone.java:242 {code} RangeTombstone t = new RangeTombstone(cell.name(), cell.name(), cell.timestamp(), 0); {code} I've provided a patch v2 that actually catches the case I intended and replaces the duplicated ByteBuffer by using the Composite.end() method to construct a BoundedComposite referencing the original. {code} super(start, (start != stop && stop.equals(start.end())) ? start.end() : stop, delTime); {code} > Cut tombstone memory footprint in half for cql deletes > -- > > Key: CASSANDRA-8603 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dominic Letz > Labels: tombstone > Attachments: cassandra-2.0.11-8603.txt, cassandra-2.1-8603.txt, > cassandra-2.1-8603_v2.txt, system.log > > > As CQL does not yet support range deletes every delete from CQL results in a > "Semi-RangeTombstone" which actually has the same start and end values - but > until today they are copies. Effectively doubling the required heap memory to > store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
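The start/stop sharing idea in the v2 patch can be sketched in isolation. The classes below are simplified stand-ins, not the real Cassandra `Composite`/`RangeTombstone` types: when the stop bound passed in is structurally equal to `start.end()`, the constructor stores the instance cached by `start.end()` instead of the duplicate, so both bounds of a CQL-delete tombstone reference one underlying object.

```java
import java.util.Objects;

// Stand-in for a composite bound: a value plus an end-of-component flag.
final class Bound {
    final String value;   // stand-in for the composite's byte content
    final int eoc;        // end-of-component flag: -1 = range start, +1 = range end
    private Bound endView;

    Bound(String value, int eoc) { this.value = value; this.eoc = eoc; }

    // Returns (and caches) the "end" view of this bound, referencing the
    // same value rather than copying it.
    Bound end() {
        if (endView == null) endView = new Bound(value, 1);
        return endView;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Bound)) return false;
        Bound b = (Bound) o;
        return value.equals(b.value) && eoc == b.eoc;
    }
    @Override public int hashCode() { return Objects.hash(value, eoc); }
}

final class TombstoneSketch {
    final Bound start, stop;
    TombstoneSketch(Bound start, Bound stop) {
        this.start = start;
        // The patched constructor logic: collapse a duplicated end bound
        // onto the instance cached by start.end().
        this.stop = (start != stop && stop.equals(start.end())) ? start.end() : stop;
    }
}

public class SharedBoundDemo {
    public static void main(String[] args) {
        Bound start = new Bound("key", -1);
        Bound duplicatedStop = new Bound("key", 1);   // structurally equal copy
        TombstoneSketch t = new TombstoneSketch(start, duplicatedStop);
        System.out.println(t.stop == start.end());    // true: the copy was collapsed
    }
}
```

The identity check at the end shows the duplicate was dropped: the tombstone's stop field is reference-equal to the shared `start.end()` instance, so the second copy of the bound becomes garbage immediately.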
[jira] [Commented] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
[ https://issues.apache.org/jira/browse/CASSANDRA-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279632#comment-14279632 ] Dominic Letz commented on CASSANDRA-8603: - I used 2.1.2 for this testing. > Cut tombstone memory footprint in half for cql deletes > -- > > Key: CASSANDRA-8603 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dominic Letz > Labels: tombstone > Attachments: cassandra-2.0.11-8603.txt, cassandra-2.1-8603.txt, > system.log > > > As CQL does not yet support range deletes every delete from CQL results in a > "Semi-RangeTombstone" which actually has the same start and end values - but > until today they are copies. Effectively doubling the required heap memory to > store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
[ https://issues.apache.org/jira/browse/CASSANDRA-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8603: Attachment: system.log Sorry for the long delay here. I did some testing before, but your comment made me look deeper, and indeed something is wrong. First off, the attached patches do not catch the situation they were intended to catch - but they do catch something else, and that misled me. So maybe there is a bug, as you pointed out. The test case I'm using is this: {code} CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }; CREATE TABLE test.timeseries (name text, timestamp int, value text, PRIMARY KEY ((name), timestamp)) WITH compaction = {'class': 'SizeTieredCompactionStrategy'} AND CLUSTERING ORDER BY (timestamp DESC); INSERT INTO test.timeseries (name, timestamp, value) VALUES('a', 1, 'hello'); INSERT INTO test.timeseries (name, timestamp, value) VALUES('a', 2, 'world'); INSERT INTO test.timeseries (name, timestamp, value) VALUES('a', 5, 'hello world'); DELETE FROM test.timeseries WHERE name = 'a' AND timestamp = 1; DELETE FROM test.timeseries WHERE name = 'a' AND timestamp = 2; DELETE FROM test.timeseries WHERE name = 'a' AND timestamp = 5; {code} Plus adding this log line next to the initial patch content: {code} public RangeTombstone(Composite start, Composite stop, DeletionTime delTime) { super(start, stop.equals(start) ? start : stop, delTime); LoggerFactory.getLogger(RangeTombstone.class).error(String.format("Test: start.equals(stop) %s (%s == %s) %s, ", start.equals(stop), new String(Hex.encodeHex(start.toByteBuffer().array())), new String(Hex.encodeHex(stop.toByteBuffer().array())), start.toByteBuffer().compareTo(stop.toByteBuffer()))); } {code} Then I wiped the Cassandra data directory, started from scratch, and once it was up ran the CQL above. 
The resulting system.log is attached. During start-up there are a couple of hits on the new logic where the tombstone's .start and .end are actually exactly the same, as shown by the debug output: {code} ERROR [CompactionExecutor:1] 2015-01-16 08:50:57,476 RangeTombstone.java:48 - Test: start.equals(stop) true (00 == 00) 0, ERROR [CompactionExecutor:1] 2015-01-16 08:50:57,478 RangeTombstone.java:48 - Test: start.equals(stop) false (0006746f6b656e73ff == 0006746f6b656e7301) -2, ERROR [CompactionExecutor:1] 2015-01-16 08:50:57,480 RangeTombstone.java:48 - Test: start.equals(stop) true (000c626f6f74737472617070656400 == 000c626f6f74737472617070656400) 0, ERROR [CompactionExecutor:1] 2015-01-16 08:50:57,481 RangeTombstone.java:48 - Test: start.equals(stop) true (000c636c75737465725f6e616d6500 == 000c636c75737465725f6e616d6500) 0, ... {code} But further down in the log, when the actual test CQL is executed, these messages pop up: {code} ERROR [Thrift:1] 2015-01-16 08:51:14,089 RangeTombstone.java:48 - Test: start.equals(stop) false (00040001ff == 0004000101) -2, ERROR [Thrift:1] 2015-01-16 08:51:14,092 RangeTombstone.java:48 - Test: start.equals(stop) false (00040002ff == 0004000201) -2, ERROR [Thrift:1] 2015-01-16 08:51:14,094 RangeTombstone.java:48 - Test: start.equals(stop) false (00040005ff == 0004000501) -2, {code} These show that even though the keys match up to the provided sample values (1, 2, 5), there is an additional trailing byte creating a difference, so the code does not catch the case. It seems I wasn't deep enough into the code to make my proposed change work, as I don't know the meaning of that last byte - but the concept might still hold? 
> Cut tombstone memory footprint in half for cql deletes > -- > > Key: CASSANDRA-8603 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dominic Letz > Labels: tombstone > Attachments: cassandra-2.0.11-8603.txt, cassandra-2.1-8603.txt, > system.log > > > As CQL does not yet support range deletes every delete from CQL results in a > "Semi-RangeTombstone" which actually has the same start and end values - but > until today they are copies. Effectively doubling the required heap memory to > store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
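A plausible reading of the trailing byte, offered as an assumption rather than something confirmed in this thread: in the 2.x storage engine a serialized composite ends with an end-of-component (EOC) marker, where 0xff denotes a range-start bound, 0x01 a range-end bound, and 0x00 an exact name. The logged pairs such as 00040001ff vs. 0004000101 then differ only in that final byte, which would explain why stop.equals(start) is false for a CQL row delete. The helper below is hypothetical, not Cassandra code; it just demonstrates the "equal except for the EOC byte" comparison on the logged hex strings.

```java
// Hypothetical helper: compare two hex-encoded composites while ignoring the
// trailing EOC byte (two hex characters). Under the EOC assumption above,
// the start/stop bounds of a single-row delete compare equal this way.
public class EocDemo {
    static boolean sameExceptEoc(String startHex, String stopHex) {
        if (startHex.length() != stopHex.length()) return false;
        int cut = startHex.length() - 2;   // drop the trailing EOC byte
        return startHex.substring(0, cut).equals(stopHex.substring(0, cut));
    }

    public static void main(String[] args) {
        // Hex pairs taken from the attached system.log output above.
        System.out.println(sameExceptEoc("00040001ff", "0004000101")); // true
        System.out.println(sameExceptEoc("00040001ff", "0004000201")); // false
    }
}
```

If the EOC reading is right, a patch could not simply compare the serialized bounds for equality; it would need to compare the component values while treating the EOC flags as the distinguishing start/end markers.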
[jira] [Updated] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8559: Labels: tombstone (was: ) > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Labels: tombstone > Fix For: 2.0.12 > > Attachments: Selection_048.png, cassandra-2.0.11-8559.txt, > stacktrace.log > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. > {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8547) Make RangeTombstone.Tracker.isDeleted() faster
[ https://issues.apache.org/jira/browse/CASSANDRA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8547: Labels: tombstone (was: ) > Make RangeTombstone.Tracker.isDeleted() faster > -- > > Key: CASSANDRA-8547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8547 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 >Reporter: Dominic Letz >Assignee: Dominic Letz > Labels: tombstone > Fix For: 2.1.3 > > Attachments: Selection_044.png, cassandra-2.0.11-8547.txt, > cassandra-2.1-8547.txt, rangetombstone.tracker.txt > > > During compaction and repairs with many tombstones an exorbitant amount of > time is spend in RangeTombstone.Tracker.isDeleted(). > The amount of time spend there can be so big that compactions and repairs > look "stalled" and the time remaining time estimated frozen at the same value > for days. > Using visualvm I've been sample profiling the code during execution and both > in Compaction as well as during repairs found this. (point in time backtraces > attached) > Looking at the code the problem is obviously the linear scanning: > {code} > public boolean isDeleted(Column column) > { > for (RangeTombstone tombstone : ranges) > { > if (comparator.compare(column.name(), tombstone.min) >= 0 > && comparator.compare(column.name(), tombstone.max) <= 0 > && tombstone.maxTimestamp() >= column.timestamp()) > { > return true; > } > } > return false; > } > {code} > I would like to propose to change this and instead use a sorted list (e.g. > RangeTombstoneList) here instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
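The sorted-list proposal in CASSANDRA-8547 can be sketched as follows. This is illustrative only: bounds are simplified to ints, ranges are assumed non-overlapping and sorted by their min bound (an invariant RangeTombstoneList already maintains), and the binary search replaces the linear scan shown in the quoted isDeleted().

```java
// Sketch of an O(log N) isDeleted(): binary-search for the rightmost range
// whose min bound is <= the cell name, then check that single candidate.
final class SortedTombstones {
    private final int[] mins, maxs;
    private final long[] timestamps;

    SortedTombstones(int[] mins, int[] maxs, long[] timestamps) {
        this.mins = mins; this.maxs = maxs; this.timestamps = timestamps;
    }

    boolean isDeleted(int name, long cellTimestamp) {
        int lo = 0, hi = mins.length - 1, idx = -1;
        while (lo <= hi) {                     // rightmost range with min <= name
            int mid = (lo + hi) >>> 1;
            if (mins[mid] <= name) { idx = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        // Deleted iff the candidate range covers the name and is newer.
        return idx >= 0 && name <= maxs[idx] && timestamps[idx] >= cellTimestamp;
    }
}

public class IsDeletedDemo {
    public static void main(String[] args) {
        SortedTombstones ts = new SortedTombstones(
                new int[]{1, 10}, new int[]{5, 20}, new long[]{100, 100});
        System.out.println(ts.isDeleted(3, 50));   // true: inside [1,5], older cell
        System.out.println(ts.isDeleted(7, 50));   // false: in the gap between ranges
    }
}
```

Because only one candidate range can cover the name when ranges are non-overlapping, the per-cell cost drops from O(N) comparisons to O(log N), which is what makes compactions over tombstone-heavy partitions stop looking stalled.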
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Labels: tombstone (was: ) > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Joshua McKenzie > Labels: tombstone > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > rangetombstonelist_compaction.png, rangetombstonelist_mutation.png, > rangetombstonelist_read.png, tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
[ https://issues.apache.org/jira/browse/CASSANDRA-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8603: Attachment: cassandra-2.1-8603.txt Attached trivial patches for 2.0.11 and 2.1 > Cut tombstone memory footprint in half for cql deletes > -- > > Key: CASSANDRA-8603 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dominic Letz > Labels: tombstone > Attachments: cassandra-2.0.11-8603.txt, cassandra-2.1-8603.txt > > > As CQL does not yet support range deletes every delete from CQL results in a > "Semi-RangeTombstone" which actually has the same start and end values - but > until today they are copies. Effectively doubling the required heap memory to > store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
[ https://issues.apache.org/jira/browse/CASSANDRA-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8603: Attachment: cassandra-2.0.11-8603.txt > Cut tombstone memory footprint in half for cql deletes > -- > > Key: CASSANDRA-8603 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Dominic Letz > Labels: tombstone > Attachments: cassandra-2.0.11-8603.txt > > > As CQL does not yet support range deletes every delete from CQL results in a > "Semi-RangeTombstone" which actually has the same start and end values - but > until today they are copies. Effectively doubling the required heap memory to > store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8603) Cut tombstone memory footprint in half for cql deletes
Dominic Letz created CASSANDRA-8603: --- Summary: Cut tombstone memory footprint in half for cql deletes Key: CASSANDRA-8603 URL: https://issues.apache.org/jira/browse/CASSANDRA-8603 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Dominic Letz As CQL does not yet support range deletes every delete from CQL results in a "Semi-RangeTombstone" which actually has the same start and end values - but until today they are copies. Effectively doubling the required heap memory to store the RangeTombstone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8559: Attachment: cassandra-2.0.11-8559.txt Added a variable tombstone_warn_message_len to limit the length of the generated string representation of DeletionInfo. I'm not completely happy with the naming of the variable as it really only affects DeletionInfo::toString(). Food for thought. > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Fix For: 2.0.12 > > Attachments: Selection_048.png, cassandra-2.0.11-8559.txt, > stacktrace.log > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. > {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8559: Attachment: stacktrace.log Attaching stacktrace of the very crash (Cassandra 2.0.11). It shows that the crash is actually in DeletionInfo.toString(). > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Attachments: Selection_048.png, stacktrace.log > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. > {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: rangetombstonelist_read.png rangetombstonelist_mutation.png rangetombstonelist_compaction.png Adding screenshots from visualvm for read, mutation, compaction before the patch. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Benedict > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > rangetombstonelist_compaction.png, rangetombstonelist_mutation.png, > rangetombstonelist_read.png, tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265596#comment-14265596 ] Dominic Letz commented on CASSANDRA-8559: - To qualify that a bit more: you are right [~slebresne], deletionInfo() is the big one. The fragment seen in the screenshot shows what is generated in DeletionInfo.toString(), and the second 700mb string is what rangesAsString() generates and then copies into the first. So the huge dump is actually generated here in DeletionInfo: {code} private String rangesAsString() { assert !ranges.isEmpty(); StringBuilder sb = new StringBuilder(); CType type = (CType)ranges.comparator(); assert type != null; Iterator<RangeTombstone> iter = rangeIterator(); while (iter.hasNext()) { RangeTombstone i = iter.next(); sb.append("["); sb.append(type.getString(i.min)).append("-"); sb.append(type.getString(i.max)).append(", "); sb.append(i.data); sb.append("]"); } return sb.toString(); } {code} Should this be limited to a maximum (e.g. 1000 ranges)? > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Attachments: Selection_048.png > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. 
> {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
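The "limit it to a maximum" idea floated above can be sketched like this. Ranges are simplified to int pairs and `maxRanges` is a hypothetical knob, not an existing Cassandra setting; the point is only that capping the loop bounds the string size regardless of how many tombstone ranges exist.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of a bounded rangesAsString(): stop after maxRanges entries
// and note how many were omitted, so the warning can never balloon into a
// multi-hundred-megabyte string.
public class BoundedRanges {
    static String rangesAsString(List<int[]> ranges, int maxRanges) {
        StringBuilder sb = new StringBuilder();
        int shown = 0;
        for (int[] r : ranges) {
            if (shown++ == maxRanges) {
                // Record the omission instead of appending the remainder.
                sb.append("... (").append(ranges.size() - maxRanges).append(" more)");
                break;
            }
            sb.append('[').append(r[0]).append('-').append(r[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> ranges = Arrays.asList(
                new int[]{1, 2}, new int[]{3, 4}, new int[]{5, 6});
        System.out.println(rangesAsString(ranges, 2));   // [1-2][3-4]... (1 more)
    }
}
```

A count-based cap like this keeps the "(N more)" signal, which is arguably more useful for diagnosing the tombstone problem than a raw character-length cutoff that truncates mid-range.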
[jira] [Updated] (CASSANDRA-8547) Make RangeTombstone.Tracker.isDeleted() faster
[ https://issues.apache.org/jira/browse/CASSANDRA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8547: Attachment: Selection_044.png Attaching a screenshot of visualvm showing where a typical compaction job is spending its time for us without the patch. > Make RangeTombstone.Tracker.isDeleted() faster > -- > > Key: CASSANDRA-8547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8547 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: Selection_044.png, cassandra-2.0.11-8547.txt, > cassandra-2.1-8547.txt, rangetombstone.tracker.txt > > > During compaction and repairs with many tombstones an exorbitant amount of > time is spend in RangeTombstone.Tracker.isDeleted(). > The amount of time spend there can be so big that compactions and repairs > look "stalled" and the time remaining time estimated frozen at the same value > for days. > Using visualvm I've been sample profiling the code during execution and both > in Compaction as well as during repairs found this. (point in time backtraces > attached) > Looking at the code the problem is obviously the linear scanning: > {code} > public boolean isDeleted(Column column) > { > for (RangeTombstone tombstone : ranges) > { > if (comparator.compare(column.name(), tombstone.min) >= 0 > && comparator.compare(column.name(), tombstone.max) <= 0 > && tombstone.maxTimestamp() >= column.timestamp()) > { > return true; > } > } > return false; > } > {code} > I would like to propose to change this and instead use a sorted list (e.g. > RangeTombstoneList) here instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265568#comment-14265568 ] Dominic Letz commented on CASSANDRA-8546: - This all makes sense. From my reading it also appeared that the way RangeTombstoneList is used in reads vs. in compaction and other places is so different that each could use a data structure optimized for its own use case rather than re-using RangeTombstoneList everywhere. CASSANDRA-8099 might be an opportunity to change that. Until then, though, the GapList change could be a minimal-impact fix. It does not change the structure of the code and does not regress the read path, while fixing the reverse-read issue and giving linear speedups: - Read in forward direction stays as is. - Read in reverse moves from O(n) to O(1). - grow() becomes 4x faster since there is only one array to grow. - insertFrom() becomes at least 8x faster (only one array to shift instead of four, and in the worst case only every second insert). Obviously I'm scratching my own itch here to improve reverse reads and compaction with tombstone-heavy workloads - is there any way I can help make that happen for 2.1? > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Benedict > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. 
> Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
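The "O(1) with locality" claim for GapList rests on the gap-buffer idea: all elements sit in one backing array with a movable gap, so repeated inserts at or near the same position shift nothing. The sketch below is a minimal gap buffer written for illustration, not the actual GapList (or RangeTombstoneList) code.

```java
// Minimal gap buffer: one int array with a movable gap [gapStart, gapEnd).
// Inserting repeatedly at the gap position is O(1); moving the gap costs
// proportional to the distance moved, which is what gives clustered inserts
// (like overlapping tombstone ranges) their cheap amortized cost.
final class GapBuffer {
    private int[] buf = new int[8];
    private int gapStart = 0, gapEnd = 8;

    int size() { return buf.length - (gapEnd - gapStart); }

    void insert(int index, int value) {
        ensureGap();
        moveGap(index);
        buf[gapStart++] = value;
    }

    int get(int index) {
        return index < gapStart ? buf[index] : buf[index + (gapEnd - gapStart)];
    }

    private void moveGap(int index) {
        while (gapStart > index) buf[--gapEnd] = buf[--gapStart]; // slide gap left
        while (gapStart < index) buf[gapStart++] = buf[gapEnd++]; // slide gap right
    }

    private void ensureGap() {
        if (gapStart != gapEnd) return;            // gap not exhausted yet
        int[] next = new int[buf.length * 2];      // grow a single array
        System.arraycopy(buf, 0, next, 0, gapStart);
        int tail = buf.length - gapEnd;
        System.arraycopy(buf, gapEnd, next, next.length - tail, tail);
        gapEnd = next.length - tail;
        buf = next;
    }
}

public class GapBufferDemo {
    public static void main(String[] args) {
        GapBuffer g = new GapBuffer();
        for (int i = 0; i < 20; i++) g.insert(i, i); // tail inserts: no shifting
        g.insert(10, 99);                            // one gap move, then O(1) again
        System.out.println(g.get(10) + " " + g.get(20) + " " + g.size());
    }
}
```

Note the single backing array: growing it is one allocation and two copies, which is the "only one array to grow" point above, in contrast to RangeTombstoneList's four parallel arrays.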
[jira] [Commented] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264538#comment-14264538 ] Dominic Letz commented on CASSANDRA-8559: - Both are true: there are huge gaps filled with many tombstones that do this to the query, and we like the error message that warns us about it and helps us fix the issue. But in this case the error message itself was the problem. > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Attachments: Selection_048.png > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. > {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264520#comment-14264520 ] Dominic Letz edited comment on CASSANDRA-8546 at 1/5/15 12:00 PM: -- Sorry for the assignment - Tyler mentioned you would look into it but assigned it back to me. Since he didn't mention anything for me to do, I assumed that was an error and he meant to assign it to you. Good call on BTree vs. GapList: the BTree does guarantee O(log N), but you pay that cost in all cases, while GapList gives O(1) on front and back inserts, and O(1) for middle inserts with the same locality - and insertFrom() is doing exactly that on overlapping ranges. It will, however, degrade to the mentioned O(N) on random non-overlapping tombstones. The true reason I went for GapList, though, was that it allowed me to keep most of the code unchanged, as it also supports index-based access. Now, though, I think I know the code well enough to replace it with a TreeMap or similar. Let me know whether such a changed patch would be helpful. was (Author: dominicletz): Sorry for the assignment - Tyler mentioned you would look into it but assigned it back to me. Since he didn't mention anything for me to do, I assumed that was an error and he meant to assign it to you. Good call on BTree vs. GapList: the BTree does guarantee O(log N), but you pay that cost in all cases, while GapList gives O(1) on front and back inserts, and O(1) for middle inserts with the same locality - and insertFrom(), which is doing exactly that on overlapping ranges. The true reason I went for GapList, though, was that it allowed me to keep most of the code unchanged, as it also supports index-based access. Now, though, I think I know the code well enough to replace it with a TreeMap or similar. Let me know whether such a changed patch would be helpful. 
> RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Benedict > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264520#comment-14264520 ] Dominic Letz commented on CASSANDRA-8546: - Sorry for the assignment; Tyler mentioned you would look into it but assigned it back to me. Since he didn't mention anything for me to do, I assumed that was an error and that he meant to assign it to you, since he had mentioned you. Good call on BTree vs. GapList: the BTree does guarantee O(log N), but O(log N) is also what you get in all cases. GapList, by contrast, gives O(1) on front and back inserts, and O(1) for middle inserts with the same locality - and insertFrom() does exactly that on overlapping ranges. The real reason I went for GapList, though, was that it allowed me to keep most of the code unchanged, since it also supports index-based access. By now I think I know the code well enough to replace it with a TreeMap or similar. Let me know whether such a changed patch would be helpful. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Benedict > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements.
> The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
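The asymmetry described in CASSANDRA-8546 comes down to array-backed sorted inserts: every front or middle insert shifts all later elements. A standalone Java sketch of that cost pattern (illustrative only; sortedInsert is a hypothetical helper, not Cassandra's addInternal()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MiddleInsertCost {
    // Insert 'value' into an already-sorted list, keeping it sorted.
    // In an array-backed list, add(idx, value) shifts every element
    // after idx, so repeated front/middle inserts cost O(N) each.
    static void sortedInsert(List<Integer> list, int value) {
        int idx = Collections.binarySearch(list, value);
        if (idx < 0) idx = -idx - 1; // decode a miss into its insertion point
        list.add(idx, value);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        // Descending order means every insert lands at the front,
        // mirroring the "DELETE 11...6" half of the attached test.
        for (int i = 10; i >= 1; i--) sortedInsert(list, i);
        System.out.println(list);
    }
}
```

The descending loop performs a front insert on every step - the O(N)-per-insert behavior the ticket measures - while an ascending, append-only loop would shift nothing.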
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Assignee: Benedict (was: Dominic Letz) > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Benedict > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264185#comment-14264185 ] Dominic Letz edited comment on CASSANDRA-8546 at 1/5/15 6:05 AM: - Seems I'm the only one having fun with this one. I've done some more profiling and found that when middle-inserts and searches alternate, the GapList's own binarySearch does not perform well, as it "normalizes" the GapList on each search. In this update I'm using the built-in Collections.binarySearch instead. was (Author: dominicletz): Seems I'm the only one having fun with this one. I've done some more profiling and found that when there are is a pendulum between middle-inserts and searching the GapLists own binarySearch is not performing well as it "normalizes" the GapList on each search. In this update I'm using the built-in Collections.binarySearch instead. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. 
> INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
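For reference, the JDK method the update switches to performs a plain binary search without restructuring the underlying list; a miss is encoded as -(insertionPoint) - 1. A minimal sketch of that contract (insertionPoint here is a hypothetical decoding helper, not part of the patch):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BinarySearchDemo {
    // Returns the index where 'key' is (or would be inserted) in the
    // sorted list, decoding Collections.binarySearch's return value:
    // a hit returns the index, a miss returns -(insertionPoint) - 1.
    static int insertionPoint(List<Integer> sorted, int key) {
        int idx = Collections.binarySearch(sorted, key);
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args) {
        List<Integer> starts = Arrays.asList(10, 20, 30, 40);
        System.out.println(Collections.binarySearch(starts, 30)); // exact hit: index 2
        System.out.println(Collections.binarySearch(starts, 25)); // miss: -(2) - 1 = -3
        System.out.println(insertionPoint(starts, 25));           // would insert at 2
    }
}
```

Because the search is read-only, alternating it with middle inserts avoids the repeated "normalization" cost the comment describes.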
[jira] [Updated] (CASSANDRA-8559) OOM caused by large tombstone warning.
[ https://issues.apache.org/jira/browse/CASSANDRA-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8559: Description: When running with a high amount of tombstones, the error message generation from CASSANDRA-6117 can lead to an out-of-memory situation with the default setting. Attached is a heap dump viewed in VisualVM showing how this construct created two 777 MB strings to print the error message for a read query and then crashed with an OOM. {code} if (respectTombstoneThresholds() && columnCounter.ignored() > DatabaseDescriptor.getTombstoneWarnThreshold()) { StringBuilder sb = new StringBuilder(); CellNameType type = container.metadata().comparator; for (ColumnSlice sl : slices) { assert sl != null; sb.append('['); sb.append(type.getString(sl.start)); sb.append('-'); sb.append(type.getString(sl.finish)); sb.append(']'); } logger.warn("Read {} live and {} tombstoned cells in {}.{} (see tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", columnCounter.live(), columnCounter.ignored(), container.metadata().ksName, container.metadata().cfName, count, sb, container.deletionInfo()); } {code} was: When running with high amount of tombstones the error message generation from CASSANDRA-6117 can lead to out of memory situation with the default setting. Attached a heapdump viewed in visualvm showing how this construct is create a 777mb string to print the error message for a single read query. {code} if (respectTombstoneThresholds() && columnCounter.ignored() > DatabaseDescriptor.getTombstoneWarnThreshold()) { StringBuilder sb = new StringBuilder(); CellNameType type = container.metadata().comparator; for (ColumnSlice sl : slices) { assert sl != null; sb.append('['); sb.append(type.getString(sl.start)); sb.append('-'); sb.append(type.getString(sl.finish)); sb.append(']'); } logger.warn("Read {} live and {} tombstoned cells in {}.{} (see tombstone_warn_threshold). 
{} columns was requested, slices={}, delInfo={}", columnCounter.live(), columnCounter.ignored(), container.metadata().ksName, container.metadata().cfName, count, sb, container.deletionInfo()); } {code} > OOM caused by large tombstone warning. > -- > > Key: CASSANDRA-8559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Attachments: Selection_048.png > > > When running with high amount of tombstones the error message generation from > CASSANDRA-6117 can lead to out of memory situation with the default setting. > Attached a heapdump viewed in visualvm showing how this construct created two > 777mb strings to print the error message for a read query and then crashed > OOM. > {code} > if (respectTombstoneThresholds() && columnCounter.ignored() > > DatabaseDescriptor.getTombstoneWarnThreshold()) > { > StringBuilder sb = new StringBuilder(); > CellNameType type = container.metadata().comparator; > for (ColumnSlice sl : slices) > { > assert sl != null; > sb.append('['); > sb.append(type.getString(sl.start)); > sb.append('-'); > sb.append(type.getString(sl.finish)); > sb.append(']'); > } > logger.warn("Read {} live and {} tombstoned cells in {}.{} (see > tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", > columnCounter.live(), columnCounter.ignored(), > container.metadata().ksName, container.metadata().cfName, count, sb, > container.deletionInfo()); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8559) OOM caused by large tombstone warning.
Dominic Letz created CASSANDRA-8559: --- Summary: OOM caused by large tombstone warning. Key: CASSANDRA-8559 URL: https://issues.apache.org/jira/browse/CASSANDRA-8559 Project: Cassandra Issue Type: Bug Components: Core Environment: 2.0.11 / 2.1 Reporter: Dominic Letz Attachments: Selection_048.png When running with a high amount of tombstones, the error message generation from CASSANDRA-6117 can lead to an out-of-memory situation with the default setting. Attached is a heap dump viewed in VisualVM showing how this construct created a 777 MB string to print the error message for a single read query. {code} if (respectTombstoneThresholds() && columnCounter.ignored() > DatabaseDescriptor.getTombstoneWarnThreshold()) { StringBuilder sb = new StringBuilder(); CellNameType type = container.metadata().comparator; for (ColumnSlice sl : slices) { assert sl != null; sb.append('['); sb.append(type.getString(sl.start)); sb.append('-'); sb.append(type.getString(sl.finish)); sb.append(']'); } logger.warn("Read {} live and {} tombstoned cells in {}.{} (see tombstone_warn_threshold). {} columns was requested, slices={}, delInfo={}", columnCounter.live(), columnCounter.ignored(), container.metadata().ksName, container.metadata().cfName, count, sb, container.deletionInfo()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
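One possible mitigation for this kind of OOM is to bound the diagnostic string while it is being built, so the warning itself can never grow without limit. This is only a sketch of the idea, not the fix Cassandra adopted; MAX_CHARS is a hypothetical cap, not an actual Cassandra setting:

```java
public class BoundedSliceDump {
    // Hypothetical cap on the diagnostic string; not a real Cassandra knob.
    static final int MAX_CHARS = 1024;

    // Append slice descriptions like "[start-finish]" but stop once the
    // buffer reaches MAX_CHARS, so a pathological number of slices cannot
    // balloon the warning message into hundreds of megabytes.
    static String describe(String[][] slices) {
        StringBuilder sb = new StringBuilder();
        for (String[] sl : slices) {
            if (sb.length() >= MAX_CHARS) {
                sb.append("...(truncated)");
                break;
            }
            sb.append('[').append(sl[0]).append('-').append(sl[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new String[][]{{"a", "b"}, {"c", "d"}}));
    }
}
```

The key point is that truncation happens during construction; checking the length only after the StringBuilder is full would not prevent the allocation.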
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: (was: cassandra-2.1-8546.txt) > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: (was: cassandra-2.0.11-8546.txt) > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: cassandra-2.1-8546.txt cassandra-2.0.11-8546.txt Seems I'm the only one having fun with this one. I've done some more profiling and found that when middle-inserts and searches alternate, the GapList's own binarySearch does not perform well, as it "normalizes" the GapList on each search. In this update I'm using the built-in Collections.binarySearch instead. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8547) Make RangeTombstone.Tracker.isDeleted() faster
[ https://issues.apache.org/jira/browse/CASSANDRA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8547: Attachment: cassandra-2.1-8547.txt cassandra-2.0.11-8547.txt The fix for this turned out to be much more trivial than I thought. I've attached it for both 2.0.11 and 2.1. There seems to be a silent assumption in RangeTombstone.Tracker that ranges with the same .max values do not repeat: the existing maxOrderingSet only stores one value, and its maxOrderingSet.add(t) will be discarded, but there is no check for that case. I've left the logic as it is, but I feel there might be an issue around second deletes for the same range value. > Make RangeTombstone.Tracker.isDeleted() faster > -- > > Key: CASSANDRA-8547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8547 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 >Reporter: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8547.txt, cassandra-2.1-8547.txt, > rangetombstone.tracker.txt > > > During compaction and repairs with many tombstones an exorbitant amount of > time is spend in RangeTombstone.Tracker.isDeleted(). > The amount of time spend there can be so big that compactions and repairs > look "stalled" and the time remaining time estimated frozen at the same value > for days. > Using visualvm I've been sample profiling the code during execution and both > in Compaction as well as during repairs found this. 
(point in time backtraces > attached) > Looking at the code the problem is obviously the linear scanning: > {code} > public boolean isDeleted(Column column) > { > for (RangeTombstone tombstone : ranges) > { > if (comparator.compare(column.name(), tombstone.min) >= 0 > && comparator.compare(column.name(), tombstone.max) <= 0 > && tombstone.maxTimestamp() >= column.timestamp()) > { > return true; > } > } > return false; > } > {code} > I would like to propose to change this and instead use a sorted list (e.g. > RangeTombstoneList) here instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
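The sorted-list idea the ticket proposes can be sketched in a simplified model where tombstones are plain int ranges sorted by min and assumed non-overlapping; the real code compares Composite names and also checks timestamps, which this sketch omits:

```java
public class SortedRangeCheck {
    // Simplified model: each tombstone is a {min, max} int pair; the
    // array is sorted by min and the ranges do not overlap.
    static boolean isDeleted(int[][] ranges, int name) {
        int lo = 0, hi = ranges.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (name < ranges[mid][0]) hi = mid - 1;        // left of this range
            else if (name > ranges[mid][1]) lo = mid + 1;   // right of this range
            else return true;                               // min <= name <= max
        }
        return false; // O(log N) instead of the linear scan
    }

    public static void main(String[] args) {
        int[][] ranges = {{1, 5}, {10, 15}};
        System.out.println(isDeleted(ranges, 3) + " " + isDeleted(ranges, 7));
    }
}
```

With N tombstones this is O(log N) per column instead of O(N), which is exactly the change in asymptotics the ticket asks for.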
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: (was: cassandra-2.1-8546.txt) > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: cassandra-2.1-8546.txt There was seemingly a rebase error on my side; I've uploaded the fixed version. I got confused by the existing failures here: http://cassci.datastax.com/view/cassandra-2.1/job/cassandra-2.1_utest/791/testReport/ since I get the same ones on my machine. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Assignee: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > cassandra-2.1-8546.txt, tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: cassandra-2.1-8546.txt Makes sense. I'm attaching a rebased version for 2.1 - the same issue and solution still apply. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz > Fix For: 2.1.3 > > Attachments: cassandra-2.0.11-8546.txt, cassandra-2.1-8546.txt, > tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8547) Make RangeTombstone.Tracker.isDeleted() faster
Dominic Letz created CASSANDRA-8547: --- Summary: Make RangeTombstone.Tracker.isDeleted() faster Key: CASSANDRA-8547 URL: https://issues.apache.org/jira/browse/CASSANDRA-8547 Project: Cassandra Issue Type: Improvement Components: Core Environment: 2.0.11 Reporter: Dominic Letz Attachments: rangetombstone.tracker.txt During compaction and repairs with many tombstones, an exorbitant amount of time is spent in RangeTombstone.Tracker.isDeleted(). The amount of time spent there can be so big that compactions and repairs look "stalled", with the estimated time remaining frozen at the same value for days. Using VisualVM I've been sample-profiling the code during execution and found this both in compaction as well as during repairs. (point in time backtraces attached) Looking at the code, the problem is obviously the linear scanning: {code} public boolean isDeleted(Column column) { for (RangeTombstone tombstone : ranges) { if (comparator.compare(column.name(), tombstone.min) >= 0 && comparator.compare(column.name(), tombstone.max) <= 0 && tombstone.maxTimestamp() >= column.timestamp()) { return true; } } return false; } {code} I would like to propose changing this to use a sorted list (e.g. RangeTombstoneList) instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-8546: Attachment: cassandra-2.0.11-8546.txt Attaching a patch that replaces the plain arrays with a GapList from the Brownies Java collections library (Apache 2.0 license). Passes ant test. Improves performance in worst-case scenarios like this on my notebook: 50k deletes: 7 seconds without the patch, 700ms with the patch. 100k deletes: 3 minutes without the patch, 3 seconds with the patch. > RangeTombstoneList becoming bottleneck on tombstone heavy tasks > --- > > Key: CASSANDRA-8546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 > Project: Cassandra > Issue Type: Improvement > Components: Core > Environment: 2.0.11 / 2.1 >Reporter: Dominic Letz >Priority: Critical > Fix For: 2.0.12 > > Attachments: cassandra-2.0.11-8546.txt, tombstone_test.tgz > > > I would like to propose a change of the data structure used in the > RangeTombstoneList to store and insert tombstone ranges to something with at > least O(log N) insert in the middle and at near O(1) and start AND end. Here > is why: > When having tombstone heavy work-loads the current implementation of > RangeTombstoneList becomes a bottleneck with slice queries. > Scanning the number of tombstones up to the default maximum (100k) can take > up to 3 minutes of how addInternal() scales on insertion of middle and start > elements. > The attached test shows that with 50k deletes from both sides of a range. > INSERT 1...11 > flush() > DELETE 1...5 > DELETE 11...6 > While one direction performs ok (~400ms on my notebook): > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 > {code} > The other direction underperforms (~7seconds on my notebook) > {code} > SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8546) RangeTombstoneList becoming bottleneck on tombstone heavy tasks
Dominic Letz created CASSANDRA-8546: --- Summary: RangeTombstoneList becoming bottleneck on tombstone heavy tasks Key: CASSANDRA-8546 URL: https://issues.apache.org/jira/browse/CASSANDRA-8546 Project: Cassandra Issue Type: Improvement Components: Core Environment: 2.0.11 / 2.1 Reporter: Dominic Letz Priority: Critical Fix For: 2.0.12 Attachments: tombstone_test.tgz I would like to propose a change of the data structure used in the RangeTombstoneList to store and insert tombstone ranges to something with at least O(log N) insert in the middle and near O(1) at start AND end. Here is why: With tombstone-heavy workloads the current implementation of RangeTombstoneList becomes a bottleneck with slice queries. Scanning up to the default maximum number of tombstones (100k) can take up to 3 minutes because of how addInternal() scales on insertion of middle and start elements. The attached test shows this with 50k deletes from both sides of a range. INSERT 1...11 flush() DELETE 1...5 DELETE 11...6 While one direction performs ok (~400ms on my notebook): {code} SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp DESC LIMIT 1 {code} The other direction underperforms (~7 seconds on my notebook): {code} SELECT * FROM timeseries WHERE name = 'a' ORDER BY timestamp ASC LIMIT 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7470) java.lang.AssertionError are causing cql responses to always go to streamId = 0 instead of the request sending streamId
[ https://issues.apache.org/jira/browse/CASSANDRA-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Letz updated CASSANDRA-7470: Attachment: letz.patch
Attaching a patch for the catch block to assign the streamId
> java.lang.AssertionError is causing cql responses to always go to streamId =
> 0 instead of the request's streamId
> ---
>
> Key: CASSANDRA-7470
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7470
> Project: Cassandra
> Issue Type: Bug
> Components: API
> Environment: CQL native interface
> Reporter: Dominic Letz
> Attachments: letz.patch
>
> When an unexpected assertion like the one below is triggered, ExecuteMessage.java catches it and generates a response with streamId = 0.
> When sending multiple queries at the same time on different streams, this makes it impossible to know which of those queries failed.
> {code}
> ERROR [Native-Transport-Requests:455] 2014-06-27 02:20:38,484 ErrorMessage.java (line 222) Unexpected exception during request
> java.lang.AssertionError: Added column does not sort as the last column
>     at org.apache.cassandra.db.ArrayBackedSortedColumns.addColumn(ArrayBackedSortedColumns.java:115)
>     at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:116)
>     at org.apache.cassandra.service.pager.AbstractQueryPager.discardHead(AbstractQueryPager.java:319)
>     at org.apache.cassandra.service.pager.AbstractQueryPager.discardLast(AbstractQueryPager.java:301)
>     at org.apache.cassandra.service.pager.AbstractQueryPager.discardFirst(AbstractQueryPager.java:219)
>     at org.apache.cassandra.service.pager.AbstractQueryPager.discardFirst(AbstractQueryPager.java:202)
>     at org.apache.cassandra.service.pager.AbstractQueryPager.fetchPage(AbstractQueryPager.java:124)
>     at org.apache.cassandra.service.pager.SliceQueryPager.fetchPage(SliceQueryPager.java:35)
>     at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
>     at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:62)
>     at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:158)
>     at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:309)
>     at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:132)
>     at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:304)
>     at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
>     at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> {code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
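letz.patch itself is not reproduced in this thread, so the following is only a minimal, self-contained sketch of the behaviour the patch description implies: the catch block should copy the stream id of the incoming request into the error response instead of hard-coding 0. All class and method names below are hypothetical stand-ins, not Cassandra's actual transport API.

```java
// Sketch only: hypothetical stand-ins, not Cassandra's transport types.
public class StreamIdSketch {
    static final class Request {
        final int streamId;
        final boolean failing; // simulates a query that throws server-side
        Request(int streamId, boolean failing) {
            this.streamId = streamId;
            this.failing = failing;
        }
    }

    static final class Response {
        final int streamId;
        final String body;
        Response(int streamId, String body) {
            this.streamId = streamId;
            this.body = body;
        }
    }

    // Simulated statement execution: throws for the "failing" request,
    // mirroring the unexpected AssertionError in the report.
    static Response execute(Request req) {
        if (req.failing)
            throw new RuntimeException("Added column does not sort as the last column");
        return new Response(req.streamId, "OK");
    }

    // Buggy variant: the catch block hard-codes streamId = 0,
    // so the client cannot tell which in-flight query failed.
    static Response dispatchBuggy(Request req) {
        try {
            return execute(req);
        } catch (RuntimeException e) {
            return new Response(0, "ERROR: " + e.getMessage());
        }
    }

    // Fixed variant: the error response carries the request's stream id.
    static Response dispatchFixed(Request req) {
        try {
            return execute(req);
        } catch (RuntimeException e) {
            return new Response(req.streamId, "ERROR: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        Request failing = new Request(42, true);
        System.out.println(dispatchBuggy(failing).streamId); // prints 0
        System.out.println(dispatchFixed(failing).streamId); // prints 42
    }
}
```

The actual fix in Cassandra would of course reuse the real frame/message types; the sketch only shows the one-line difference in the error path.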
[jira] [Created] (CASSANDRA-7470) java.lang.AssertionError is causing cql responses to always go to streamId = 0 instead of the request's streamId
Dominic Letz created CASSANDRA-7470:
---
Summary: java.lang.AssertionError is causing cql responses to always go to streamId = 0 instead of the request's streamId
Key: CASSANDRA-7470
URL: https://issues.apache.org/jira/browse/CASSANDRA-7470
Project: Cassandra
Issue Type: Bug
Components: API
Environment: CQL native interface
Reporter: Dominic Letz

When an unexpected assertion like the one below is triggered, ExecuteMessage.java catches it and generates a response with streamId = 0.
When sending multiple queries at the same time on different streams, this makes it impossible to know which of those queries failed.
{code}
ERROR [Native-Transport-Requests:455] 2014-06-27 02:20:38,484 ErrorMessage.java (line 222) Unexpected exception during request
java.lang.AssertionError: Added column does not sort as the last column
    at org.apache.cassandra.db.ArrayBackedSortedColumns.addColumn(ArrayBackedSortedColumns.java:115)
    at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:116)
    at org.apache.cassandra.service.pager.AbstractQueryPager.discardHead(AbstractQueryPager.java:319)
    at org.apache.cassandra.service.pager.AbstractQueryPager.discardLast(AbstractQueryPager.java:301)
    at org.apache.cassandra.service.pager.AbstractQueryPager.discardFirst(AbstractQueryPager.java:219)
    at org.apache.cassandra.service.pager.AbstractQueryPager.discardFirst(AbstractQueryPager.java:202)
    at org.apache.cassandra.service.pager.AbstractQueryPager.fetchPage(AbstractQueryPager.java:124)
    at org.apache.cassandra.service.pager.SliceQueryPager.fetchPage(SliceQueryPager.java:35)
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:62)
    at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:158)
    at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:309)
    at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:132)
    at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:304)
    at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
    at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
{code}
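Why a hard-coded streamId = 0 is harmful can be seen from how a multiplexing client correlates responses. The sketch below is a hypothetical client-side model, not any real driver's API: in-flight queries are keyed by the stream id they were sent on, so an error frame stamped with stream 0 matches nothing (or the wrong query), and the real query is left hanging.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side sketch (not a real driver API): a multiplexing
// client keeps in-flight queries keyed by the stream id they were sent on.
public class StreamCorrelation {
    private final Map<Integer, String> pending = new HashMap<>();

    // Record a query as in-flight on the given stream.
    public void send(int streamId, String query) {
        pending.put(streamId, query);
    }

    // Match a response frame back to its query by stream id;
    // returns null when no query was sent on that stream.
    public String complete(int responseStreamId) {
        return pending.remove(responseStreamId);
    }

    public static void main(String[] args) {
        StreamCorrelation client = new StreamCorrelation();
        client.send(5, "SELECT * FROM t WHERE id = '1'");
        // A mis-stamped error frame (streamId = 0) matches nothing,
        // while the real query on stream 5 never completes:
        System.out.println(client.complete(0)); // prints null
    }
}
```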
[jira] [Created] (CASSANDRA-6536) SSTable gets corrupted after keyspace drop and recreation
Dominic Letz created CASSANDRA-6536:
---
Summary: SSTable gets corrupted after keyspace drop and recreation
Key: CASSANDRA-6536
URL: https://issues.apache.org/jira/browse/CASSANDRA-6536
Project: Cassandra
Issue Type: Bug
Environment: Cassandra 1.2.12 & 1.2.13
Reporter: Dominic Letz

{code}
ERROR [ReadStage:41] 2014-01-02 14:27:00,629 CassandraDaemon.java (line 191) Exception in thread Thread[ReadStage:41,5,main]
java.lang.RuntimeException: org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: Corrupt (negative) value length encountered
{code}
This happens when running a test like the following the SECOND time:
{code}
DROP KEYSPACE testspace;
CREATE KEYSPACE testspace with REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1} AND durable_writes = false;
USE testspace;
CREATE TABLE testtable (id text PRIMARY KEY, group text) WITH compression = {'sstable_compression':'LZ4Compressor'};
CREATE INDEX testindex ON testtable (group);
INSERT INTO testtable (id, group) VALUES ('1', 'beta');
INSERT INTO testtable (id, group) VALUES ('2', 'gamma');
INSERT INTO testtable (id, group) VALUES ('3', 'delta');
INSERT INTO testtable (id, group) VALUES ('4', 'epsilon');
INSERT INTO testtable (id, group) VALUES ('5', 'alpha');
INSERT INTO testtable (id, group) VALUES ('6', 'beta');
INSERT INTO testtable (id, group) VALUES ('7', 'gamma');
INSERT INTO testtable (id, group) VALUES ('8', 'delta');
INSERT INTO testtable (id, group) VALUES ('9', 'epsilon');
INSERT INTO testtable (id, group) VALUES ('00010', 'alpha');
INSERT INTO testtable (id, group) VALUES ('00011', 'beta');
INSERT INTO testtable (id, group) VALUES ('00012', 'gamma');
INSERT INTO testtable (id, group) VALUES ('00013', 'delta');
INSERT INTO testtable (id, group) VALUES ('00014', 'epsilon');
INSERT INTO testtable (id, group) VALUES ('00015', 'alpha');
INSERT INTO testtable (id, group) VALUES ('00016', 'beta');
INSERT INTO testtable (id, group) VALUES ('00017', 'gamma');
...
INSERT INTO testtable (id, group) VALUES ('10', 'alpha');
SELECT COUNT(*) FROM testspace.testtable WHERE group = 'alpha' LIMIT 11;
{code}
-- This message was sent by Atlassian JIRA (v6.1.5#6160)