[ https://issues.apache.org/jira/browse/CASSANDRA-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945623#comment-13945623 ]
Jonathan Ellis commented on CASSANDRA-6918:
-------------------------------------------

[~agoodrich] [~redpriest] does it log "Compacting large row" before the exception?

> Compaction Assert: Incorrect Row Data Size
> ------------------------------------------
>
>                 Key: CASSANDRA-6918
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6918
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 11 node Linux Cassandra 1.2.15 cluster, each node configured as follows:
>                      2P Intel Xeon CPU X5660 @ 2.8 GHz (12 cores, 24 threads total)
>                      148 GB RAM
>                      CentOS release 6.4 (Final)
>                      2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
>                      Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
>                      Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
>                      Node configuration: default cassandra.yaml settings for the most part, with the following exception:
>                      rpc_server_type: hsha
>            Reporter: Alexander Goodrich
>             Fix For: 1.2.16
>
>
> I have four tables in a schema with replication factor 6 (previously we set this to 3, but when we added more nodes we figured more replication would improve read times; this may have aggravated the issue).
> create table table_value_one (
>     id timeuuid PRIMARY KEY,
>     value_1 counter
> );
>
> create table table_value_two (
>     id timeuuid PRIMARY KEY,
>     value_2 counter
> );
>
> create table table_position_lookup (
>     value_1 bigint,
>     value_2 bigint,
>     id timeuuid,
>     PRIMARY KEY (id)
> ) WITH compaction={'class': 'LeveledCompactionStrategy'};
>
> create table sorted_table (
>     row_key_index text,
>     range bigint,
>     sorted_value bigint,
>     id timeuuid,
>     extra_data list<bigint>,
>     PRIMARY KEY ((row_key_index, range), sorted_value, id)
> ) WITH CLUSTERING ORDER BY (sorted_value DESC) AND
>     compaction={'class': 'LeveledCompactionStrategy'};
>
> The application creates an object and stores it in sorted_table based on a value position - for example, an object has a value_1 of 5500 and a value_2 of 4300.
>
> There are rows which represent indices by which I can sort items on these values in descending order. If I wish to see items with the highest values of value_1, I can create an index that stores them like so:
>
> row_key_index = 'highest_value_1s'
>
> Additionally, we shard each row into bucket ranges - simply value_1 or value_2 floored to a multiple of 1000. For example, our object above would be found in row_key_index = 'highest_value_1s' with range 5000, and also in row_key_index = 'highest_value_2s' with range 4000.
>
> The true values of this object are stored in two counter tables, table_value_one and table_value_two. The current indexed position is stored in table_position_lookup.
>
> We allow the application to modify value_1 and value_2 in the counter tables indiscriminately. If we know the current values are dirty, we wait a tuned amount of time before updating the position in the sorted_table index. This creates 2 delete operations and 2 write operations on the same table.
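The bucketing scheme described above can be sketched as follows. This is an illustration only, not the reporter's actual application code; the helper names are hypothetical, and the floor-to-a-multiple-of-1000 rule is inferred from the 5500 -> range 5000 example:

```python
BUCKET_SIZE = 1000  # assumed bucket width, per "value_1 or value_2 / 1000"

def bucket_range(value: int) -> int:
    """Floor a value to its 1000-wide bucket, e.g. 5500 -> 5000."""
    return (value // BUCKET_SIZE) * BUCKET_SIZE

def partition_key(index_name: str, value: int) -> tuple:
    """Composite partition key (row_key_index, range) for sorted_table."""
    return (index_name, bucket_range(value))

# The example object with value_1=5500 and value_2=4300 lands in two
# partitions: ('highest_value_1s', 5000) and ('highest_value_2s', 4000).
print(partition_key("highest_value_1s", 5500))
print(partition_key("highest_value_2s", 4300))
```

Each indexed object therefore fans out to one partition per index, which is consistent with the ~500 unique row keys per object reported below.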
> The issue is that when we expand the number of write/delete operations on sorted_table, we see the following assert in the system log:
>
> ERROR [CompactionExecutor:169] 2014-03-24 08:07:12,871 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:169,1,main]
> java.lang.AssertionError: incorrect row data size 77705872 written to /var/lib/cassandra/data/loadtest_1/sorted_table/loadtest_1-sorted_table-tmp-ic-165-Data.db; correct is 77800512
>     at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
>     at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
>     at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
>     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
>     at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:724)
>
> Each object creates approximately ~500 unique row keys in sorted_table, and each has an extra_data field containing approximately 15 different bigint values.
>
> Previously, our application was running Cassandra 1.2.10 and we did not see the assert; at that time sorted_table did not have the "extra_data list<bigint>" column, and we were writing only around ~200 unique row keys containing just the id column.
> We tried both leveled compaction and size-tiered compaction and both cause the same assert - compaction fails to happen, and after about 100k object writes (creating 55 million rows, each potentially having as many as 100k items in a single column), we have ~2.4 GB of SSTables spread across 4840 files, and 691 SSTables:
>
>     SSTable count: 691
>     SSTables in each level: [685/4, 6, 0, 0, 0, 0, 0, 0, 0]
>     Space used (live): 2244774352
>     Space used (total): 2251159892
>     SSTable Compression Ratio: 0.15101393198465862
>     Number of Keys (estimate): 4704128
>     Memtable Columns Count: 0
>     Memtable Data Size: 0
>     Memtable Switch Count: 264
>     Read Count: 9204
>     Read Latency: NaN ms.
>     Write Count: 10151343
>     Write Latency: NaN ms.
>     Pending Tasks: 0
>     Bloom Filter False Positives: 0
>     Bloom Filter False Ratio: 0.00000
>     Bloom Filter Space Used: 3500496
>     Compacted row minimum size: 125
>     Compacted row maximum size: 62479625
>     Compacted row mean size: 1285302
>     Average live cells per slice (last five minutes): 1001.0
>     Average tombstones per slice (last five minutes): 8566.5
>
> Some mitigation strategies we have discussed include:
> * Breaking sorted_table into multiple column families to spread the writes between them.
> * Increasing the coalescing time delay.
> * Removing extra_data and paying the cost of another table lookup for each item.
> * Compressing extra_data into a blob.
> * Reducing the replication factor back down to 3 to reduce size pressure on the SSTables.
>
> Running nodetool repair -pr does not fix the issue. Running nodetool compact manually has not solved the issue either. The asserts happen pretty frequently across all nodes of the cluster.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
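One of the mitigations listed above - compressing extra_data into a blob - could look roughly like the sketch below: the ~15 bigints are packed into a single binary cell instead of a list<bigint> column, trading CQL-level access to individual elements for fewer cells per row. This is only an illustration of the idea (the function names are hypothetical, and it uses fixed-width packing rather than compression proper):

```python
import struct

def pack_extra_data(values):
    """Pack a list of signed 64-bit ints into one big-endian blob."""
    return struct.pack(f">{len(values)}q", *values)

def unpack_extra_data(blob):
    """Recover the bigint list from the blob (8 bytes per value)."""
    count = len(blob) // 8
    return list(struct.unpack(f">{count}q", blob))

data = [7, 42, -1, 2 ** 40]
blob = pack_extra_data(data)
assert unpack_extra_data(blob) == data
# 15 bigints become one 120-byte cell instead of 15 list elements.
```

An actual compressed variant could wrap the packed bytes with zlib before writing; either way the column type in the schema would change from list<bigint> to blob.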