[jira] [Created] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks
Sergey Maznichenko created CASSANDRA-9235:
---------------------------------------------

             Summary: nodetool compactionstats. Negative numbers in pending tasks
                 Key: CASSANDRA-9235
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
             Project: Cassandra
          Issue Type: Bug
          Components: API, Core
         Environment: CentOS 6.2 x64, Cassandra 2.1.4
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
            Reporter: Sergey Maznichenko
            Priority: Minor

nodetool compactionstats
pending tasks: -8

I can see negative numbers in 'pending tasks' on all 8 nodes. It looks like -8 + the real number of pending tasks, for example -22128 for 100 real pending tasks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511105#comment-14511105 ]

Sergey Maznichenko commented on CASSANDRA-9235:
-----------------------------------------------

I use LCS. It happens on all nodes:

pending tasks: -09007

pending tasks: -8

pending tasks: -09142
   compaction type   keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage   493893100494   710281935201   bytes     69.53%
Active compaction remaining time :   0h03m26s

pending tasks: -17719
   compaction type   keyspace          table       completed       total   unit   progress
        Compaction   archivespace   file_storage   286845775   539131720   bytes     53.21%
Active compaction remaining time :   0h00m00s

pending tasks: -21094
   compaction type   keyspace          table       completed        total   unit   progress
        Compaction   archivespace   file_storage   546045136   1040249351   bytes     52.49%
Active compaction remaining time :   0h00m00s

pending tasks: -11160
   compaction type   keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage   173527855063   754763739654   bytes     22.99%
Active compaction remaining time :   0h09m14s

pending tasks: -17961
   compaction type   keyspace          table       completed       total   unit   progress
        Compaction   archivespace   file_storage   307582716   539133247   bytes     57.05%
Active compaction remaining time :   0h00m00s

pending tasks: -10946
   compaction type   keyspace          table       completed           total   unit   progress
        Compaction   archivespace   file_storage   1102055174099   6063766404241   bytes     18.17%
Active compaction remaining time :   1h18m56s

It seems to me that it began when I added another table to the keyspace.

nodetool compactionstats. Negative numbers in pending tasks
-----------------------------------------------------------
                 Key: CASSANDRA-9235
            Reporter: Sergey Maznichenko
            Assignee: Marcus Eriksson
            Priority: Minor
             Fix For: 2.1.5
[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511129#comment-14511129 ]

Sergey Maznichenko commented on CASSANDRA-9235:
-----------------------------------------------

It happened when archivespace.file_storage2 had been added.

KEYSPACE description:

CREATE KEYSPACE archivespace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE archivespace.files (
    id uuid PRIMARY KEY,
    category decimal,
    created timestamp,
    data blob,
    filename text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'enabled': 'true', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE archivespace.file_storage (
    key text,
    chunk text,
    value blob,
    PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '512', 'min_threshold': '4', 'enabled': 'true', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '64'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE archivespace.file_storage2 (
    key text,
    chunk text,
    value blob,
    PRIMARY KEY (key, chunk)
) WITH CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '2048', 'min_threshold': '4', 'enabled': 'true', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '32'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
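The one setting that meaningfully separates the two LCS tables above is sstable_size_in_mb: 512 for file_storage (no negative estimates) versus 2048 for file_storage2 (negative estimates). Converting 2048 MB to bytes in 32-bit int arithmetic wraps exactly at the 2 GB boundary; a minimal illustration of the wrap (plain Java, not the actual Cassandra code):

```java
// Illustrates where an int-based MB-to-bytes conversion wraps negative.
public class SstableSizeWrap {
    // 32-bit multiply: wraps once mb * 1024 * 1024 exceeds Integer.MAX_VALUE
    static int toBytesInt(int mb) {
        return mb * 1024 * 1024;
    }

    public static void main(String[] args) {
        for (int mb : new int[] {512, 1024, 2047, 2048}) {
            System.out.println(mb + " MB -> " + toBytesInt(mb) + " bytes");
        }
        // 512 MB  ->   536870912 bytes
        // 2047 MB ->  2146435072 bytes (largest whole-MB value that still fits)
        // 2048 MB -> -2147483648 bytes (2^31 wraps to Integer.MIN_VALUE)
    }
}
```

So any sstable_size_in_mb of 2048 or more silently turns the size limit negative on this code path.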
[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511135#comment-14511135 ]

Sergey Maznichenko commented on CASSANDRA-9235:
-----------------------------------------------

DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,639 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,639 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,640 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,042 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,042 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,043 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,059 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,059 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,060 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for system.paxos
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,601 LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for archivespace.file_storage
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,602 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,602 LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, -100, -1000, -1] compactions to do for archivespace.file_storage2
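Negative per-level estimates such as [-4, -10, -100, -1000, ...] in these logs are consistent with an overflowed (negative) max sstable size being used as a divisor in the per-level math. A simplified model of that sign flip — estimatedTasks() is a hypothetical sketch, not the real LeveledManifest code:

```java
// Simplified model of a per-level pending-compaction estimate, to show how a
// negative (overflowed) max sstable size flips the sign of the result.
// estimatedTasks() is a hypothetical helper, not the actual Cassandra logic.
public class NegativeEstimateSketch {
    static long estimatedTasks(long bytesInLevel, long maxBytesForLevel, long maxSSTableSizeBytes) {
        long excess = bytesInLevel - maxBytesForLevel; // bytes above the level's cap
        if (excess <= 0)
            return 0;
        return excess / maxSSTableSizeBytes; // negative divisor => negative "tasks"
    }

    public static void main(String[] args) {
        long wrapped = -2147483648L; // 2048 MB after the 32-bit int wrap
        // An empty level with a negative cap looks "over capacity", and the
        // negative divisor makes the estimate negative:
        System.out.println(estimatedTasks(0L, wrapped * 10, wrapped));  // -10
        System.out.println(estimatedTasks(0L, wrapped * 100, wrapped)); // -100
    }
}
```

This also matches the observation in the next comment that file_storage2 is empty: the negative estimates depend only on the configured size, not on any data in the table.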
[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511146#comment-14511146 ]

Sergey Maznichenko commented on CASSANDRA-9235:
-----------------------------------------------

archivespace.file_storage2 is empty.

select * from archivespace.file_storage2;

 key | chunk | value
-----+-------+-------

(0 rows)
[jira] [Commented] (CASSANDRA-9235) Max sstable size in leveled manifest is an int, creating large sstables overflows this and breaks LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511448#comment-14511448 ]

Sergey Maznichenko commented on CASSANDRA-9235:
-----------------------------------------------

OK. Thanks guys!

Max sstable size in leveled manifest is an int, creating large sstables overflows this and breaks LCS
-----------------------------------------------------------------------------------------------------
                 Key: CASSANDRA-9235
            Reporter: Sergey Maznichenko
            Assignee: Marcus Eriksson
             Fix For: 2.1.5
         Attachments: 0001-9235.patch
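The retitled summary names the root cause: the leveled manifest held the max sstable size in an int, and 2048 MB in bytes is exactly 2^31. The shape of fix this implies is widening the MB-to-bytes conversion to 64-bit before it can overflow — the following is an illustrative sketch only, not the contents of the attached 0001-9235.patch:

```java
// Widening the MB-to-bytes conversion to long, so 2048 MB no longer wraps.
// Illustrative sketch; the actual change is in the attached 0001-9235.patch.
public class WidenToLong {
    static long maxSSTableSizeInBytes(int maxSSTableSizeInMB) {
        // The 1024L literal forces 64-bit arithmetic for the whole expression,
        // so the product never passes through a 32-bit intermediate.
        return maxSSTableSizeInMB * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(maxSSTableSizeInBytes(2048)); // 2147483648, not -2147483648
    }
}
```

With a long result, the per-level caps and task estimates stay positive for any realistic sstable_size_in_mb.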
[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394352#comment-14394352 ]

Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------

Should I provide any additional information from the failed node? I want to delete all hints and run repair on this node.

Nodes in DC2 die during and after huge write workload
-----------------------------------------------------
                 Key: CASSANDRA-9092
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
             Project: Cassandra
          Issue Type: Bug
         Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, java version 1.7.0_71
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
            Reporter: Sergey Maznichenko
            Assignee: Sam Tunnicliffe
             Fix For: 2.1.5
         Attachments: cassandra_crash1.txt

Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2. Each node is a VM with 8 CPUs and 32GB RAM.

During a significant workload (loading several million blobs, ~3.5MB each), 1 node in DC2 stops, and after some time the next 2 nodes in DC2 also stop. Now, 2 of the nodes in DC2 do not work and stop 5-10 minutes after start. I see many files in the system.hints table, and the error appears 2-3 minutes after system.hints auto compaction starts.

'Stops' means:

ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space

A full error listing is attached in cassandra_crash1.txt.

The problem exists only in DC2. We have 1GbE between DC1 and DC2.
[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394627#comment-14394627 ]

Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------

We have an OpsCenter agent. Such errors repeat 1-2 times per hour during data load. In DC1 we currently don't have any hints. I guess that traffic can go to all nodes because of client settings; I will check it.

I had tried to perform 'nodetool repair' from the node in DC2, and after a 30-hour delay I got a bunch of errors in the console, like:

[2015-04-02 19:32:14,352] Repair session 6ff4f071-d94d-11e4-9257-f7b14a924a15 for range (-3563451573336693456,-3535530477916720868] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/10.XX.XX.11) is dead: session failed

but 'nodetool status' reports that all nodes are live, and I can see successful communication between nodes in their logs. It's strange...
[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394474#comment-14394474 ]

Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------

Consistency ONE. Clients use the DataStax Java driver. We are writing only to DC1.

In the logs of the nodes which don't fail, we have errors and warnings during load:

INFO [SharedPool-Worker-5] 2015-03-31 15:48:52,534 Message.java:532 - Unexpected exception during request; channel = [id: 0x48b3ad12, /10.77.81.33:56581 : /10.XX.XX.10:9042]
java.io.IOException: Error while read(...): Connection reset by peer
	at io.netty.channel.epoll.Native.readAddress(Native Method) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
	at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]

ERROR [Thrift:15] 2015-03-31 11:54:35,163 CustomTThreadPoolServer.java:221 - Error occurred during processing of message.
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:317) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:125) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.service.ClientState.login(ClientState.java:171) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1493) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579) ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
	at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563) ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:202) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.7.0_71]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.7.0_71]
	at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.
	at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:103) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1263) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1184) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:262) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:215) ~[apache-cassandra-2.1.2.jar:2.1.2]
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:306) ~[apache-cassandra-2.1.2.jar:2.1.2]
	... 11 common frames omitted

I've changed the schema definition. It's a periodic workload, so I will disable hinted handoff temporarily. Also, I disabled compaction for filespace.filestorage because it takes a long time and gives 1% efficiency.

My hints parameters now:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 4
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 10240

I suppose Cassandra should do some kind of partial compaction if system.hints is big, or clean old hints before compaction. Do you have an idea about the necessary changes in 2.1.5?

Nodes in DC2 die during and after huge write workload
[jira] [Comment Edited] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398 ]

Sergey Maznichenko edited comment on CASSANDRA-9092 at 4/3/15 1:39 PM:
-----------------------------------------------------------------------

The Java heap is selected automatically in cassandra-env.sh. I tried to set MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help; compactions continue after restarting the node.
nodetool truncatehints - didn't help; it showed a message like 'cannot stop running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and deleted them, which helped, and the node has been running for about 10 hours. The other node has 18154 files in system\hints-... (~1.1TB) and has the same problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into blobs (one big table); the size of each file is about 3.5MB, key - UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
    key text,
    chunk text,
    value blob,
    PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1

Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need some workaround to prevent this situation with hints. Now we use default values for:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 1024

Should I disable hints or increase the number of threads and throughput? For example:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240
[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Maznichenko updated CASSANDRA-9092: -- Description: Hello, We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2. Node is VM 8 CPU, 32GB RAM During significant workload (loading several millions blobs ~3.5MB each), 1 node in DC2 stops and after some time next 2 nodes in DC2 also stops. Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I see many files in system.hints table and error appears in 2-3 minutes after starting system.hints auto compaction. Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:1,1,main] java.lang.OutOfMemoryError: Java heap space ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[HintedHandoff:1,1,main] java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space Full errors listing attached in cassandra_crash1.txt The problem exists only in DC2. We have 1GbE between DC1 and DC2. was: Hello, We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2. Node is VM 8 CPU, 32GB RAM During significant workload (loading several millions blobs ~3.5MB each), 1 node in DC2 stops and after some time next 2 nodes in DC2 also stops. Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I see many files in system.hints table and error appears in 2-3 minutes after starting system.hints auto compaction. Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:1,1,main] java.lang.OutOfMemoryError: Java heap space Full errors listing attached in cassandra_crash1.txt The problem exists only in DC2. We have 1GbE between DC1 and DC2. 
Nodes in DC2 die during and after huge write workload
-
Key: CASSANDRA-9092
URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
Project: Cassandra
Issue Type: Bug
Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, java version 1.7.0_71, Java(TM) SE Runtime Environment (build 1.7.0_71-b14), Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Fix For: 2.1.5
Attachments: cassandra_crash1.txt

Hello,

We have Cassandra 2.1.2 with 8 nodes: 4 in DC1 and 4 in DC2. Each node is a VM with 8 CPUs and 32 GB RAM.

During a significant write workload (loading several million blobs, ~3.5 MB each), one node in DC2 stops, and after some time the next two nodes in DC2 also stop. Now, two of the nodes in DC2 do not work and stop 5-10 minutes after starting. I see many files in the system.hints table, and the error appears 2-3 minutes after the system.hints auto-compaction starts. "Stops" means:

ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[CompactionExecutor:1,1,main] java.lang.OutOfMemoryError: Java heap space
ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - Exception in thread Thread[HintedHandoff:1,1,main] java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space

The full error listing is attached in cassandra_crash1.txt.

The problem exists only in DC2. We have 1GbE between DC1 and DC2.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392398#comment-14392398 ] Sergey Maznichenko edited comment on CASSANDRA-9092 at 4/2/15 10:19 AM:

The Java heap is sized automatically in cassandra-env.sh. I tried to set MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help; compactions continue after restarting the node.
nodetool truncatehints - didn't help; it showed a message like 'cannot stop running hint compaction'.

One of the nodes had ~24000 files in system/hints-...; I stopped the node and deleted them, which helped, and the node has been running for about 10 hours. Another node has 18154 files in system/hints-... (~1.1TB) and has the same problem; I am leaving it for experiments.

Workload: 20-40 processes on application servers, each loading files into blobs (one big table); each file is about 3.5MB, keyed by UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'} AND durable_writes = true;

CREATE TABLE filespace.filestorage (
    key text,
    filename text,
    value blob,
    PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{keys:ALL, rows_per_partition:NONE}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace

Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1

Datacenter: DC2
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need some workaround to prevent this situation with hints. Now we use the default values:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 1024

Should I disable hints, or increase the number of threads and the throughput? For example:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240

was (Author: msb): (same comment, before editing)
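For reference, the hint settings discussed in the comment live in cassandra.yaml. A sketch of the proposed change, using the exact values from the comment (not a recommendation):

```yaml
# cassandra.yaml -- hint-related settings, values copied from the comment above.
# Note: max_hint_window_in_ms is in milliseconds; the stock 2.1 default is
# 10800000 (3 hours), so 10800 here would be only ~11 seconds.
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240
```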
[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392777#comment-14392777 ] Sergey Maznichenko commented on CASSANDRA-9092:
---
The node reproduces this error every time it attempts to compact system.hints. I tried MAX_HEAP_SIZE=16G; it didn't help. The workaround is manually deleting the system.hints files and restarting the node, but we have a chance to investigate this error in order to fix it in future releases. Any suggestions?
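The manual workaround described above (deleting accumulated hint SSTables while the node is stopped) can be sketched as a small script. The default path is an assumption here; the actual hints directory depends on data_file_directories in cassandra.yaml, and the node must be fully stopped before running it:

```shell
# Sketch of the manual hints-cleanup workaround. Moves (rather than deletes)
# the accumulated hint SSTables aside, so they can be inspected or restored.
# Run only while the Cassandra process is stopped.
move_hints() {
  hints_dir="$1"    # e.g. /var/lib/cassandra/data/system/hints (assumption; check data_file_directories)
  backup_dir="$2"   # scratch location with enough free space
  mkdir -p "$backup_dir"
  count=0
  for f in "$hints_dir"/*; do
    [ -e "$f" ] || continue      # skip when the directory is empty
    mv "$f" "$backup_dir/"
    count=$((count + 1))
  done
  echo "moved $count files from $hints_dir to $backup_dir"
}
```

Once the node has restarted cleanly and hints are under control, the backup directory can be dropped.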
[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload
[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Maznichenko updated CASSANDRA-9092:
--
Summary: Nodes in DC2 die during and after huge write workload (was: Nodes in DC2 dies during and after huge write workload)
[jira] [Created] (CASSANDRA-9092) Nodes in DC2 dies during and after huge write workload
Sergey Maznichenko created CASSANDRA-9092:
-
Summary: Nodes in DC2 dies during and after huge write workload
Key: CASSANDRA-9092
URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
Project: Cassandra
Issue Type: Bug
Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, java version 1.7.0_71, Java(TM) SE Runtime Environment (build 1.7.0_71-b14), Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Attachments: cassandra_crash1.txt

Hello,

We have Cassandra 2.1.2 with 8 nodes: 4 in DC1 and 4 in DC2. Each node is a VM with 8 CPUs and 32 GB RAM.

During a significant write workload (loading several million blobs, ~3.5 MB each), one node in DC2 stops, and after some time the next two nodes in DC2 also stop. Now, two of the nodes in DC2 do not work and stop 5-10 minutes after starting. I see many files in the system.hints table, and the error appears 2-3 minutes after the system.hints auto-compaction starts.

The problem exists only in DC2. We have 1GbE between DC1 and DC2.