[jira] [Created] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks

2015-04-24 Thread Sergey Maznichenko (JIRA)
Sergey Maznichenko created CASSANDRA-9235:
-

 Summary: nodetool compactionstats. Negative numbers in pending 
tasks 
 Key: CASSANDRA-9235
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core
 Environment: CentOS 6.2 x64, Cassandra 2.1.4
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Priority: Minor


nodetool compactionstats
pending tasks: -8

I can see negative numbers in 'pending tasks' on all 8 nodes.
It looks like -8 plus the real number of pending tasks,
for example -22128 for 100 real pending tasks.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks

2015-04-24 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511105#comment-14511105
 ] 

Sergey Maznichenko commented on CASSANDRA-9235:
---

I use LCS.
It happens on all nodes:

pending tasks: -09007

pending tasks: -8

pending tasks: -09142
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage    493893100494   710281935201   bytes     69.53%
Active compaction remaining time :   0h03m26s

pending tasks: -17719
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage       286845775      539131720   bytes     53.21%
Active compaction remaining time :   0h00m00s

pending tasks: -21094
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage       546045136     1040249351   bytes     52.49%
Active compaction remaining time :   0h00m00s

pending tasks: -11160
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage    173527855063   754763739654   bytes     22.99%
Active compaction remaining time :   0h09m14s

pending tasks: -17961
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage       307582716      539133247   bytes     57.05%
Active compaction remaining time :   0h00m00s

pending tasks: -10946
   compaction type       keyspace          table       completed          total   unit   progress
        Compaction   archivespace   file_storage   1102055174099   6063766404241   bytes     18.17%
Active compaction remaining time :   1h18m56s

It seems to me that it began when I added another table to the keyspace.


 nodetool compactionstats. Negative numbers in pending tasks 
 

 Key: CASSANDRA-9235
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core
 Environment: CentOS 6.2 x64, Cassandra 2.1.4
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Marcus Eriksson
Priority: Minor
 Fix For: 2.1.5


 nodetool compactionstats
 pending tasks: -8
 I can see negative numbers in 'pending tasks' on all 8 nodes
 it looks like -8 + real number of pending tasks
 for example -22128 for 100 real pending tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks

2015-04-24 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511129#comment-14511129
 ] 

Sergey Maznichenko commented on CASSANDRA-9235:
---

It happened when archivespace.file_storage2 was added.

KEYSPACE description:

CREATE KEYSPACE archivespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE archivespace.files (
id uuid PRIMARY KEY,
category decimal,
created timestamp,
data blob,
filename text
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'enabled': 'true', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE archivespace.file_storage (
key text,
chunk text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'sstable_size_in_mb': '512', 'min_threshold': '4', 
'enabled': 'true', 'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 
'max_threshold': '64'}
AND compression = {}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE archivespace.file_storage2 (
key text,
chunk text,
value blob,
PRIMARY KEY (key, chunk)
) WITH CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'sstable_size_in_mb': '2048', 'min_threshold': '4', 
'enabled': 'true', 'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 
'max_threshold': '32'}
AND compression = {}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';


 nodetool compactionstats. Negative numbers in pending tasks 
 

 Key: CASSANDRA-9235
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core
 Environment: CentOS 6.2 x64, Cassandra 2.1.4
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Marcus Eriksson
Priority: Minor
 Fix For: 2.1.5


 nodetool compactionstats
 pending tasks: -8
 I can see negative numbers in 'pending tasks' on all 8 nodes
 it looks like -8 + real number of pending tasks
 for example -22128 for 100 real pending tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks

2015-04-24 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511135#comment-14511135
 ] 

Sergey Maznichenko commented on CASSANDRA-9235:
---


DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,639 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,639 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,640 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(10)-127.0.0.1] 2015-04-24 18:26:01,641 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,041 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,042 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,042 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(12)-127.0.0.1] 2015-04-24 18:26:07,043 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,059 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,059 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,060 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(6)-127.0.0.1] 2015-04-24 18:26:09,061 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for system.paxos
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,600 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,601 
LeveledManifest.java:680 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions 
to do for archivespace.file_storage
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,602 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
DEBUG [RMI TCP Connection(14)-127.0.0.1] 2015-04-24 18:26:16,602 
LeveledManifest.java:680 - Estimating [-4, -10, -100, -1000, -1, -10, 
-100, -1000, -1] compactions to do for 
archivespace.file_storage2
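
For what it's worth, here is a simplified sketch (my own illustration, not the real LeveledManifest code, and the values are only representative) of how an overflowed, negative maximum sstable size turns the per-level estimates negative, and with them the summed 'pending tasks' figure:

public class NegativeEstimateSketch {
    public static void main(String[] args) {
        // Assumption: sstable_size_in_mb = 2048 is converted to bytes through
        // 32-bit int arithmetic somewhere, wrapping to Integer.MIN_VALUE.
        long maxSSTableSizeInBytes = 2048 * 1024 * 1024;   // int math: -2147483648

        // Roughly the level size reported in the compactionstats output above.
        long bytesInLevel = 710_281_935_201L;
        long estimated = (long) Math.ceil((double) bytesInLevel / maxSSTableSizeInBytes);
        System.out.println(estimated);   // negative: the level "needs" fewer than 0 compactions

        // nodetool's "pending tasks" sums these per-table, per-level estimates,
        // so a few negative entries are enough to drive the total below zero.
    }
}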

 nodetool compactionstats. Negative numbers in pending tasks 
 

[jira] [Commented] (CASSANDRA-9235) nodetool compactionstats. Negative numbers in pending tasks

2015-04-24 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511146#comment-14511146
 ] 

Sergey Maznichenko commented on CASSANDRA-9235:
---

archivespace.file_storage2 is empty.

select * from archivespace.file_storage2;

 key | chunk | value
-+---+---

(0 rows)


 nodetool compactionstats. Negative numbers in pending tasks 
 

 Key: CASSANDRA-9235
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core
 Environment: CentOS 6.2 x64, Cassandra 2.1.4
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Marcus Eriksson
Priority: Minor
 Fix For: 2.1.5


 nodetool compactionstats
 pending tasks: -8
 I can see negative numbers in 'pending tasks' on all 8 nodes
 it looks like -8 + real number of pending tasks
 for example -22128 for 100 real pending tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9235) Max sstable size in leveled manifest is an int, creating large sstables overflows this and breaks LCS

2015-04-24 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511448#comment-14511448
 ] 

Sergey Maznichenko commented on CASSANDRA-9235:
---

OK. Thanks guys!
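
(As an aside, a minimal sketch of the arithmetic behind the new title, under the assumption that the size in bytes passes through a 32-bit int; this is an illustration, not the actual patch: the 2048 MB sstable_size_in_mb on file_storage2 no longer fits in an int once converted to bytes, while the 512 MB setting on file_storage still does.)

public class SSTableSizeIntOverflow {
    public static void main(String[] args) {
        // 512 MB (archivespace.file_storage) still fits in a signed 32-bit int.
        int fileStorageBytes = 512 * 1024 * 1024;       // 536870912
        // 2048 MB (archivespace.file_storage2) is exactly 2^31 and wraps negative.
        int fileStorage2Bytes = 2048 * 1024 * 1024;     // -2147483648
        System.out.println(fileStorageBytes);
        System.out.println(fileStorage2Bytes);

        // Doing the conversion in 64 bits avoids the wrap-around.
        long fixed = 2048L * 1024 * 1024;               // 2147483648
        System.out.println(fixed);
    }
}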

 Max sstable size in leveled manifest is an int, creating large sstables 
 overflows this and breaks LCS
 -

 Key: CASSANDRA-9235
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9235
 Project: Cassandra
  Issue Type: Bug
  Components: API, Core
 Environment: CentOS 6.2 x64, Cassandra 2.1.4
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Marcus Eriksson
 Fix For: 2.1.5

 Attachments: 0001-9235.patch


 nodetool compactionstats
 pending tasks: -8
 I can see negative numbers in 'pending tasks' on all 8 nodes
 it looks like -8 + real number of pending tasks
 for example -22128 for 100 real pending tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-03 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394352#comment-14394352
 ] 

Sergey Maznichenko commented on CASSANDRA-9092:
---

Should I provide any additional information from the failed node? I want to 
delete all hints and run repair on this node.

 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Sam Tunnicliffe
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - 
 Exception in thread Thread[HintedHandoff:1,1,main]
 java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in cassandra_crash1.txt
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-03 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394627#comment-14394627
 ] 

Sergey Maznichenko commented on CASSANDRA-9092:
---

We have OpsCenter Agent. Such errors repeat 1-2 times per hour during data 
loading. In DC1 we don't have any hints now.
I guess that traffic can go to all nodes because of client settings; I will 
check it.
I tried to perform 'nodetool repair' from the node in DC2 and, after a 30-hour 
delay, I got a bunch of errors in the console, like:

[2015-04-02 19:32:14,352] Repair session 6ff4f071-d94d-11e4-9257-f7b14a924a15 
for range (-3563451573336693456,-3535530477916720868] failed with error 
java.io.IOException: Cannot proceed on repair because a neighbor (/10.XX.XX.11) 
is dead: session failed

but 'nodetool status' reports that all nodes are live and I can see successful 
communication between nodes in their logs. It's strange...


 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
Assignee: Sam Tunnicliffe
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - 
 Exception in thread Thread[HintedHandoff:1,1,main]
 java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in cassandra_crash1.txt
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-03 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394474#comment-14394474
 ] 

Sergey Maznichenko commented on CASSANDRA-9092:
---

Consistency ONE. Clients use the DataStax Java driver.
We are writing only to DC1.

In the logs of the nodes that don't fail, we see errors and warnings during 
load:

INFO  [SharedPool-Worker-5] 2015-03-31 15:48:52,534 Message.java:532 - 
Unexpected exception during request; channel = [id: 0x48b3ad12, 
/10.77.81.33:56581 : /10.XX.XX.10:9042]
java.io.IOException: Error while read(...): Connection reset by peer
at io.netty.channel.epoll.Native.readAddress(Native Method) 
~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at 
io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.doReadBytes(EpollSocketChannel.java:675)
 ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at 
io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:714)
 ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) 
~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) 
~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]

ERROR [Thrift:15] 2015-03-31 11:54:35,163 CustomTThreadPoolServer.java:221 - 
Error occurred during processing of message.
java.lang.RuntimeException: 
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - 
received only 2 responses.
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:317) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:125) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.service.ClientState.login(ClientState.java:171) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1493) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579)
 ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563)
 ~[apache-cassandra-thrift-2.1.2.jar:2.1.2]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
~[libthrift-0.9.1.jar:0.9.1]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
~[libthrift-0.9.1.jar:0.9.1]
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:202)
 ~[apache-cassandra-2.1.2.jar:2.1.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
[na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
[na:1.7.0_71]
at java.lang.Thread.run(Unknown Source) [na:1.7.0_71]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation 
timed out - received only 2 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:103) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
 ~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1263) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1184) 
~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:262)
 ~[apache-cassandra-2.1.2.jar:2.1.2]
at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:215)
 ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:306) 
~[apache-cassandra-2.1.2.jar:2.1.2]
... 11 common frames omitted

I've changed the schema definition.
It's a periodic workload, so I will disable hinted handoff temporarily. I also 
disabled compaction for filespace.filestorage because it takes a long time and 
gives 1% efficiency.

My hints parameters are now:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 4
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 10240

I suppose Cassandra should do some kind of partial compaction if system.hints 
is big, or clean old hints before compaction. Do you have any ideas about the 
necessary changes in 2.1.5?


 Nodes in DC2 die during and after huge write workload
 

[jira] [Comment Edited] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-03 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398
 ] 

Sergey Maznichenko edited comment on CASSANDRA-9092 at 4/3/15 1:39 PM:
---

Java heap is selected automatically in cassandra-env.sh. I tried to set 
MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help, compactions continue after 
restarting the node.
nodetool truncatehints - didn't help, it showed a message like 'cannot stop 
running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and 
deleted them, which helped, and the node has been running for about 10 hours. 
Another node has 18154 files in system\hints-... (~1.1TB) and has the same 
problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into 
blobs (one big table); each file is about 3.5MB and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
key text,
chunk text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need some workaround to prevent this situation with hints. 

Now we use the default values:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 1024

Should I disable hints or increase number of threads and throughput?

For example:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240



was (Author: msb):
Java heap is selected automatically in cassandra-env.sh. I tried to set 
MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help, compactions continue after 
restarting the node.
nodetool truncatehints - didn't help, it showed a message like 'cannot stop 
running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and 
deleted them, which helped, and the node has been running for about 10 hours. 
Another node has 18154 files in system\hints-... (~1.1TB) and has the same 
problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into 
blobs (one big table); each file is about 3.5MB and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
key text,
filename text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 

[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398
 ] 

Sergey Maznichenko commented on CASSANDRA-9092:
---

Java heap is selected automatically in cassandra-env.sh. I tried to set 
MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help, compactions continue after 
restarting the node.
nodetool truncatehints - didn't help, it showed a message like 'cannot stop 
running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and 
deleted them, which helped, and the node has been running for about 10 hours. 
Another node has 18154 files in system\hints-... (~1.1TB) and has the same 
problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into 
blobs (one big table); each file is about 3.5MB and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
key text,
filename text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need some workaround to prevent this situation with hints. 

Now we use the default values:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 1024

Should I disable hints or increase number of threads and throughput?

For example:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240
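
As a back-of-the-envelope check on those two settings (my own rough arithmetic, assuming the hinted_handoff_throttle_in_kb value is the effective delivery rate to the recovering node, which glosses over how Cassandra actually splits the throttle across endpoints), draining the ~1.1 TB hint backlog mentioned above would take on the order of:

public class HintBacklogDrainEstimate {
    public static void main(String[] args) {
        double backlogBytes = 1.1e12;              // ~1.1 TB of accumulated hints
        double[] throttlesKb = {1024, 10240};      // current vs. proposed hinted_handoff_throttle_in_kb

        for (double kb : throttlesKb) {
            double seconds = backlogBytes / (kb * 1024);
            System.out.printf("%6.0f KB/s -> ~%.1f days to drain%n", kb, seconds / 86400.0);
        }
        // Roughly 12 days at 1024 KB/s versus just over a day at 10240 KB/s,
        // which is why either raising the throttle or dropping the hints matters.
    }
}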


 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Maznichenko updated CASSANDRA-9092:
--
Description: 
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - 
Exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.OutOfMemoryError: Java heap space


Full errors listing attached in cassandra_crash1.txt

The problem exists only in DC2. We have 1GbE between DC1 and DC2.




  was:
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

Full errors listing attached in cassandra_crash1.txt

The problem exists only in DC2. We have 1GbE between DC1 and DC2.





 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - 
 Exception in thread Thread[HintedHandoff:1,1,main]
 java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in cassandra_crash1.txt
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Maznichenko updated CASSANDRA-9092:
--
Description: 
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

The problem exists only in DC2. We have 1GbE between DC1 and DC2.




  was:
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

The problem exists only in DC2. We have 1GbE between DC1 and DC2.





 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Maznichenko updated CASSANDRA-9092:
--
Description: 
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

Full errors listing attached in cassandra_crash1.txt

The problem exists only in DC2. We have 1GbE between DC1 and DC2.




  was:
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

Full errors listing attached in 

The problem exists only in DC2. We have 1GbE between DC1 and DC2.





 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in cassandra_crash1.txt
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Maznichenko updated CASSANDRA-9092:
--
Description: 
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

Full errors listing attached in 

The problem exists only in DC2. We have 1GbE between DC1 and DC2.




  was:
Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
CassandraDaemon.java:153 - Exception in thread 
Thread[CompactionExecutor:1,1,main]
java.lang.OutOfMemoryError: Java heap space

The problem exists only in DC2. We have 1GbE between DC1 and DC2.





 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in 
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398
 ] 

Sergey Maznichenko edited comment on CASSANDRA-9092 at 4/2/15 10:19 AM:


Java heap is selected automatically in cassandra-env.sh. I tried to set 
MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help, compactions continue after 
restarting the node.
nodetool truncatehints - didn't help, it showed a message like 'cannot stop 
running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and 
deleted them, which helped, and the node has been running for about 10 hours. 
Another node has 18154 files in system\hints-... (~1.1TB) and has the same 
problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into 
blobs (one big table); each file is about 3.5MB and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
key text,
filename text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

nodetool status filespace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12   4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13   3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10   4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11   4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137  4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136  1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139  4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138  3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1

I need some workaround to prevent this situation with hints. 

Now we use the default values:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 1080
hinted_handoff_throttle_in_kb: 1024

Should I disable hints or increase number of threads and throughput?

For example:

hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 10800
hinted_handoff_throttle_in_kb: 10240



was (Author: msb):
Java heap is selected automatically in cassandra-env.sh. I tried to set 
MAX_HEAP_SIZE=8G, NEW_HEAP_SIZE=800M, but it didn't help.

nodetool disableautocompaction - didn't help, compactions continue after 
restarting the node.
nodetool truncatehints - didn't help, it showed a message like 'cannot stop 
running hint compaction'.

One of the nodes had ~24000 files in system\hints-...; I stopped the node and 
deleted them, which helped, and the node has been running for about 10 hours. 
Another node has 18154 files in system\hints-... (~1.1TB) and has the same 
problem; I'm leaving it for experiments.

Workload: 20-40 processes on application servers, each one loading files into 
blobs (one big table); each file is about 3.5MB and the key is a UUID.

CREATE KEYSPACE filespace WITH replication = {'class': 
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'}  AND durable_writes = true;

CREATE TABLE filespace.filestorage (
key text,
filename text,
value blob,
PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (chunk ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{keys:ALL, rows_per_partition:NONE}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32'}
AND compression = {'sstable_compression': 

[jira] [Commented] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-02 Thread Sergey Maznichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392777#comment-14392777
 ] 

Sergey Maznichenko commented on CASSANDRA-9092:
---

The node reproduces this error every time it attempts to compact 
system.hints. I tried MAX_HEAP_SIZE=16G; it didn't help.
The workaround is manually deleting the system.hints files and restarting the 
node, but we have a chance to investigate this error in order to fix it in 
future releases.
Any suggestions?


 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Fix For: 2.1.5

 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 Stops, means ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
 CassandraDaemon.java:153 - Exception in thread 
 Thread[CompactionExecutor:1,1,main]
 java.lang.OutOfMemoryError: Java heap space
 Full errors listing attached in cassandra_crash1.txt
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9092) Nodes in DC2 die during and after huge write workload

2015-04-01 Thread Sergey Maznichenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Maznichenko updated CASSANDRA-9092:
--
Summary: Nodes in DC2 die during and after huge write workload  (was: Nodes 
in DC2 dies during and after huge write workload)

 Nodes in DC2 die during and after huge write workload
 -

 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
Reporter: Sergey Maznichenko
 Attachments: cassandra_crash1.txt


 Hello,
 We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
 Node is VM 8 CPU, 32GB RAM
 During significant workload (loading several millions blobs ~3.5MB each), 1 
 node in DC2 stops and after some time next 2 nodes in DC2 also stops.
 Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. 
 I see many files in system.hints table and error appears in 2-3 minutes after 
 starting system.hints auto compaction.
 The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9092) Nodes in DC2 dies during and after huge write workload

2015-04-01 Thread Sergey Maznichenko (JIRA)
Sergey Maznichenko created CASSANDRA-9092:
-

 Summary: Nodes in DC2 dies during and after huge write workload
 Key: CASSANDRA-9092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
 Project: Cassandra
  Issue Type: Bug
 Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
java version 1.7.0_71
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

Reporter: Sergey Maznichenko
 Attachments: cassandra_crash1.txt

Hello,

We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
Node is VM 8 CPU, 32GB RAM
During significant workload (loading several millions blobs ~3.5MB each), 1 
node in DC2 stops and after some time next 2 nodes in DC2 also stops.
Now, 2 of nodes in DC2 do not work and stops after 5-10 minutes after start. I 
see many files in system.hints table and error appears in 2-3 minutes after 
starting system.hints auto compaction.

The problem exists only in DC2. We have 1GbE between DC1 and DC2.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)