[ https://issues.apache.org/jira/browse/CASSANDRA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107408#comment-13107408 ]
Thibaut commented on CASSANDRA-3226: ------------------------------------ All Client actions get timeoutexceptions. Data is distributed at consistency level 3. I'm writing and reading at quorum level. I'm using an ordered partioner. Our application can't do any reads/writes until I kill -9 that specific node. It just hangs. It's not a matter of seconds, it seems to never recover until I kill the node. Then all the pending operations from all the other nodes disappear in a matter of seconds/minutes and our application recovers. I didn't test manual reading/writing single keys at different ranges. I didn't see any OOM in the logs. I'm starting cassandra with -XX:+HeapDumpOnOutOfMemoryError. Running nodetool info showed a very high heap usage (not normal) which caused me to believe there might have been an OOM. > Single faulty node brings down entire cluster. No reads/writes possible > ----------------------------------------------------------------------- > > Key: CASSANDRA-3226 > URL: https://issues.apache.org/jira/browse/CASSANDRA-3226 > Project: Cassandra > Issue Type: Bug > Components: Core > Affects Versions: 0.8.5 > Environment: linux > Reporter: Thibaut > Attachments: jstack > > > No client is able to read anything from the entire cluster anymore. This > occured a few times so far, but I can't reproduce it. > Looks like an OOM directly after starting up the node? Restarting the node > "solves" the issue. I also have to kill the node with -9 because normal kill > won't kill the node. > Healty nodes: > *.13: > Mode: Normal > Not sending any streams. > Not receiving any streams. > Pool Name Active Pending Completed > Commands n/a 15237 416868 > Responses n/a 0 126721 > *.14. > Mode: Normal > Not sending any streams. > Not receiving any streams. > Pool Name Active Pending Completed > Commands n/a 15387 437325 > Responses n/a 0 131066 > *.15: > Mode: Normal > Not sending any streams. > Not receiving any streams. > Pool Name Active Pending Completed > Commands n/a 15530 368771 > Responses n/a 0 145168 > etc... The pending commands at the healty nodes all increase. > Faulty node before restart: > /software/cassandra/bin/nodetool -h localhost info > Token : f33 > Gossip active : true > Load : 130.67 GB > Generation No : 1316197687 > Uptime (seconds) : 137051 > Heap Memory (MB) : 3580.22 / 3614.00 > Data Center : datacenter1 > Rack : rack1 > Exceptions : 108 > /software/cassandra/bin/nodetool -h localhost netstats > Mode: Normal > Not sending any streams. > Not receiving any streams. > Pool Name Active Pending Completed > Commands n/a 0 29696566 > Responses n/a 560 26650981 > Log excerpt: > INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node > /192.168.0.11 has restarted, now UP again > INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681) > InetAddress /192.168.0.11 is now UP > INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815) > Node /192.168.0.11 state jump to normal > INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815) > Node /192.168.0.11 state jump to normal > INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815) > Node /192.168.0.6 state jump to normal > INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681) > InetAddress /192.168.0.6 is now UP > INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815) > Node /192.168.0.18 state jump to normal > INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681) > InetAddress /192.168.0.18 is now UP > INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node > /192.168.0.1 has restarted, now UP again > INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681) > InetAddress /192.168.0.1 is now UP > INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815) > Node /192.168.0.1 state jump to normal > INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.8 > INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.8 > INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.7 > INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.7 > INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.20 > INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.20 > INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.9 > INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.9 > INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.3 > INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3 > INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.17 > INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17 > INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.11 > INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11 > INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.6 > INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6 > ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line > 139) Fatal exception in thread Thread[Thread-549,5,main] > java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut > down > at > org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) > at > org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133) > INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.3 > INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3 > INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.17 > INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17 > INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.11 > INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11 > INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.6 > INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6 > ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line > 139) Fatal exception in thread Thread[Thread-549,5,main] > java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut > down > at > org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) > at > org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133) > INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.18 > INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.18 > ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line > 139) Fatal exception in thread Thread[Thread-554,5,main] > java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut > down > at > org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) > at > org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133) > ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line > 139) Fatal exception in thread Thread[Thread-542,5,main] > java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut > down > at > org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658) > at > org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490) > at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133) > INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java > (line 323) Started hinted handoff for endpoint /192.168.0.1 > INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java > (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.1 > ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line > 139) Fatal exception in thread Thread[Thread-535,5,main] > If I restart the node (kill -9), everything works fine again. No OOM!. > /software/scripts# /software/cassandra/bin/nodetool -h localhost netstats > Mode: Normal > Not sending any streams. > Not receiving any streams. > Pool Name Active Pending Completed > Commands n/a 0 131351 > Responses n/a 0 190697 > /software/scripts# /software/cassandra/bin/nodetool -h localhost info > Token : f33 > Gossip active : true > Load : 88.54 GB > Generation No : 1316334645 > Uptime (seconds) : 494 > Heap Memory (MB) : 1582.95 / 3614.00 > Data Center : datacenter1 > Rack : rack1 > Exceptions : 0 > /software/scripts# /software/cassandra/bin/nodetool -h localhost ring > Address DC Rack Status State Load Owns > Token > > ffffffffffffffff > 192.168.0.1 datacenter1 rack1 Up Normal 85.2 GB 0.02% > 0cc > 192.168.0.2 datacenter1 rack1 Up Normal 86.94 GB 0.02% > 199 > 192.168.0.3 datacenter1 rack1 Up Normal 85.24 GB 0.02% > 266 > 192.168.0.4 datacenter1 rack1 Up Normal 86.38 GB 0.02% > 333 > 192.168.0.5 datacenter1 rack1 Up Normal 86.96 GB 0.02% > 400 > 192.168.0.6 datacenter1 rack1 Up Normal 86.17 GB 0.02% > 4cc > 192.168.0.7 datacenter1 rack1 Up Normal 83.88 GB 0.02% > 599 > 192.168.0.8 datacenter1 rack1 Up Normal 84.42 GB 0.02% > 666 > 192.168.0.9 datacenter1 rack1 Up Normal 85.06 GB 0.02% > 733 > 192.168.0.10 datacenter1 rack1 Up Normal 83.08 GB 0.02% > 7ff > 192.168.0.11 datacenter1 rack1 Up Normal 86.22 GB 0.02% > 8cc > 192.168.0.12 datacenter1 rack1 Up Normal 85.94 GB 0.02% > 999 > 192.168.0.13 datacenter1 rack1 Up Normal 85.01 GB 0.02% > a66 > 192.168.0.14 datacenter1 rack1 Up Normal 86.5 GB 0.02% > b33 > 192.168.0.15 datacenter1 rack1 Up Normal 83.33 GB 0.02% > c00 > 192.168.0.16 datacenter1 rack1 Up Normal 84.41 GB 0.02% > ccc > 192.168.0.17 datacenter1 rack1 Up Normal 86.97 GB 28.51% > d99 > 192.168.0.18 datacenter1 rack1 Up Normal 112.63 GB 41.88% > e66 > 192.168.0.19 datacenter1 rack1 Up Normal 88.56 GB 29.27% > f33 > 192.168.0.20 datacenter1 rack1 Up Normal 85.83 GB 0.02% > ffffffffffffffff > Interestingly, after restart, the node load (from nodetool info) is reduced). > Any ideas? The node doesn't seem to have any hardware memory issues. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira