Hi, all,

Here is some more information on what happened before the OOM occurred on the rebooted node in a 2-node test cluster:
1. It seems the schema version changed on the rebooted node after the reboot, i.e.

Before the reboot:

Node 1:
DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

Node 2:
DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2:

Node 2:
DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

2. After the reboot, both nodes repeatedly send MigrationTasks to each other - we suspect this is related to the schema version (digest) mismatch after node 2 rebooted. Node 2 keeps submitting the migration task to the other node, over 100 times (a rough sketch of the logic we think is involved follows after the log excerpt below):

INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
.....
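For what it's worth, here is a rough sketch of how we understand the gossip-driven schema pull, based only on the log lines above. The class and method names (SchemaPullSketch, maybeScheduleSchemaPull, requestFullSchemaFrom) are made up for illustration; this is not the actual Cassandra 2.0 source:

    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class SchemaPullSketch {
        // Stand-in for the MigrationStage: one worker, tasks queue up behind it.
        private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
        private volatile UUID localSchemaVersion = UUID.randomUUID();

        // Called whenever gossip reports a peer's state (or an ack arrives).
        void maybeScheduleSchemaPull(String endpoint, UUID remoteSchemaVersion, boolean endpointAlive) {
            if (remoteSchemaVersion.equals(localSchemaVersion))
                return; // digests agree, nothing to do
            System.out.println("Submitting migration task for " + endpoint);
            migrationStage.submit(() -> {
                if (!endpointAlive) {
                    System.out.println("Can't send schema pull request: node " + endpoint + " is down.");
                    return;
                }
                // In the real system this sends a migration request; the reply
                // (a copy of the peer's schema) is handled on the Internal-Response pool.
                requestFullSchemaFrom(endpoint);
            });
        }

        private void requestFullSchemaFrom(String endpoint) { /* network call elided */ }
    }

If the two nodes keep computing different digests for each other (as in point 1), then every gossip round and every request/response ack would schedule yet another pull, which would line up with the 100+ submissions we observed.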
On the other hand, Node 1 keeps updating its gossip information, and then repeatedly receives and submits MigrationTasks afterwards:

DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
......
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
......
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
.....

Has anyone experienced this scenario? (A toy illustration of how we think the pending responses could blow up the heap is attached below, after the quoted original message.)

Thanks in advance!

Sincerely,

Michael Fong

From: Michael Fong [mailto:michael.f...@ruckuswireless.com]
Sent: Wednesday, April 20, 2016 10:43 AM
To: user@cassandra.apache.org; d...@cassandra.apache.org
Subject: Cassandra 2.0.x OOM during bootstrap

Hi, all,

We have recently encountered a Cassandra OOM issue when Cassandra is brought up, sometimes (but not always), in our 4-node cluster test bed. After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory. According to the documentation we found online, the internal-response thread pool seems to be somewhat related to schema checking.

Has anyone encountered a similar issue before? We are using Cassandra 2.0.17 and JDK 1.8.

Thanks in advance!

Sincerely,

Michael Fong
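P.S. Coming back to the heap-dump symptom in the original message quoted above: below is a toy illustration (not Cassandra code; every name in it is made up) of how an executor with an unbounded task queue can pin thousands of large response objects in memory when work is submitted faster than its workers can drain it. We are assuming the Internal-Response pool behaves roughly like this; we have not verified its exact configuration.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class UnboundedQueueDemo {
        // Stand-in for a deserialized MessageIn carrying a large schema payload.
        static class FakeResponse {
            final byte[] payload = new byte[1_000_000]; // roughly 1 MB each
        }

        public static void main(String[] args) throws InterruptedException {
            // One worker and an unbounded queue, purely for demonstration;
            // the real stage's thread count and queue sizing may differ.
            ExecutorService internalResponseStage = Executors.newSingleThreadExecutor();

            // "Responses" arrive far faster than the worker can process them,
            // so the queue, and the payloads the queued tasks hold, just grows.
            for (int i = 0; i < 10_000; i++) {
                FakeResponse response = new FakeResponse();
                internalResponseStage.submit(() -> {
                    sleepQuietly(100);  // pretend merging the schema takes a while
                    consume(response);  // payload stays reachable until this runs
                });
            }
            // Long before the queue drains, thousands of FakeResponse objects are
            // pinned in memory by pending tasks; with a small heap this will OOM.
            internalResponseStage.shutdown();
            internalResponseStage.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void consume(FakeResponse r) { /* no-op */ }

        private static void sleepQuietly(long millis) {
            try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

If each queued response carries a full schema payload, the pending tasks alone could plausibly account for the more than 2 GB of MessageIn objects attributed to the Internal-Response pool in our heap dump.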