Hi, all,

Here is some more information on what happened before the OOM occurred on the rebooted node in a 2-node test cluster:
1. It seems the schema version changed on the rebooted node after the reboot, i.e.

Before the reboot:

Node 1:
DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

Node 2:
DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 MigrationManager.java (line 328) Gossiping my schema version 4cb463f8-5376-3baf-8e88-a5cc6a94f58f

After rebooting node 2:

Node 2:
DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b

2. After the reboot, both nodes repeatedly send MigrationTasks to each other - we suspect this is related to the schema version (digest) mismatch after node 2 rebooted. Node 2 keeps submitting the migration task to the other node, over 100 times (a rough sketch of the logic we think is involved follows after the log excerpt below):

INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node /192.168.88.33 has restarted, now UP
INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
INFO [GossipStage:1] 2016-04-19 11:18:18,263 StorageService.java (line 1544) Node /192.168.88.33 state jump to normal
INFO [GossipStage:1] 2016-04-19 11:18:18,264 TokenMetadata.java (line 414) Updating topology for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,268 MigrationTask.java (line 62) Can't send schema pull request: node /192.168.88.33 is down.
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,353 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,353 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:1] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:1] 2016-04-19 11:18:18,355 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.33
INFO [RequestResponseStage:2] 2016-04-19 11:18:18,355 Gossiper.java (line 978) InetAddress /192.168.88.33 is now UP
DEBUG [RequestResponseStage:2] 2016-04-19 11:18:18,356 MigrationManager.java (line 102) Submitting migration task for /192.168.88.33
.....
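For what it's worth, here is a rough sketch of how we understand the gossip-driven schema pull, based only on the log lines above. The class and method names (SchemaPullSketch, maybeScheduleSchemaPull, requestFullSchemaFrom) are made up for illustration; this is not the actual Cassandra 2.0 source:

    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class SchemaPullSketch {
        // Stand-in for the MigrationStage: one worker, tasks queue up behind it.
        private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
        private volatile UUID localSchemaVersion = UUID.randomUUID();

        // Called whenever gossip reports a peer's state (or an ack arrives).
        void maybeScheduleSchemaPull(String endpoint, UUID remoteSchemaVersion, boolean endpointAlive) {
            if (remoteSchemaVersion.equals(localSchemaVersion))
                return; // digests agree, nothing to do
            System.out.println("Submitting migration task for " + endpoint);
            migrationStage.submit(() -> {
                if (!endpointAlive) {
                    System.out.println("Can't send schema pull request: node " + endpoint + " is down.");
                    return;
                }
                // In the real system this sends a migration request; the reply
                // (a copy of the peer's schema) is handled on the Internal-Response pool.
                requestFullSchemaFrom(endpoint);
            });
        }

        private void requestFullSchemaFrom(String endpoint) { /* network call elided */ }
    }

If the two nodes keep computing different digests for each other (as in point 1), then every gossip round and every request/response ack would schedule yet another pull, which would line up with the 100+ submissions we observed.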
On the other hand, Node 1 keeps updating its gossip information, and then repeatedly receives and submits MigrationTasks afterwards:

DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,332 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:4] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
DEBUG [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 977) removing expire time for endpoint : /192.168.88.34
INFO [RequestResponseStage:3] 2016-04-19 11:18:18,335 Gossiper.java (line 978) InetAddress /192.168.88.34 is now UP
......
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,595 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,843 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
DEBUG [MigrationStage:1] 2016-04-19 11:18:18,878 MigrationRequestVerbHandler.java (line 41) Received migration request from /192.168.88.34.
......
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 127) submitting migration task for /192.168.88.34
.....

Has anyone experienced this scenario? (A toy illustration of how we think the pending responses could blow up the heap is attached below, after the quoted original message.)

Thanks in advance!

Sincerely,

Michael Fong

From: Michael Fong [mailto:michael.f...@ruckuswireless.com]
Sent: Wednesday, April 20, 2016 10:43 AM
To: user@cassandra.apache.org; d...@cassandra.apache.org
Subject: Cassandra 2.0.x OOM during bootstrap

Hi, all,

We have recently encountered a Cassandra OOM issue when Cassandra is brought up, sometimes (but not always), in our 4-node cluster test bed. After analyzing the heap dump, we found the Internal-Response thread pool (JMXEnabledThreadPoolExecutor) filled with thousands of 'org.apache.cassandra.net.MessageIn' objects, occupying more than 2 gigabytes of heap memory. According to the documentation we found online, the internal-response thread pool seems to be somewhat related to schema checking.

Has anyone encountered a similar issue before? We are using Cassandra 2.0.17 and JDK 1.8.

Thanks in advance!

Sincerely,

Michael Fong
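P.S. Coming back to the heap-dump symptom in the original message quoted above: below is a toy illustration (not Cassandra code; every name in it is made up) of how an executor with an unbounded task queue can pin thousands of large response objects in memory when work is submitted faster than its workers can drain it. We are assuming the Internal-Response pool behaves roughly like this; we have not verified its exact configuration.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class UnboundedQueueDemo {
        // Stand-in for a deserialized MessageIn carrying a large schema payload.
        static class FakeResponse {
            final byte[] payload = new byte[1_000_000]; // roughly 1 MB each
        }

        public static void main(String[] args) throws InterruptedException {
            // One worker and an unbounded queue, purely for demonstration;
            // the real stage's thread count and queue sizing may differ.
            ExecutorService internalResponseStage = Executors.newSingleThreadExecutor();

            // "Responses" arrive far faster than the worker can process them,
            // so the queue, and the payloads the queued tasks hold, just grows.
            for (int i = 0; i < 10_000; i++) {
                FakeResponse response = new FakeResponse();
                internalResponseStage.submit(() -> {
                    sleepQuietly(100);  // pretend merging the schema takes a while
                    consume(response);  // payload stays reachable until this runs
                });
            }
            // Long before the queue drains, thousands of FakeResponse objects are
            // pinned in memory by pending tasks; with a small heap this will OOM.
            internalResponseStage.shutdown();
            internalResponseStage.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void consume(FakeResponse r) { /* no-op */ }

        private static void sleepQuietly(long millis) {
            try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

If each queued response carries a full schema payload, the pending tasks alone could plausibly account for the more than 2 GB of MessageIn objects attributed to the Internal-Response pool in our heap dump.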