Dear Alain,
Thank you for your reply.
Ok, yes I did not drain. The cluster was loaded with tons of records, and no
new records were added since few weeks. Each node had a load of about 160 GB as
seen in the “nodetool status". I killed the cassandradeamon, and restarted it.
After cassandra was restarted, I could see in the “nodetool status” a load of 5
GB!!
I don’t use counters.
I use RF 3 on 5 nodes. I did not change the replication factor.
I have two types of read queries. One use QUORUM for read, and the other use
ONE for consistency level.
I did not change the Topology.
Are you sure this node had data before you restart it ?
Actually the full story is:
- I stopped node0(192.168.2.100), and I restarted it.
- I stopped node1(192.168.2.101).
- I made a nodetool status and I noticed that node0 was UN and had a load 5 GB.
I found this really weird because all the other ones had about 160GB. I also
saw that node1 was DN with a load of about 160GB.
- I restarted node1.
- I made a nodetool status and I noticed that node1 was UN and had a load of
also 5GB, it previously had a load of about 160GB. That I’m sure.
- Then my program could no longer query C*. Neither the QUORUM nor the ONE
consistency level statements could read data.
What does a "nodetool status mykeyspace" outputs ?
I cannot try this anymore. I flushed the whole cluster, and I am currently
reloading everything. I was too much in a hurry. I have a demo tomorrow, and I
will manage to have it back before tomorrow.
After my bad decision of flushing the cluster, I realised that I could have
bootstrapped again my two nodes. Learning by doing.
"It’s like the whole cluster is paralysed" --> what does it mean, be more
accurate on this please.
You should tell us action that were taken before this occurred and now what is
not working since a C* cluster in this state could perfectly run. No SPOF.
What I did before? Well this cluster was basically idling. I was only making
lots of select on it. I was loaded since few weeks.
But what I noticed when I restarted node0 is the following:
INFO [InternalResponseStage:1] 2015-06-23 11:45:32,723
ColumnFamilyStore.java:882 - Enqueuing flush of schema_columnfamilies: 131587
(0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:2] 2015-06-23 11:45:32,723 Memtable.java:346 -
Writing Memtable-schema_columnfamilies@917967643(34850 serialized bytes, 585
ops, 0%/0% of on/off-heap limit)
>>>> WARN [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 -
>>>> Not marking nodes down due to local pause of 25509152054 > 5000000000
INFO [MemtableFlushWriter:1] 2015-06-23 11:45:33,982 Memtable.java:385 -
Completed flushing
/home/maia/apache-cassandra-DATA/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-11-Data.db
(5274 bytes) for commitlog position ReplayPos
ition(segmentId=1435052707645, position=144120)
INFO [GossipStage:1] 2015-06-23 11:45:33,985 StorageService.java:1642 - Node
/192.168.2.101 state jump to normal
INFO [GossipStage:1] 2015-06-23 11:45:33,991 Gossiper.java:987 - Node
/192.168.2.102 has restarted, now UP
INFO [SharedPool-Worker-1] 2015-06-23 11:45:33,992 Gossiper.java:954 -
InetAddress /192.168.2.102 is now UP
INFO [HANDSHAKE-/192.168.2.102] 2015-06-23 11:45:33,993
OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.102
INFO [GossipStage:1] 2015-06-23 11:45:33,993 StorageService.java:1642 - Node
/192.168.2.102 state jump to normal
INFO [GossipStage:1] 2015-06-23 11:45:33,999 Gossiper.java:987 - Node
/192.168.2.103 has restarted, now UP
INFO [SharedPool-Worker-1] 2015-06-23 11:45:33,999 Gossiper.java:954 -
InetAddress /192.168.2.103 is now UP
INFO [GossipStage:1] 2015-06-23 11:45:34,001 StorageService.java:1642 - Node
/192.168.2.103 state jump to normal
INFO [HANDSHAKE-/192.168.2.103] 2015-06-23 11:45:34,020
OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.103
INFO [main] 2015-06-23 11:45:34,021 StorageService.java:1642 - Node
zennode0/192.168.2.100 state jump to normal
INFO [GossipStage:1] 2015-06-23 11:45:34,028 StorageService.java:1642 - Node
/192.168.2.104 state jump to normal
INFO [main] 2015-06-23 11:45:34,038 CassandraDaemon.java:583 - Waiting for
gossip to settle before accepting client requests...
INFO [GossipStage:1] 2015-06-23 11:45:34,039 StorageService.java:1642 - Node
/192.168.2.101 state jump to normal
INFO [GossipStage:1] 2015-06-23 11:45:34,047 StorageService.java:1642 - Node
/192.168.2.103 state jump to normal
INFO [GossipStage:1] 2015-06-23 11:45:34,055 StorageService.java:1642 - Node
/192.168.2.102 state jump to normal
INFO [CompactionExecutor:1] 2015-06-23 11:45:34,062 CompactionTask.java:270 -
Compacted 1 sstables to
[/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2,].
171,593 bytes to 171,593 (~100% of orig
inal) in 1,544ms = 0.105987MB/s. 2 total partitions merged to 2. Partition
merge counts were {1:2, }
INFO [CompactionExecutor:4] 2015-06-23 11:45:34,063 CompactionTask.java:140 -
Compacting
[SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2-Data.db')]
INFO [GossipStage:1] 2015-06-23 11:45:34,066 StorageService.java:1642 - Node
/192.168.2.102 state jump to normal
INFO [MemtableFlushWriter:2] 2015-06-23 11:45:34,140 Memtable.java:385 -
Completed flushing
/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db
(8246 bytes)
for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
INFO [CompactionExecutor:2] 2015-06-23 11:45:34,140 CompactionTask.java:140 -
Compacting
[SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-11-Dat
a.db'),
SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-10-Data.db'),
SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_col
umnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-9-Data.db'),
SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data
.db')]
INFO [InternalResponseStage:1] 2015-06-23 11:45:34,140
ColumnFamilyStore.java:882 - Enqueuing flush of schema_columns: 284282 (0%)
on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:1] 2015-06-23 11:45:34,141 Memtable.java:346 -
Writing Memtable-schema_columns@813508218(55024 serialized bytes, 1041 ops,
0%/0% of on/off-heap limit)
INFO [CompactionExecutor:4] 2015-06-23 11:45:37,980 CompactionTask.java:270 -
Compacted 1 sstables to
[/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-3,].
171,593 bytes to 171,593 (~100% of orig
inal) in 3,916ms = 0.041789MB/s. 2 total partitions merged to 2. Partition
merge counts were {1:2, }
INFO [MemtableFlushWriter:1] 2015-06-23 11:45:39,265 Memtable.java:385 -
Completed flushing
/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db
(12966 bytes) for commitlog
position ReplayPosition(segmentId=1435052707645, position=251695)
INFO [CompactionExecutor:1] 2015-06-23 11:45:39,266 CompactionTask.java:140 -
Compacting
[SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-10-Data.db'),
SSTabl
eReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-11-Data.db'),
SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17
d3df2122a/system-schema_columns-ka-9-Data.db'),
SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db')]
INFO [InternalResponseStage:1] 2015-06-23 11:45:39,271
ColumnFamilyStore.java:882 - Enqueuing flush of schema_triggers: 154 (0%)
on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:2] 2015-06-23 11:45:39,271 Memtable.java:346 -
Writing Memtable-schema_triggers@652166139(8 serialized bytes, 1 ops, 0%/0% of
on/off-heap limit)
INFO [CompactionExecutor:2] 2015-06-23 11:45:39,275 CompactionTask.java:270 -
Compacted 4 sstables to
[/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-13,].
17,971
bytes to 8,246 (~45% of original) in 5,134ms = 0.001532MB/s. 8 total
partitions merged to 3. Partition merge counts were {2:2, 4:1, }
INFO [main] 2015-06-23 11:45:42,042 CassandraDaemon.java:615 - No gossip
backlog; proceeding
I don’t know what the warning means. WARN [GossipTasks:1] 2015-06-23
11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local
pause of 25509152054 > 5000000000
And shortly after the proceeding…
INFO [HANDSHAKE-zennode0/192.168.2.100] 2015-06-23 11:46:00,566
OutboundTcpConnection.java:485 - Handshaking version with zennode0/192.168.2.100
WARN [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,595
IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from
socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89)
~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69)
~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88)
~[apache-cassandra-2.1.6.jar:2.1.6]
WARN [MessagingService-Incoming-/192.168.2.102] 2015-06-23 11:46:00,629
IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from
socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89)
~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69)
~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88)
~[apache-cassandra-2.1.6.jar:2.1.6]
WARN [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,633
IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from
socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89)
~[apache-cassandra-2.1.6.jar:2.1.6]
at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74)
~[apache-cassandra-2.1.6.jar:2.1.6]
That seems to me more nasty. I don’t know what it means, but I have the
impression that the nodes which I did not touch seem now to talk with node0
about a table which the latter does not know about!!! A table node0 “forgot”.
Well I don’t ask for miracle here. If I would have noticed that there was
already a problem after restarted the first node I would have simply fixed that
node before doing anything else… but I did not noticed this.
Thanks for your help...
On 23 Jun 2015, at 19:24 , Alain RODRIGUEZ
<[email protected]<mailto:[email protected]>> wrote:
Hi Jean,
"I had to reboot a node. I killed the cassandra process on that node". You
should drain the node before killing java (or using any service stop command).
This is not what causes the issue yet it will help you to keep consistence if
you use counters, and make the reboot faster in any cases.
What is going on highly depends on what you did before.
Did you change the RF ?
Did you change the Topology ?
Are you sure this node had data before you restart it ?
What does a "nodetool status mykeyspace" outputs ?
To make the join faster you could try to bootstrap the node again. I just hope
you have a RF > 1 (btw, you have one replica down, if you want to still have
Reads / Writes working, take care of having a Consistency Level low enough).
"It’s like the whole cluster is paralysed" --> what does it mean, be more
accurate on this please.
You should tell us action that were taken before this occurred and now what is
not working since a C* cluster in this state could perfectly run. No SPOF.
C*heers
2015-06-23 16:10 GMT+02:00 Jean Tremblay
<[email protected]<mailto:[email protected]>>:
Does anyone know what to do when such an event occurs?
Does anyone know how this could happen?
I would have tried repairing the node with nodetool repair but that takes much
too long… I need my cluster to work for our online system. At this point
nothing is working. It’s like the whole cluster is paralysed.
The only solution I see is to remove temporarily that node.
Thanks for your comments.
On 23 Jun 2015, at 12:45 , Jean Tremblay
<[email protected]<mailto:[email protected]>>
wrote:
Hi,
I have a cluster with 5 nodes running Cassandra 2.1.6.
I had to reboot a node. I killed the cassandra process on that node. Rebooted
the machine, and restarted Cassandra.
~/apache-cassandra-DATA/data:321>nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID
Rack
UN 192.168.2.104 158.27 GB 256 ?
6479205e-6a19-49a8-b1a1-7e788ec29caa rack1
UN 192.168.2.100 4.75 GB 256 ?
e821da50-23c6-4ea0-b3a1-275ded63bc1f rack1
UN 192.168.2.101 157.43 GB 256 ?
01525665-bacc-4207-a8c3-eb4fd9532401 rack1
UN 192.168.2.102 159.27 GB 256 ?
596a33d7-5089-4c7e-a9ad-e1f22111b160 rack1
UN 192.168.2.103 167 GB 256 ?
0ce1d48e-57a9-4615-8e12-d7ef3d621c7d rack1
After restarting node 192.168.2.100 I noticed that its load was diminish to 2%.
And now when I query the cluster my queries are bombing and that node times out
with an error
WARN [MessagingService-Incoming-/192.168.2.102<http://192.168.2.102/>]
2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 -
UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302)
~[apache-cassandra-2.1.6.jar:2.1.6]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330)
~[apache-cassandra-2.1.6.jar:2.1.6]
What is going on? Did anyone live something like that?