Hi All,

Could you please help me understand the impact of the exception below on my
data?

I am running a 6-node Cassandra 0.7-rc4 cluster with RF=2. The schema was
defined when the cluster was created and has not changed. I am doing a batch
load with CL=ONE. The cluster is under some memory and I/O pressure. Each node
has a 1G heap. CPU is around 10% but the latency is high.

I saw the exception below on 2 out of 6 nodes in a relatively short window of
time. The Hector clients received no exception and the nodes continued running.
The exception has not happened since, even though the load is continuing.
I do get an occasional OOM and I am adjusting thresholds and other
settings as I go. I also doubled RAM to 2G since the exception.

Here is the exception - the same stack trace in all cases.
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
 at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
 at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:385)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:395)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:353)
 at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
 at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)


The exceptions refer to two cfIds: cfId=1004 and cfId=1013. The mutation stages
are always different, even for exceptions appearing within the same millisecond.
As you can see below, cfId=1004 appears on both nodes several times but at
different times, while cfId=1013 appears only once, on one node.

The exceptions happened as one group within one second on one node and in 5
groups spread across 45 minutes on the other node. I kept only the first log
entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log


xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004


xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1013
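
In case it helps anyone compare logs across nodes, here is a minimal Python
sketch of how the grouping described above could be reproduced. It assumes the
entry layout shown in the excerpts (an ERROR header carrying the stage name and
timestamp, followed by the exception line containing the cfId). The function
name group_cfid_errors and the one-second gap are my own illustrative choices,
not anything provided by Cassandra.

```python
import re
from datetime import datetime, timedelta

# Header of a log entry, as it appears in the excerpts above (assumed layout).
ENTRY = re.compile(
    r"ERROR \[(MutationStage:\d+)\] "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)
# cfId value; the sign is optional because negative ids have been seen before.
CFID = re.compile(r"cfId=(-?\d+)")


def group_cfid_errors(lines, gap=timedelta(seconds=1)):
    """Group cfId errors whose timestamps are at most `gap` apart.

    Returns a list of groups; each group is a list of
    (timestamp, stage, cfid) tuples in log order.
    """
    groups = []
    pending = None  # (timestamp, stage) of the most recent ERROR header
    for line in lines:
        header = ENTRY.search(line)
        if header:
            ts = datetime.strptime(header.group(2), "%Y-%m-%d %H:%M:%S,%f")
            pending = (ts, header.group(1))
            continue
        cfid = CFID.search(line)
        if cfid and pending:
            entry = (pending[0], pending[1], int(cfid.group(1)))
            # Extend the current group if this entry is close to its last one.
            if groups and entry[0] - groups[-1][-1][0] <= gap:
                groups[-1].append(entry)
            else:
                groups.append([entry])
            pending = None
    return groups
```

Passing an open log file (for example open("log/cassandra.log")) as lines and
printing the first entry of each returned group should give a summary of the
same shape as the grep output above.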



Q. What does this mean for consistency? Am I still within my guarantee of
CL=ONE?



NOTE: I experienced similar exceptions in 0.7-rc2, but at that time the cfIds
looked corrupted: they were random or negative, and the exceptions were
followed by an OOM on an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg


