Hi All, Could you please help me understand the impact on my data?
I am running a 6-node 0.7-rc4 Cassandra cluster with RF=2. The schema was defined when the cluster was created and has not changed. I am doing a batch load with CL=ONE. The cluster is under some memory and I/O stress. Each node has a 1G heap. CPU is around 10% but the latency is high.

I saw this exception on 2 out of 6 nodes in a relatively short window of time. Hector clients received no exception and the nodes continued running. The exception has not happened since, even though the load is continuing. I do get an occasional OOM and I am adjusting thresholds and other settings as I go. I have also doubled the RAM to 2G since the exception.

Here is the exception. The stack trace is the same in all cases.

org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:385)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:395)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:353)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

It refers to two cfIds: cfId=1004 and cfId=1013. The MutationStage thread is always different, even for exceptions appearing within the same millisecond. As you can see below, cfId=1004 appears on both nodes several times but at different times, while cfId=1013 appears only once on one node. It happened as a single group within one second on one node and in 5 groups spread across 45 minutes on the other node. I left only the first log entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1013

Q. What does this mean for the consistency? Am I still within my guarantee of CL=ONE?

NOTE: I experienced similar exceptions in 0.7-rc2 but at that time cfIds looked corrupted. They were random/negative and these exceptions were followed by an OOM with an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg