Sounds like there are multiple versions of your schema around the cluster. What client API are you using? Does it support the describe_schema_versions() function? This will tell you how many versions there are. 
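
As a rough illustration of how to read that result: describe_schema_versions() in the Thrift API returns a map from schema version to the list of hosts at that version. The sketch below does not call Cassandra. It only summarizes such a map once your client has fetched it. The helper name summarize_schema_versions, the sample addresses, and the treatment of an "UNREACHABLE" key for hosts that did not respond are assumptions for the example, so check them against your client.

```python
from typing import Dict, List

# Assumed key under which non-responding hosts are reported.
UNREACHABLE = "UNREACHABLE"


def summarize_schema_versions(versions: Dict[str, List[str]]) -> dict:
    """Summarize a {schema_version: [hosts]} map (hypothetical helper)."""
    # Ignore the unreachable bucket when counting distinct schema versions.
    live = {v: sorted(hosts) for v, hosts in versions.items() if v != UNREACHABLE}
    return {
        "version_count": len(live),
        # The cluster agrees only if every reachable node reports one version.
        "in_agreement": len(live) == 1,
        "unreachable": sorted(versions.get(UNREACHABLE, [])),
        "by_version": live,
    }


if __name__ == "__main__":
    # Made-up sample data: two schema versions present in the cluster.
    sample = {
        "version-a": ["10.0.0.140", "10.0.0.141", "10.0.0.142", "10.0.0.143"],
        "version-b": ["10.0.0.144", "10.0.0.145"],
    }
    summary = summarize_schema_versions(sample)
    print("distinct versions:", summary["version_count"])
    print("in agreement:", summary["in_agreement"])
```

More than one version among reachable nodes means the schema has not converged, which would be consistent with a node rejecting a mutation for a cfId it does not know about.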

The easy solution here is to scrub the data and start a new 0.7 cluster using the release version. If possible, you should not use data created in the non-release versions once you get to production.

Hope that helps. 
Aaron


On 21 Jan, 2011, at 09:15 AM, Oleg Proudnikov <ol...@cloudorange.com> wrote:

Hi All,

Could you please help me understand the impact on my data?

I am running a 6 node 0.7-rc4 Cassandra cluster with RF=2. Schema was defined
when the cluster was created and did not change. I am doing batch load with
CL=ONE. The cluster is under some stress in memory and I/O. Each node has 1G
heap. CPU is around 10% but the latency is high.

I saw this exception on 2 out of 6 nodes in a relatively short window of time.
Hector clients received no exception and the nodes continued running. The
exception has not happened since even though the load is continuing.
I do get an occasional OOM and I am adjusting thresholds and other
settings as I go. I also doubled RAM to 2G since the exception.

Here is the exception - the same stack trace in all cases.
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:385)
at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:395)
at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:353)
at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


It refers to two cfIds - cfId=1004 and cfId=1013. Mutation stages are always
different even for the exceptions appearing within the same millisecond.
As you can see below, cfId=1004 appears on both nodes several times but at
different times, while cfId=1013 appears only once on one node.

It happened as a group within one second on one node and in 5 groups spread
across 45 minutes on another node. I left the first log entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log


xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1004


xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1013



Q. What does this mean for the consistency? Am I still within my guarantee of
CL=ONE?



NOTE: I experienced similar exceptions in 0.7-rc2 but at that time cfIds looked
corrupted. They were random/negative and these exceptions
were followed by an OOM with an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg


