column.name.remaining() <= 0 deep in mutation stage (+ maybe other effects)
---------------------------------------------------------------------------

                 Key: CASSANDRA-1860
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1860
             Project: Cassandra
          Issue Type: Bug
            Reporter: Peter Schuller


I suspect there is a lingering ByteBuffer consumption issue somewhere. I'll 
start with the most direct indication of this, which is the following assertion 
failure (seen on yesterday's 0.7):

2010-12-14T08:11:43.656+00:00 127.0.0.1 java.lang.AssertionError
        at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:44)
        at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
        at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
        at org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:367)
        at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:378)
        at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:223)
        at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:216)
        at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:136)
        at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:442)
        at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:379)
        at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:2952)
        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

That assertion failure is:

        assert column.name().remaining() > 0;

While I have not absolutely confirmed that we are not accidentally sending 
empty column names, I don't believe that is the case. Regardless, Cassandra 
does validate RowMutations, and part of that validation is that column names 
are non-empty (ThriftValidation.validateColumns()), so an empty name should 
have been rejected before the mutation ever reached the serializer.
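
To illustrate the kind of consumption bug I suspect, here is a minimal, hypothetical 
sketch of the general ByteBuffer pitfall (not a claim about where exactly it happens 
in Cassandra): a relative get() on a shared buffer advances its position, so a later 
remaining() check sees 0 even though the name was non-empty when it was validated.

    import java.nio.ByteBuffer;

    public class ByteBufferConsumptionSketch
    {
        public static void main(String[] args)
        {
            ByteBuffer name = ByteBuffer.wrap("colname".getBytes());

            // Validation-style check passes: the name is clearly non-empty.
            assert name.remaining() > 0;

            // Some intermediate step reads the buffer with a *relative* get,
            // which advances position() up to limit() and "consumes" the buffer.
            byte[] copy = new byte[name.remaining()];
            name.get(copy);

            // By the time a serializer looks at the same ByteBuffer instance,
            // remaining() is 0 and an assert like ColumnSerializer's would fire.
            System.out.println("remaining after consumption: " + name.remaining()); // 0

            // The defensive pattern is to read from a duplicate so the shared
            // buffer's position is never mutated:
            ByteBuffer fresh = ByteBuffer.wrap("colname".getBytes());
            ByteBuffer dup = fresh.duplicate();
            byte[] copy2 = new byte[dup.remaining()];
            dup.get(copy2);
            System.out.println("remaining with duplicate(): " + fresh.remaining()); // still 7
        }
    }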

Some more information:

This is on a three-node test cluster. We have a small number of processes 
doing nothing but insertions using insert() (via pycassa). As far as I can tell 
the column validation code runs regardless of whether the client uses 
batch_mutate() or insert() (pycassa optimizes the single-column case by using 
insert()); a rough Java equivalent of that call is sketched below.
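
For reference, the client path boils down to a single-column Thrift insert(). The 
following is a rough Java equivalent of what pycassa sends, sketched against my 
understanding of the 0.7 Thrift API and assuming a framed transport; the keyspace, 
column family, key and column names here are made up for illustration.

    import java.nio.ByteBuffer;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class SingleColumnInsert
    {
        public static void main(String[] args) throws Exception
        {
            TTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            client.set_keyspace("Keyspace1");               // hypothetical keyspace
            ColumnParent parent = new ColumnParent("Standard1"); // hypothetical CF

            // Note: the column name sent here is clearly non-empty.
            Column column = new Column(ByteBuffer.wrap("colname".getBytes("UTF-8")),
                                       ByteBuffer.wrap("value".getBytes("UTF-8")),
                                       System.currentTimeMillis() * 1000);

            client.insert(ByteBuffer.wrap("rowkey".getBytes("UTF-8")),
                          parent, column, ConsistencyLevel.ONE);
            transport.close();
        }
    }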

The interesting thing is that the three nodes in the cluster trigger different 
issues. On the two nodes that do not show the assertion failure, we had already 
observed two different stack traces:


org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=798579
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

And the other one:


java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:265)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:95)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
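
That EOFException comes out of a length-prefixed read: readByteArray-style code 
reads a length and then readFully()s that many bytes, and DataInputStream.readFully 
throws EOFException if the stream ends first. A minimal sketch of the pattern 
follows (illustrative only, not the actual FBUtilities code); the point is that EOF 
here means the payload was shorter than the declared length, the kind of mismatch 
you get if the producer's view of the data changes between computing the length and 
writing the bytes.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class LengthPrefixedReadSketch
    {
        // Read a length prefix, then exactly that many bytes.
        // If the payload is shorter than the declared length, readFully throws EOFException.
        static byte[] readByteArray(DataInputStream in) throws IOException
        {
            int length = in.readInt();
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return bytes;
        }

        public static void main(String[] args) throws IOException
        {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);

            // Declare 7 bytes but only write 3.
            out.writeInt(7);
            out.write(new byte[] { 1, 2, 3 });

            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            readByteArray(in); // throws java.io.EOFException, as in the trace above
        }
    }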

Originally I was only looking at the latter two exceptions, wondering whether 
they might have been the result of a schema mismatch due to a bootstrapping 
mistake w.r.t. seed nodes joining the cluster. However, the schema looks 
identical on all nodes in the cluster, and we keep seeing those exceptions even 
after a full repair/compact of the entire cluster (which did not fail, so all 
data is readable without problems), and after removing the hints sstables to 
get rid of any lingering mutations.

Today I noticed that the third node, which does NOT see the above two issues, 
reports quite a number of exceptions of the assertion failure kind. I am not 
sure why the behavior is inconsistent across nodes; we have three hosts running 
the python/pycassa clients, spread over several processes, and tcpdump 
indicates that thrift RPC calls are definitely flowing to hosts other than the 
third one (the one that sees the assertion failure). All nodes are running the 
same version of Cassandra (same debian package, definitely restarted since the 
latest upgrade - I know that because I restarted nodes when testing the schema 
change issue I have filed separately).

At this moment I don't have better information than the above, and I have not 
reproduced it elsewhere, e.g. on an independent cluster.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
