column.name.remaining() <= 0 deep in mutation stage (+ maybe other effects)
---------------------------------------------------------------------------
                 Key: CASSANDRA-1860
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1860
             Project: Cassandra
          Issue Type: Bug
            Reporter: Peter Schuller

I suspect there is a lingering ByteBuffer consumption issue somewhere. I'll start with the most direct indication of this, which is that I see the following (on yesterday's 0.7, logged 2010-12-14T08:11:43 on 127.0.0.1):

java.lang.AssertionError
        at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:44)
        at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
        at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
        at org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:367)
        at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:378)
        at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:223)
        at org.apache.cassandra.db.RowMutation.makeRowMutationMessage(RowMutation.java:216)
        at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:136)
        at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:442)
        at org.apache.cassandra.thrift.CassandraServer.insert(CassandraServer.java:379)
        at org.apache.cassandra.thrift.Cassandra$Processor$insert.process(Cassandra.java:2952)
        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

The failing assertion is:

    assert column.name().remaining() > 0;

While I have not absolutely confirmed that we are not accidentally sending empty column names, I don't believe we are; in any case, Cassandra validates RowMutations, and part of that validation is that column names are non-empty (ThriftValidation.validateColumns()).

Some more information: this is on a three-node test cluster, with a small number of processes doing nothing but insertions via pycassa's insert(). As far as I can tell, the column validation code runs regardless of whether you use batch_mutate() or insert() (pycassa optimizes the single-column case by using insert()), so empty names coming from a client should be rejected either way. The interesting thing is that the three nodes in the cluster trigger different issues.
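Before getting to the other two nodes, here is a minimal, self-contained sketch of the bug class I suspect. This is purely an illustration, not the confirmed code path; the class and variable names are mine:

    import java.nio.ByteBuffer;

    public class ConsumedNameDemo
    {
        public static void main(String[] args)
        {
            // A column name backed by a ByteBuffer, as in the 0.7 API.
            ByteBuffer name = ByteBuffer.wrap("colname".getBytes());

            // Some earlier consumer drains the buffer with a relative
            // read, advancing its position all the way to the limit.
            byte[] scratch = new byte[name.remaining()];
            name.get(scratch);                 // position == limit now

            // A later serialization pass sharing the SAME instance now
            // sees zero remaining bytes, i.e. an "empty" column name,
            // even though the client sent a non-empty one. Run with
            // -ea and this trips just like ColumnSerializer.java:44:
            assert name.remaining() > 0;

            // Defensive pattern: hand each consumer its own view, so
            // relative reads cannot disturb anyone else's position:
            //     ByteBuffer view = name.duplicate();
        }
    }

duplicate() gives each consumer an independent position and limit over the same backing bytes, which is the usual defensive pattern whenever a buffer is shared; if some path in the mutation stage skips it, you get exactly this assertion without the client ever sending an empty name.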
On the two remaining nodes we had already observed two different stack traces:

org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=798579
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

And the other one:

java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:265)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:95)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:388)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:398)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:356)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Originally I was only looking at these latter two, wondering whether they might result from a schema mismatch caused by a bootstrapping mistake w.r.t. seed nodes joining the cluster. However, the schema matches on all nodes in the cluster, and we keep seeing the exceptions even after a full repair/compact of the entire cluster (which completes without failures, so all data reads back without problems) and after removing the hints sstables to clear out any lingering mutations.

Today I noticed that the third node, which does NOT see the above two issues, reports quite a number of the assertion failures. I am not sure why the behavior is inconsistent: we have three hosts running the python/pycassa clients spread over several processes, and tcpdump shows Thrift RPC calls definitely flowing to hosts other than the third one (the one that sees the assertion failure). All nodes are running the same version of Cassandra (same Debian package, definitely restarted since the latest upgrade; I know that because I restarted nodes while testing the schema change issue I have filed separately).

At this moment I don't have better information than the above, and I have not reproduced it on, e.g., an independent cluster. That said, both receive-side failures would be consistent with mis-framed bytes on the wire; a sketch of that follows below.
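This is speculation about the mechanism, not a confirmed diagnosis. The sketch assumes nothing beyond the length-then-readFully() pattern visible in the FBUtilities.readByteArray() frame above; the class and variable names are mine. Once a writer emits a length prefix that does not match the bytes that follow (for example, because a consumed ByteBuffer reported remaining() == 0), the reader's framing slips:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.IOException;

    public class MisframedStreamDemo
    {
        public static void main(String[] args) throws IOException
        {
            // Writer side: declare 7 payload bytes but write only 3,
            // the kind of mis-framing a consumed ByteBuffer upstream
            // could produce.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(7);                   // declared length
            out.write(new byte[]{ 1, 2, 3 });  // actual payload

            // Reader side: length-then-readFully, as in the second
            // trace; readFully() hits end-of-stream mid-array.
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            byte[] value = new byte[in.readInt()];
            try
            {
                in.readFully(value);           // throws EOFException
            }
            catch (EOFException e)
            {
                System.out.println("EOF mid-frame: " + e);
            }
            // Had more data followed the short payload, the reader
            // would not hit EOF but would instead consume a neighbor's
            // bytes and later decode a nonsense value, e.g. a cfId
            // like 798579 that no schema contains.
        }
    }

If something like this is happening, it would also fit the observation that repair/compact reads all on-disk data cleanly: the sstables themselves would be fine, and only the inter-node mutation frames would be damaged.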