[ https://issues.apache.org/jira/browse/CASSANDRA-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-5725:
--------------------------------------

Still working on this, [~sbtourist]?

> Silently failing messages in case of schema not fully propagated
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-5725
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5725
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sergio Bossa
>             Fix For: 1.2.9
>
>         Attachments: 5725-0001.patch
>
>
> When a new keyspace and/or column family is created on a multi-node cluster
> (at least three nodes), and a mutation is then executed on the new column
> family, the operation sometimes silently fails by timing out.
> I tracked this down to the schema not being fully propagated to all nodes.
> Here's what happens:
> 1) Node 1 receives the create keyspace/column family request.
> 2) The same node receives a mutation request at CL.QUORUM and sends it to the
> other nodes too.
> 3) Upon receiving the mutation request, the other nodes try to deserialize it
> and fail to do so if the schema is not fully propagated, i.e. because they
> don't find the mutated column family.
> 4) The connection between node 1 and the failed node is dropped, and the
> request on the former hangs until it times out.
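
For readers following along, steps 1) to 4) above can be modelled with a toy sketch. None of this is Cassandra code: `SchemaRaceSketch`, `Replica`, `applySchema`, `receive` and `mutate` are hypothetical names, and the "connection drop" is reduced to an acknowledgement that never arrives. It only illustrates why the coordinator sees a timeout rather than an error.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the schema race; an illustration only, not Cassandra internals. */
public class SchemaRaceSketch {

    /** Hypothetical replica that only knows the column families in its local schema. */
    static final class Replica {
        private final Set<UUID> knownCfIds = new HashSet<>();

        /** Models the schema change finally reaching this node. */
        void applySchema(UUID cfId) {
            knownCfIds.add(cfId);
        }

        /**
         * Models steps 3 and 4: an unknown cfId makes deserialization fail,
         * the message is lost, and no acknowledgement is sent back.
         */
        void receive(UUID cfId, CountDownLatch ack) {
            if (!knownCfIds.contains(cfId)) {
                return; // silently dropped: the sender is never told
            }
            ack.countDown();
        }
    }

    /** Coordinator side: send the mutation, then wait for an ack or the timeout. */
    static boolean mutate(Replica replica, UUID cfId, long timeoutMillis) {
        CountDownLatch ack = new CountDownLatch(1);
        replica.receive(cfId, ack);
        try {
            return ack.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        UUID cfId = UUID.randomUUID();
        Replica replica = new Replica();

        // Schema not yet propagated: the write waits, then times out.
        System.out.println("before schema: acked=" + mutate(replica, cfId, 100));

        // Schema propagated: the same write is acknowledged.
        replica.applySchema(cfId);
        System.out.println("after schema: acked=" + mutate(replica, cfId, 100));
    }
}
```

In this model the only signal the sender gets is the elapsed timeout, which matches the "silently fails by timing out" symptom in the report.
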
> Here is the underlying exception; I had to tweak several log levels to get
> it:
> {noformat}
>  INFO 13:11:39,441 IOException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a31c7604-0e40-393b-82d7-ba3d910ad50a
>     at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:184)
>     at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:94)
>     at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:397)
>     at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:407)
>     at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:367)
>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:94)
>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:207)
>     at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:139)
>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:82)
> {noformat}
> Finally, there's probably a correlated failure happening during repairs of a
> newly created/mutated column family, causing the repair process to hang
> forever as follows:
> {noformat}
> "AntiEntropySessions:1" daemon prio=5 tid=7fe981148000 nid=0x11abea000 in Object.wait() [11abe9000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <7c6200840> (a org.apache.cassandra.utils.SimpleCondition)
>     at java.lang.Object.wait(Object.java:485)
>     at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>     - locked <7c6200840> (a org.apache.cassandra.utils.SimpleCondition)
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession.runMayThrow(AntiEntropyService.java:695)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:680)
> "http-8983-1" daemon prio=5 tid=7fe97d24d000 nid=0x11a5c8000 in Object.wait() [11a5c6000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <7c620db58> (a org.apache.cassandra.utils.SimpleCondition)
>     at java.lang.Object.wait(Object.java:485)
>     at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>     - locked <7c620db58> (a org.apache.cassandra.utils.SimpleCondition)
>     at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2442)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at org.apache.cassandra.service.StorageService.forceTableRepairRange(StorageService.java:2409)
>     at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2387)
>     at com.datastax.bdp.cassandra.index.solr.SolrCoreResourceManager.repairResources(SolrCoreResourceManager.java:693)
>     at com.datastax.bdp.cassandra.index.solr.SolrCoreResourceManager.createCore(SolrCoreResourceManager.java:255)
>     at com.datastax.bdp.cassandra.index.solr.CassandraCoreAdminHandler.handleCreateAction(CassandraCoreAdminHandler.java:121)
>     at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:144)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:615)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:206)
> {noformat}
> I wasn't able to capture any exception, as I can't reproduce it reliably
> enough, but I believe it's correlated with schema propagation: based on the
> log messages, the merkle tree request on node 1 happens concurrently with
> schema installation on the other nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira