I'm seeing an issue similar to: http://issues.apache.org/jira/browse/CASSANDRA-169
Here is when I see it: I'm running Cassandra on 5 nodes using the OrderPreservingPartitioner, and I have populated Cassandra with 78 records. I can call get_key_range via Thrift just fine. But if I manually kill one of the nodes (say, node #5), then the node I've been using to call get_key_range (node #1) times out with the error:

Thrift: Internal error processing get_key_range

and the Cassandra output shows the same trace as in CASSANDRA-169:

ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed]
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
        at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
        at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
        at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
WARN - Closing down connection java.nio.channels.SocketChannel[closed]
ERROR - Internal error processing get_key_range
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out.
        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
        at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
        at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:675)
Caused by: java.util.concurrent.TimeoutException: Operation timed out.
        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
        ... 7 more

If it failed only once, I could simply catch the error and retry. But get_key_range calls to node #1 (the node I was already querying) never succeed again, even though that node stays up and responds fine to multiget Thrift calls, and sometimes it doesn't recover even after I restart the downed node (node #5). I end up having to restart node #1 in addition to node #5. The behavior of the other three nodes varies: some of them are also unable to respond to get_key_range calls, while others respond normally.

My question is: what path should I go down in terms of reproducing this problem? I'm using trunk code from Aug 27 — should I update my Cassandra install before gathering more information for this issue, and if so, to which version (0.4 or trunk)? If anyone is familiar with this issue, could you let me know what I might be doing wrong, or what my next info-gathering step should be?

Thank you,

Simon Smith
Arcode Corporation
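P.S. To be clear, by "catching the error and trying again" I mean a generic wrapper along these lines. This is just a sketch: the Callable stands in for the actual Thrift get_key_range call, and the RetryHelper/retry names are my own, not Cassandra or Thrift API.

```java
import java.util.concurrent.Callable;

// Sketch of a catch-and-retry wrapper. The Callable stands in for the
// Thrift get_key_range call; all names here are hypothetical.
public class RetryHelper {

    // Run op up to maxAttempts times, rethrowing the last failure
    // if every attempt throws (e.g. a TimeoutException from a node
    // that can't reach a dead replica).
    public static <T> T retry(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;  // remember the failure and try again
            }
        }
        throw last;  // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        // Toy demonstration: an operation that fails twice, then succeeds.
        int[] attempts = {0};
        String result = retry(() -> {
            attempts[0]++;
            if (attempts[0] < 3) throw new RuntimeException("simulated timeout");
            return "ok";
        }, 5);
        System.out.println(result + " after " + attempts[0] + " attempts");
        // prints "ok after 3 attempts"
    }
}
```

The point is that this kind of wrapper only helps with transient failures; in my case the node never recovers, so retrying alone doesn't help.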