Ideally, I would hope that a bad disk wouldn't hang a node but would
instead just cause writes to fail. If that is not the case, perhaps
the bad disk somehow wedged that server node completely, so that
requests were not being processed at all (maybe not even timed out).
At that point you'd be depending on Hector's
CassandraHostConfigurator.cassandraThriftSocketTimeout to expire,
which would cause the request to fail over to a working node. But that
value defaults to zero (i.e. forever), so if you didn't explicitly
configure it your client would hang along with the server node.
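
As a concrete illustration, setting that timeout in Hector looks
roughly like the following (this is a sketch from memory of the
0.7/0.8-era Hector API, and the host names are placeholders, so
double-check the exact class and method names against the version
you're running):

    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.factory.HFactory;

    // Give each Thrift socket an explicit timeout so a wedged node
    // can't hang the client indefinitely (the default of 0 means
    // "wait forever").
    CassandraHostConfigurator configurator =
        new CassandraHostConfigurator("node1:9160,node2:9160");
    configurator.setCassandraThriftSocketTimeout(10000); // milliseconds

    Cluster cluster = HFactory.getOrCreateCluster("TestCluster", configurator);

With that set, a request stuck on the wedged node should eventually
fail with a socket timeout, and Hector should be able to retry against
another host in its pool.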

Perhaps someone with more knowledge of Cassandra's internals could
comment on the possibility of the server hanging completely. I would
think that the logs from the bad node might help to diagnose that.
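
In case it saves a trip to the docs, the server-side settings Aaron
mentions below live in cassandra.yaml; in 0.7.x they look roughly like
this (the values shown are just the illustrative defaults, not
recommendations):

    # how writes get synced to the commit log disk
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000

    # how long the coordinator waits on replicas before returning a
    # TimedOutException to the client
    rpc_timeout_in_ms: 10000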

Jim

On Sun, Jul 31, 2011 at 4:58 PM, aaron morton <aa...@thelastpickle.com> wrote:
> Yup, it sounds like things may not have failed as they should. Do you have
> a better definition of "stuck"? Was the client waiting for a single request
> to complete, or was the client not cycling to another node?
> If there are any server log details, they may help us understand what
> happened. Also, what setting did you have for commitlog_sync in the yaml? And
> some info on the failure: did the disk stop dead, run slowly, or fail
> intermittently?
> AFAIK the wait for the writes to return should have timed out on the
> coordinator. I may be behind on the expected behaviour; perhaps a thread
> pool was shut down as part of handling the error and this prevented the
> error from returning.
> I would check the rpc_timeout in the yaml, and that the client is setting a
> client-side socket timeout. Timeouts should kick in. Then check the
> expected behaviour for Hector when it gets a timeout.
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 1 Aug 2011, at 09:40, Lior Golan wrote:
>
> Thanks Aaron. We will try to pull the logs and post them in this forum.
>
> But what I don't understand is why the client should pause at all. We are
> writing with CL.ONE, and the replication factor is 2. As far as we
> understand, the client communicates with the StorageProxy on a certain node
> (any node, for that matter), which then sends write requests to both
> replicas but waits for just the first one of them to acknowledge the write.
>
> So even if one node got stuck because of this commit log disk failure, it
> should not have hung the client. Can you explain why that happened in
> the first place?
>
> And to add to that: when we took down the Cassandra node with the faulty
> commit log disk, the client continued to write and didn't seem to be
> affected (which is what we expected to happen in the first place, but
> didn't).
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Monday, August 01, 2011 12:19 AM
> To: user@cassandra.apache.org
> Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
>
> A couple of timeouts should have kicked in.
>
> First, the rpc_timeout on the server side should have kicked in and given
> the client a (thrift) TimedOutException. Secondly, a client-side socket
> timeout should be set so the client will time out the socket. Did either of
> these appear in the client-side logs?
>
> In response to either of those, my guess would be that Hector would cycle
> the connection. (I've not checked this.)
>
> How did the disk fail? Was there anything in the server logs?
>
> Some background about handling disk failures:
> https://issues.apache.org/jira/browse/CASSANDRA-809
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 1 Aug 2011, at 08:13, Lior Golan wrote:
>
> In one of our test clusters we had a damaged commit log disk in one of the
> nodes.
>
> We have replication factor = 2 in this cluster, and write with consistency
> level = ONE. So we expected that writes would not be affected by such an issue.
> But what actually happened is that the client that was writing with CL.ONE
> got stuck. The client could resume writing when we stopped the server with
> the faulty disk (so this is another indication it's not a replication factor
> or consistency level issue).
>
> We are running Cassandra 0.7.6, and the client we're using is Hector.
>
> Can anyone explain what happened here? Why did the client get stuck when
> the commit log disk on one of the servers was damaged (and why could it
> resume writing once we actually took that server down)?
>
