Hello,

Some context on our environment: we have a cluster of 9 nodes with a few
keyspaces. The client writes to a keyspace that has a replication factor
of 3, using a consistency level of ONE. The Hector client is configured
with all the nodes in the cluster listed, the intent being that for any
write request two replica nodes can fail and the write still succeeds as
long as one node accepts it.
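
For reference, this is roughly how the Hector client is set up (a
simplified sketch assuming the Hector 1.x API; the cluster name, host
names, row key and column names below are placeholders, not our real
values):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorWriteSketch {
    public static void main(String[] args) {
        // All 9 nodes are listed so Hector knows every host it can use.
        CassandraHostConfigurator hostConfig = new CassandraHostConfigurator(
                "node-1:9160,node-2:9160,node-3:9160,node-4:9160,node-5:9160,"
                + "node-6:9160,node-7:9160,node-8:9160,node-9:9160");
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", hostConfig);

        // Writes (and reads) go out at consistency level ONE against the
        // keyspace that has replication factor 3.
        ConfigurableConsistencyLevel consistency = new ConfigurableConsistencyLevel();
        consistency.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
        consistency.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, consistency);

        // A typical write: a single string column inserted into MyCF.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("some-row-key", "MyCF",
                HFactory.createStringColumn("some-column", "some-value"));
    }
}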

However, under certain conditions we see HTimedOutException logged while
writing to the cluster. The Hector client then fails over to the next
node in the cluster (a sketch of the failover policy involved is
included after the tpstats output below), but we noticed that the same
HTimedOutException is logged for every node, so effectively the cluster
as a whole stops accepting our writes. We checked the load on all the
nodes in the cluster: only node-3 shows a high pending MutationStage
count when nodetool tpstats is run; the other nodes are fine, with 0
active and 0 pending for all stages.

./nodetool -h localhost tpstats
Pool Name               Active   Pending    Completed   Blocked   All time blocked
ReadStage                    0         0     11116983         0                  0
RequestResponseStage         0         0   1252368951         0                  0
MutationStage               16   2177067    879092633         0                  0
ReadRepairStage              0         0      3648106         0                  0
ReplicateOnWriteStage        0         0     33722610         0                  0
GossipStage                  0         0     20504608         0                  0
AntiEntropyStage             0         0         1197         0                  0
MigrationStage               0         0           89         0                  0
MemtablePostFlusher          0         0         5659         0                  0
StreamStage                  0         0          296         0                  0
FlushWriter                  0         0         5616         0               1321
MiscStage                    0         0         5964         0                  0
AntiEntropySessions          0         0           88         0                  0
InternalResponseStage        0         0           27         0                  0
HintedHandoff                1         2         5976         0                  0

Message type         Dropped
RANGE_SLICE                0
READ_REPAIR                0
BINARY                     0
READ                     178
MUTATION               17467
REQUEST_RESPONSE           0
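
As mentioned above, Hector fails the operation over to the next node
when a host times out. For completeness, this is roughly how that
failover behaviour can be made explicit when creating the keyspace
(again a sketch, assuming the Hector 1.x API, its FailoverPolicy
constants, and that HFactory has a createKeyspace overload taking a
FailoverPolicy):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.service.FailoverPolicy;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class FailoverSketch {
    // 'cluster' is obtained the same way as in the earlier sketch.
    static Keyspace buildKeyspace(Cluster cluster) {
        ConfigurableConsistencyLevel consistency = new ConfigurableConsistencyLevel();
        consistency.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

        // ON_FAIL_TRY_ALL_AVAILABLE retries the operation on every other
        // known host before giving up, which matches the "try the next
        // node" behaviour we see in the Hector logs.
        return HFactory.createKeyspace("MyKeyspace", cluster, consistency,
                FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE);
    }
}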

We then checked whether any compaction was running on node-3 and found
the following:

./nodetool -h localhost compactionstats
pending tasks: 196
compaction type   keyspace     column family   bytes compacted   bytes total   progress
Cleanup           MyKeyspace   MyCF                 6946398685   10230720119     67.90%


Questions:

* With a replication factor of 3 on the keyspace, a client write
  consistency level of ONE, and the current Hector client and cluster
  settings, shouldn't a write in the scenario above still succeed on one
  of the other replica nodes even though node-3 is too busy or failing
  for whatever reason?

* When the Hector client fails over to the other nodes, essentially all
  of them fail as well. Why is this so?

* What factors increase the MutationStage active and pending values?

Thank you for any comments and insight

Regards,
Jason
