Hello,

Some context on our environment: we have a cluster of 9 nodes with a few keyspaces. The client writes with a consistency level of ONE to a keyspace that has a replication factor of 3. The Hector client is configured with all the nodes in the cluster specified, the intent being that on any write request two nodes can fail and the write still succeeds on one node.
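For reference, here is a minimal sketch of the kind of Hector setup described above (assumes Hector 1.x; the cluster name and node host names are placeholders, while MyKeyspace and MyCF are the names from the output below):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorSetup {
    public static void main(String[] args) {
        // All 9 nodes are listed so Hector can fail over between them.
        CassandraHostConfigurator hosts = new CassandraHostConfigurator(
                "node-1:9160,node-2:9160,node-3:9160,node-4:9160,node-5:9160,"
                + "node-6:9160,node-7:9160,node-8:9160,node-9:9160");
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", hosts);

        // Write at consistency level ONE: with RF=3, only one of the
        // three replicas has to acknowledge the write.
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, ccl);

        StringSerializer ss = StringSerializer.get();
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        mutator.insert("row-key", "MyCF",
                HFactory.createStringColumn("col", "value"));
    }
}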
However, under certain situations we see HTimedOutException logged while writing to the cluster. The Hector client then fails over to the next node, but we noticed that the same HTimedOutException is logged for every node, with the result that the cluster as a whole stops accepting our writes. We checked the load on all the nodes in the cluster; only node-3 shows a high pending count for MutationStage when nodetool tpstats is run. The other nodes are fine, with 0 active and 0 pending for all stages.

./nodetool -h localhost tpstats
Pool Name                Active   Pending     Completed   Blocked  All time blocked
ReadStage                     0         0      11116983         0                 0
RequestResponseStage          0         0    1252368951         0                 0
MutationStage                16   2177067     879092633         0                 0
ReadRepairStage               0         0       3648106         0                 0
ReplicateOnWriteStage         0         0      33722610         0                 0
GossipStage                   0         0      20504608         0                 0
AntiEntropyStage              0         0          1197         0                 0
MigrationStage                0         0            89         0                 0
MemtablePostFlusher           0         0          5659         0                 0
StreamStage                   0         0           296         0                 0
FlushWriter                   0         0          5616         0              1321
MiscStage                     0         0          5964         0                 0
AntiEntropySessions           0         0            88         0                 0
InternalResponseStage         0         0            27         0                 0
HintedHandoff                 1         2          5976         0                 0

Message type        Dropped
RANGE_SLICE               0
READ_REPAIR               0
BINARY                    0
READ                    178
MUTATION              17467
REQUEST_RESPONSE          0

We then checked whether any compaction was running on node-3 and found the following:

./nodetool -h localhost compactionstats
pending tasks: 196
compaction type   keyspace     column family   bytes compacted   bytes total   progress
Cleanup           MyKeyspace   MyCF                 6946398685   10230720119     67.90%

Questions:

* With a replication factor of 3 on the keyspace and a client write consistency level of ONE, given the Hector client settings and cluster settings above, should a write still succeed on one of the other replicas even though node-3 is too busy or failing for whatever reason?
* When the Hector client fails over to the other nodes, they all fail as well. Why is this?
* What factors drive up the MutationStage active and pending values?

Thank you for any comments and insight.

Regards,
Jason