One thing you might try is raising the consensus RPC timeout from 1 second to 30 seconds. We changed the default in later versions.
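
As a rough sketch of that change, assuming the flag in question is consensus_rpc_timeout_ms (given in milliseconds) and that your tablet servers read a gflags flagfile, the line would look something like the one below. Please check the flag name against the configuration reference for your version. It may also be tagged as advanced, in which case unlock_advanced_flags would be needed as well.

```
# Tablet server flagfile fragment (flag name is an assumption; verify for your version)
# Raise the consensus RPC timeout from 1 second to 30 seconds.
--consensus_rpc_timeout_ms=30000
```

The tablet servers would need a restart to pick this up.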
I'd also recommend upgrading to 1.4 or 1.5 for other related fixes to consensus stability. I think I recall you were still on 1.3?

Todd

On Nov 3, 2017 7:47 PM, "Lee King" <yuyunliu...@gmail.com> wrote:

Hi,

Our Kudu cluster has run well for a long time, but writes have become slow recently, and clients are also hitting RPC timeouts. I checked the warnings and found a large number of errors like this:

W1104 10:25:16.833736 10271 consensus_peers.cc:365] T 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next heartbeat period. Already tried 5 times.

I changed the configuration to rpc_service_queue_length=400 and rpc_num_service_threads=40, but it had no effect.

Our cluster includes 5 masters and 10 tablet servers (ts), with 3800G of data and 800 tablets per ts. I checked one of the ts machines: 14G of memory left (128 in all), 4739 threads (max 32000), 28000 open files (max 65536), CPU utilization about 30% (32 cores), and disk utilization below 30%.

Any suggestions for this?

Thanks!