More information:
1. The balancer is running. If we stop it, failover happens only about
2-3 times a day. But we have to keep it running, because datanode usage is
very uneven: 14.65% / 78.37% / 83.18% / 23.27%
2. JVM pauses are logged only occasionally, and every pause is shorter than
2 seconds.
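
To put the imbalance from item 1 in numbers, here is a minimal Python sketch. It assumes all four datanodes have the same capacity (not stated in this thread), so the cluster average is the plain mean. The 10-point threshold is the default of the balancer's -threshold option, and the function name is made up for illustration.

```python
# Usage percentages reported in this thread, one per datanode.
USAGE_PCT = [14.65, 78.37, 83.18, 23.27]

# Default value of the HDFS balancer's -threshold option (percentage points).
DEFAULT_THRESHOLD = 10.0


def deviations_from_mean(usage_pct):
    """Return (mean, per-node deviation from the mean) in percentage points.

    Assumption: equal-capacity datanodes, so the cluster-wide utilization
    is the plain mean of the per-node values.
    """
    mean = sum(usage_pct) / len(usage_pct)
    return mean, [u - mean for u in usage_pct]


mean, devs = deviations_from_mean(USAGE_PCT)
print(f"mean utilization: {mean:.2f}%")
for used, dev in zip(USAGE_PCT, devs):
    state = "outside" if abs(dev) > DEFAULT_THRESHOLD else "within"
    print(f"{used:6.2f}%  deviation {dev:+7.2f}  ({state} threshold)")
```

Under that equal-capacity assumption every node is more than 25 points away from the mean, so the balancer has a lot of data to move.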

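For anyone checking their own ZKFC log for the same pattern as in the quoted excerpt below, here is a small Python sketch. It pulls out the health-monitor socket timeouts and measures how long the NameNode had been active before each one. The sample lines are shortened from that excerpt. The function name and regular expressions are my own and only match this particular message wording, so treat it as a starting point, not a general parser.

```python
import re
from datetime import datetime

# Shortened copies of the log lines quoted in this thread (unwrapped).
SAMPLE_LOG = """\
2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to active state
2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read.
2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
"""

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
# timestamp, level, logger name, message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (\S+): (.*)$"
)
TIMEOUT_RE = re.compile(r"SocketTimeoutException: (\d+) millis timeout")


def find_health_timeouts(log_text):
    """Yield (timestamp, timeout_ms, seconds_since_active) per timeout line."""
    active_since = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not a log record start (e.g. a wrapped continuation)
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        logger, message = m.group(3), m.group(4)
        if "to active state" in message:
            active_since = ts
        hit = TIMEOUT_RE.search(message)
        if hit and logger.endswith("HealthMonitor"):
            elapsed = (ts - active_since).total_seconds() if active_since else None
            yield ts, int(hit.group(1)), elapsed


for ts, timeout_ms, elapsed in find_health_timeouts(SAMPLE_LOG):
    print(ts, timeout_ms, elapsed)
```

On the sample lines this reports a single 45000 ms timeout, about 94 minutes after the transition to active.
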
Wenqi Ma <mawenqi...@gmail.com> wrote on Thu, Sep 19, 2019 at 2:33 PM:

> Sure, I checked that. It is the namenode health monitoring that is timing
> out, like:
>
> 2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController:
> Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to
> active state
> 2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor:
> Transport-level exception trying to monitor health of NameNode at
> dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
> remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20 to
> dphadoop20:8020 failed on socket timeout exception:
> java.net.SocketTimeoutException: 45000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622
> remote=dphadoop20/192.168.1.20:8020]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
> 2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: Entering
> state SERVICE_NOT_RESPONDING
>
> Then the standby namenode is transitioned to active state, while the
> original active namenode gets the following FATAL error and quits:
>   IPC's epoch 353 is less than the last promised epoch 354
>
> BTW, the stopped namenode is started up again immediately. However, since
> the fsimage file is huge (about 26GB), it needs about 30 minutes to load
> the fsimage and another 30 minutes to process block reports before it can
> leave safe mode.
>
>
> HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 12:19 PM:
>
>> Are you checking the ZKFC process logs and jstack?
>> At what stage is ZKFC timing out? Is the ZK session timing out, or is the
>> namenode health monitoring timing out?
>>
> --
> Best Regards!
> Wenqi
>
>

-- 
Best Regards!
Wenqi
