More information:

1. The balancer is running. If we stop it, failover happens only about 2-3 times a day. However, we have to keep it running because datanode usage is very uneven: 14.65% / 78.37% / 83.18% / 23.27%.
2. JVM pauses show up in the log only occasionally, and every pause is shorter than 2 seconds.
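To put numbers on why the balancer cannot simply stay off, here is a rough sketch of the imbalance. It assumes all four datanodes have the same capacity, so the cluster average is the plain mean (a real cluster weights by capacity), and it uses 10%, the default value of `hdfs balancer -threshold`. The helper name `out_of_balance` is made up for illustration.

```python
# Datanode usage percentages as reported above.
usage = [14.65, 78.37, 83.18, 23.27]

# Default Balancer threshold, in percentage points.
THRESHOLD = 10.0


def out_of_balance(usage_pct, threshold=THRESHOLD):
    """Return the mean usage and the nodes further than `threshold` from it.

    Assumption: every datanode has equal capacity, so the cluster
    utilization is the unweighted mean of the per-node percentages.
    """
    avg = sum(usage_pct) / len(usage_pct)
    offenders = [(u, u - avg) for u in usage_pct if abs(u - avg) > threshold]
    return avg, offenders


avg, offenders = out_of_balance(usage)
print(f"cluster average: {avg:.2f}%")
for used, delta in offenders:
    print(f"  node at {used:.2f}% is {delta:+.2f} points from the average")
```

Under that equal-capacity assumption, all four nodes are more than 10 points away from the average, so none of them counts as balanced.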
Wenqi Ma <mawenqi...@gmail.com> wrote on Thu, Sep 19, 2019 at 2:33 PM:

> Sure, I checked that, and it is the namenode health monitoring that is timing out, like:
>
> 2019-09-19 09:15:03,823 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dphadoop20/192.168.1.20:8020 to active state
> 2019-09-19 10:48:55,898 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at dphadoop20/192.168.1.20:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622 remote=dphadoop20/192.168.1.20:8020] Call From dphadoop20/192.168.1.20 to dphadoop20:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.20:36622 remote=dphadoop20/192.168.1.20:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
> 2019-09-19 10:48:55,898 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
>
> Then the standby namenode is transitioned to the active state, while the original active namenode gets the following FATAL error and quits:
>
> IPC's epoch 353 is less than the last promised epoch 354
>
> BTW, the stopped namenode is started up again immediately. However, since the fsimage file is huge (about 26GB), it needs about 30 minutes to load the fsimage and another 30 minutes to process block reports before it leaves safe mode.
>
> HK <hemakumar.sunn...@gmail.com> wrote on Thu, Sep 19, 2019 at 12:19 PM:
>
>> Are you checking the ZKFC process logs and jstack?
>> At what stage is ZKFC timing out? Is the zk session timing out, or is namenode health monitoring timing out?
>
> --
> Best Regards!
> Wenqi

--
Best Regards!
Wenqi