Hi all,

One of the RegionServers in our company's cluster crashed. When that happened, I observed the following:
1. All of the RegionServers stopped handling requests from clients (requestsPerSecond=0 on the master-status UI page).
2. It took about 12-15 minutes to recover.
3. I have set hbase.regionserver.restart.on.zk.expire to true, but it did not work.

For 1, I know the cluster began splitting logs and recovering the data from the crashed RegionServer. Does the recovery operation block all requests from clients?

For 2, is there any way to reduce the recovery time?

For 3, I checked the log and found a session timeout exception. Perhaps a full GC caused the ZooKeeper session to time out. But why does hbase.regionserver.restart.on.zk.expire not work?

My HBase version is 0.94.0.

Thanks for any suggestions and feedback!

Fowler Zhang
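
P.S. Regarding point 3, one thing worth ruling out is that the property is missing from the hbase-site.xml the RegionServer actually reads. Below is a minimal sketch of such a check. The XML is a stand-in containing only the property named above, not the real cluster configuration, and read_site_properties is a hypothetical helper name. zookeeper.session.timeout is looked up as well, since the session timeout is what expired, but no value is assumed for it.

```python
import xml.etree.ElementTree as ET

# Stand-in for hbase-site.xml (assumption: not the real cluster file).
# It holds only the property mentioned in this post.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>hbase.regionserver.restart.on.zk.expire</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_site_properties(xml_text):
    """Return a dict of name -> value from Hadoop-style <property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


props = read_site_properties(SAMPLE_SITE_XML)
for key in ("hbase.regionserver.restart.on.zk.expire",
            "zookeeper.session.timeout"):
    # A missing key means the file does not override the default.
    print(key, "=", props.get(key, "<not set in this file>"))
```

To run it against a real file, replace SAMPLE_SITE_XML with the contents of the hbase-site.xml on the RegionServer host. This only confirms what is in the file; it says nothing about whether the server honors the setting.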
