From region server log:

2016-04-11 03:11:51,589 WARN org.apache.zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
2016-04-11 03:11:51,589 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x52ee1452fec5ac has expired, closing socket connection

From zookeeper log:

2016-04-11 03:11:27,323 - INFO [CommitProcessor:0:NIOServerCnxn@1435] - Closed socket connection for client /172.20.67.19:58404 which had sessionid 0x52ee1452fec71f
2016-04-11 03:11:53,301 - INFO [CommitProcessor:0:NIOServerCnxn@1435] - Closed socket connection for client /172.20.67.13:32946 which had sessionid 0x52ee1452fec6ea

Note the 26 second gap.

What do you see in the logs of the other two zookeeper servers?

Thanks

On Mon, Apr 11, 2016 at 5:08 PM, Ted Tuttle <t...@mentacapital.com> wrote:

> Hello -
>
> We've started experiencing regular failures of our HBase cluster. For the
> last week we've had nightly failures about 1hr after a heavy batch process
> starts.
>
> In the logs below we see the failure starting at 2016-04-11 03:11 in
> zookeeper, master and region server logs:
>
> zookeeper: http://pastebin.com/kf7ja22K
>
> region server: http://pastebin.com/tduJgKqq
>
> master: http://pastebin.com/0szhi0bJ
>
> The master log seems most interesting. Here we see problems connecting to
> Zookeeper then a number of region servers dying in quick succession. From
> the log evidence it appears Zookeeper is not responding rather than the
> more typical GC causing isolated RS to abort.
>
> Any insights on what may be happening here?
>
> Best,
> Ted
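
For anyone who wants to check the gap between the two zookeeper log entries quoted in the reply, here is a minimal sketch. It assumes the timestamp layout seen in those lines (date, time, then a comma and milliseconds); the helper name gap_seconds is made up for illustration and is not part of HBase or ZooKeeper.

```python
from datetime import datetime

# Timestamp layout assumed from the log lines quoted in this thread,
# e.g. "2016-04-11 03:11:27,323" (a comma precedes the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the number of seconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TS_FORMAT)
    end = datetime.strptime(later, LOG_TS_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    # The two "Closed socket connection" entries from the zookeeper log.
    gap = gap_seconds("2016-04-11 03:11:27,323", "2016-04-11 03:11:53,301")
    print(f"{gap:.3f} s")
```

For the two entries quoted in the reply this comes to just under 26 seconds, which matches the gap pointed out there. The same helper could be run over consecutive lines of the other two servers' logs to see whether they went quiet over the same interval.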