Hello - We've started experiencing regular failures of our HBase cluster. For the last week we've had nightly failures about 1hr after a heavy batch process starts.
In the logs below we see the failure starting at 2016-04-11 03:11 in zookeeper, master and region server logs: zookeeper: http://pastebin.com/kf7ja22K region server: http://pastebin.com/tduJgKqq master: http://pastebin.com/0szhi0bJ The master log seems most interesting. Here we see problems connecting to Zookeeper then a number of region servers dying in quick succession. From the log evidence it appears Zookeeper is not responding rather than the more typical GC causing isolated RS to abort. Any insights on what may be happening here? Best, Ted