Re: unstable cluster

2016-04-11 Thread Ted Yu
From region server log:

2016-04-11 03:11:51,589 WARN org.apache.zookeeper.ClientCnxnSocket:
Connected to an old server; r-o mode will be unavailable
2016-04-11 03:11:51,589 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x52ee1452fec5ac has expired,
closing socket connection

From ZooKeeper log:

2016-04-11 03:11:27,323 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.19:58404 which had sessionid
0x52ee1452fec71f
2016-04-11 03:11:53,301 - INFO  [CommitProcessor:0:NIOServerCnxn@1435] -
Closed socket connection for client /172.20.67.13:32946 which had sessionid
0x52ee1452fec6ea

Note the 26-second gap between these two ZooKeeper log entries.
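
For reference, the gap can be checked directly from the two ZooKeeper timestamps quoted above. This is only a minimal sketch: the helper name gap_seconds and the format string are my own, and it assumes the usual log4j-style "date time,millis" layout seen in these lines.

```python
from datetime import datetime

# Assumed layout of the log timestamps: "YYYY-MM-DD HH:MM:SS,mmm".
FMT = "%Y-%m-%d %H:%M:%S,%f"

def gap_seconds(earlier: str, later: str) -> float:
    """Return the elapsed seconds between two log timestamps (hypothetical helper)."""
    start = datetime.strptime(earlier, FMT)
    end = datetime.strptime(later, FMT)
    return (end - start).total_seconds()

# Timestamps copied from the two "Closed socket connection" lines.
gap = gap_seconds("2016-04-11 03:11:27,323", "2016-04-11 03:11:53,301")
print(f"{gap:.3f} s")
```

The same helper can be pointed at consecutive lines from the other two ZooKeeper servers to see whether they show a similar pause around 03:11.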

What do you see in the logs of the other two ZooKeeper servers?

Thanks

On Mon, Apr 11, 2016 at 5:08 PM, Ted Tuttle wrote:

> Hello -
>
> We've started experiencing regular failures of our HBase cluster.  For the
> last week we've had nightly failures about 1hr after a heavy batch process
> starts.
>
> In the logs below we see the failure starting at 2016-04-11 03:11 in
> zookeeper, master and region server logs:
>
> zookeeper:  http://pastebin.com/kf7ja22K
>
> region server: http://pastebin.com/tduJgKqq
>
> master:  http://pastebin.com/0szhi0bJ
>
> The master log seems most interesting.  Here we see problems connecting to
> ZooKeeper, then a number of region servers dying in quick succession.  From
> the log evidence it appears ZooKeeper is not responding, rather than the
> more typical case of a GC pause causing an isolated RS to abort.
>
> Any insights on what may be happening here?
>
> Best,
> Ted
>


unstable cluster

2016-04-11 Thread Ted Tuttle
Hello -

We've started experiencing regular failures of our HBase cluster.  For the last 
week we've had nightly failures about 1hr after a heavy batch process starts.

In the logs below we see the failure starting at 2016-04-11 03:11 in zookeeper, 
master and region server logs:

zookeeper:  http://pastebin.com/kf7ja22K

region server: http://pastebin.com/tduJgKqq

master:  http://pastebin.com/0szhi0bJ

The master log seems most interesting.  Here we see problems connecting to
ZooKeeper, then a number of region servers dying in quick succession.  From the
log evidence it appears ZooKeeper is not responding, rather than the more
typical case of a GC pause causing an isolated RS to abort.

Any insights on what may be happening here?

Best,
Ted