zookeeper quorum failing because of high network load

2015-04-27 Thread Martin Stiborský
Hello guys, we are running a Mesos stack on CoreOS with three ZooKeeper nodes. We can start Docker containers with Marathon and all, that's fine, but some of the Docker containers generate high network load while communicating between nodes/containers, and I think that's the reason why the zook

Re: zookeeper quorum failing because of high network load

2015-04-27 Thread Tomas Barton
Hi Martin, how many ZooKeepers do you have? Is your transaction log on a dedicated disk? Approximately how many clients are connecting? Have a look at http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices Tomas On 27 April 2015 at 10:58, Martin Stiborský wrote: > Hello g

Re: zookeeper quorum failing because of high network load

2015-04-27 Thread Martin Stiborský
Hi, there are 3 ZooKeeper nodes. We started our containers and this time I watched the ZooKeepers and their condition with the "stat" command. It seems that ZooKeeper latency is not the issue: there were only about 8 connections, and the max latency was 134 ms. I'm still not sure what is the real
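
For anyone who wants to repeat the "stat" check from a script, here is a minimal Python sketch. It assumes ZooKeeper is listening on the default client port 2181 and that the four-letter-word commands are enabled (newer ZooKeeper releases need them whitelisted). The function names and the sample output are illustrative only, not taken from the thread, and the exact fields in the response may differ between ZooKeeper versions.

```python
import re
import socket

# Illustrative sample of a "stat" response (NOT real output from this thread).
SAMPLE_STAT = """Zookeeper version: 3.4.6
Clients:
 /10.0.0.5:55829[1](queued=0,recved=12,sent=12)

Latency min/avg/max: 0/2/134
Received: 120
Sent: 119
Connections: 8
Outstanding: 0
Mode: follower
Node count: 42
"""


def four_letter_word(host, port=2181, cmd=b"stat", timeout=5.0):
    """Send a four-letter-word command to a ZooKeeper server and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Extract latency, connection count and mode from a "stat" response."""
    stats = {}
    latency = re.search(
        r"^Latency min/avg/max: ([\d.]+)/([\d.]+)/([\d.]+)", text, re.MULTILINE
    )
    if latency:
        low, avg, high = (float(v) for v in latency.groups())
        stats["min_latency_ms"] = low
        stats["avg_latency_ms"] = avg
        stats["max_latency_ms"] = high
    connections = re.search(r"^Connections: (\d+)", text, re.MULTILINE)
    if connections:
        stats["connections"] = int(connections.group(1))
    mode = re.search(r"^Mode: (\w+)", text, re.MULTILINE)
    if mode:
        stats["mode"] = mode.group(1)
    return stats
```

Usage would be something like parse_stat(four_letter_word("zk1.example.com")) run against each of the three nodes, where the hostname is a placeholder.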

Re: zookeeper quorum failing because of high network load

2015-04-27 Thread Charles Baker
Hi Martin. Are these VMs or bare-metal? Is ZK running on the same 3 nodes as the Mesos cluster? Does your application also use ZooKeeper to manage its own state? Are there any other services running on the machines, and do Mesos and ZK have enough resources? And as Tomas asked, is your ZK log on

Re: zookeeper quorum failing because of high network load

2015-04-28 Thread Martin Stiborský
Hi guys, these machines are relatively beefy: Dell PowerEdge R710 with 2x QC Xeon, 144 GB RAM; CoreOS is deployed on bare metal. - ZK is running on the same 3 nodes as the Mesos cluster - our application is not using ZK - nothing else is running on the stack, only 1 Mesos master, 3 Mesos slaves and ma

Re: zookeeper quorum failing because of high network load

2015-04-28 Thread Ondrej Smola
Hi Martin, do all 3 ZooKeepers go down with the same error logs/cause? There should be some info, as one node failure should not cause ZK to fail (quorum is maintained), and the remaining nodes should at least show some info from the failure detector. The original logs you posted are after stopping zookeeper
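
The quorum point above can be checked with a little arithmetic. The following is a small Python sketch assuming ZooKeeper's default majority quorum; the function names are made up for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size, live_nodes):
    """True if the live servers still form a majority of the ensemble."""
    return live_nodes >= quorum_size(ensemble_size)
```

With a 3-node ensemble the majority is 2, so losing one node leaves the quorum intact, while losing two does not.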

Re: zookeeper quorum failing because of high network load

2015-04-28 Thread Martin Stiborský
Now I finally tracked down the real problem, and it has nothing to do with Mesos at all. It was fleet on CoreOS stopping all containers on a node, because the node was considered unresponsive from the CoreOS/etcd/fleet cluster point of view. The high CPU/network load caused the problem and fleet