I always used a large node for ZK to avoid sharing the machine, but my reason for doing that turned out to be wrong. In fact, my problem had to do with GC on the client side.
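For what it's worth, one way to confirm that sort of client-side GC problem is to time the requests yourself and line the outliers up against the GC log. A minimal sketch (the connect string and thresholds below are placeholders, not from the original thread); run the client with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps so slow requests can be correlated with collector pauses:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkLatencyProbe {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Placeholder ensemble; substitute your own hosts.
            ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 30000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        if (event.getState() == Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    }
                });
            connected.await();  // don't probe until the session is live
            while (true) {
                long start = System.currentTimeMillis();
                zk.exists("/", false);  // cheap read-only request
                long ms = System.currentTimeMillis() - start;
                if (ms > 100) {  // arbitrary threshold for this sketch
                    System.out.println(System.currentTimeMillis()
                        + " slow exists(): " + ms + " ms");
                }
                Thread.sleep(500);
            }
        }
    }

If the slow requests land exactly on full-GC pauses in the client's own GC log, the latency is in the client JVM, not the ensemble.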
I can't believe that they are seeing 50 second delays in EC2 due to I/O contention. GC can do that, but only on a large heap. Massive swapping of code pages can also cause this.

My debug path here would be:

a) Verify the facts. The key fact is that the ZK cluster is occasionally giving massive latency. This must be verified to be the real problem and not a one-off incident. It is possible that the problem is not where we think it is.

b) Check the usual configuration suspects. ZK should be alone on a machine. DNS should be checked. Connectivity should be checked between all hosts.

c) Look for swapping and look at the GC logs. Something has to give a clue as to how the latency is 1000x longer than usual.

d) Fix whatever turned up in step (b) or (c).

Other than this general advice, I am at a loss here. I strongly suspect that something is being observed incorrectly or that the machines are being massively abused.

On Wed, Sep 2, 2009 at 12:37 PM, Patrick Hunt <ph...@apache.org> wrote:

> I suspect that given a single disk is being used (not a dedicated disk for
> the transaction log), and also given that this host is highly virtualized
> (ec2), it seems to me that the most likely cause is IO. Specifically, when
> the zk cluster writes data to disk (due to a client write) it must sync the
> transaction log to disk. This sync behavior can impact the latency seen by
> the clients. What type of ec2 node are you using? Ted, do you have any
> insight on this? Any guidelines for the type of ec2 node to use for running
> a zk cluster?

-- 
Ted Dunning, CTO
DeepDyve