Take a look at this section to start:
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems

What type of monitoring are you doing on your cluster? You could monitor at both the host and the Java (JMX) level. That will give you some insight on where to look: CPU, memory, disk, network, etc. Also the ZooKeeper JMX beans will give you information about latencies and such (you can even use the "four letter words" for that if you want to hack up some scripts instead of using JMX). JMX will also give you insight into the JVM's workings, so for example you could confirm or rule out the scenario outlined by Nitay (GC causing the JVM's Java threads to hang for > 30 sec at a time, including the ZK heartbeat).
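
If you want to eyeball the latencies without wiring up JMX, here's a rough sketch (host/port are just examples; 2181 is the default client port) that sends the "stat" four letter word to a server over a plain socket and dumps the reply, which includes the min/avg/max latency line:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class ZkStat {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        Socket sock = new Socket(host, port);
        try {
            OutputStream out = sock.getOutputStream();
            out.write("stat".getBytes("US-ASCII")); // the four letter word
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), "US-ASCII"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // includes "Latency min/avg/max: ..."
            }
        } finally {
            sock.close();
        }
    }
}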

I've seen something similar to what you describe a few times now; in each case it was something different. In one case, for example, there was a cluster of 5k clients attaching to a ZK cluster and ~20% of the clients had misconfigured NICs. That caused high TCP packet loss (and therefore high network latency), which led to a situation similar to what you are seeing, but only under fairly high network load (which made it hard to track down!).

I've also seen situations where people run the entire ZK cluster on a set of VMware VMs, all on the same host system. Latency in that configuration was >>> 10 sec in some cases due to resource issues (in particular I/O - see the link I provided above, dedicated log devices are critical to low latency operation of the ZK cluster).
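
For example, a zoo.cfg fragment along these lines (the paths here are made up) points the transaction log at its own device via dataLogDir, separate from the snapshot dataDir:

# illustrative zoo.cfg fragment - transaction log on a dedicated device
dataDir=/var/zookeeper/data
dataLogDir=/mnt/dedicated-disk/zookeeper/txnlog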


In your scenario I think a 5 sec timeout is too low, probably much too low. Why? You are running in a virtualized environment on non-dedicated hardware outside your control/inspection. There is typically no way to tell (unless you are running on the 8 core EC2 systems) whether the EC2 host you are running on is over- or under-subscribed with other VMs. There is no way to control disk latency either. You could be seeing large latencies due to resource contention on the EC2 host alone. In addition, I've heard that network latencies in EC2 are high relative to what you would see if you were running your own dedicated environment. It's hard to tell what server-to-server and client-to-server latency you are actually seeing within the EC2 environment without measuring it.

Keep in mind that the timeout period is used by both the client and the server. If the ZK leader doesn't hear from the client within the timeout (say it's 5 sec) it will expire the session. The client sends a ping after 1/3 of the timeout period has elapsed. It expects to hear a response before another 1/3 of the timeout elapses, after which it will attempt to re-sync to another server in the cluster. So in the 5 sec timeout case you are allowing roughly 1.7 seconds (a third of the timeout) for the request to reach the server, the server to respond back to the client, and the client to process the response. Check the latencies in ZK's JMX as I suggested to the HBase team to get insight into this (i.e. if the server latency is high, say because of I/O issues, JVM swapping, VM latency, etc., that will cause client sessions to time out).
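
On the client side, bumping the timeout looks roughly like the sketch below (the connect string and the 30 sec value are illustrative, not a recommendation; the server negotiates the final timeout within its configured min/max bounds). The watcher is also where you see the difference between a disconnect, which the client library recovers from on its own, and an expiration, which means the handle is dead:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionWatcher implements Watcher {
    public void process(WatchedEvent event) {
        switch (event.getState()) {
        case SyncConnected:
            // connected (or reconnected) in time - the session is intact
            break;
        case Disconnected:
            // lost the server; the client lib tries the other servers itself
            break;
        case Expired:
            // the leader expired the session - ephemerals and watches are gone,
            // throw this handle away and construct a new ZooKeeper instance
            break;
        default:
            break;
        }
    }

    public static void main(String[] args) throws Exception {
        // second argument is the requested session timeout in milliseconds
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
                                     30000, new SessionWatcher());
        // ... use zk ...
    }
}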

Hope this helps.

Patrick

Mahadev Konar wrote:
Hi Ted,
> These problems seem to manifest around getting lots of anomalous disconnects
> and session expirations even though we have the timeout values set to 2
> seconds on the server side and 5 seconds on the client side.


Your scenario might be a little different from what Nitay (HBase) is
seeing. In their scenario the ZooKeeper client was not able to send out
pings to the server because GC was stalling threads in their ZooKeeper
application process.
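
If you suspect the same thing in your setup, one way to confirm it (the flags below are for the Sun JVMs of that vintage, and the app name is a placeholder) is to turn on GC logging in the client process and look for long pauses that line up with the expirations:

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:zk-client-gc.log -jar your-app.jar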

The latencies that ZooKeeper clients see are directly related to the ZooKeeper server
machines. They are very much dependent on the disk I/O latencies on the
ZooKeeper servers and the network latencies within your cluster.

I am not sure how sensitive you want your ZooKeeper application to be,
but increasing the timeout should help. Also, we recommend using a
dedicated disk for the ZooKeeper transaction log.

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html#sc_strengthsAndLimitations

Also, we have seen NTP having problems and clocks going backwards on one of our
VM setups. This would lead to sessions getting timed out earlier than the set
session timeout.

I hope this helps.


mahadev

On 4/14/09 5:48 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

We have been using EC2 as a substrate for our search cluster with zookeeper
as our coordination layer and have been seeing some strange problems.

These problems seem to manifest around getting lots of anomalous disconnects
and session expirations even though we have the timeout values set to 2
seconds on the server side and 5 seconds on the client side.

Has anybody else been seeing this?

Is this related to clock jumps in a virtualized setting?

On a related note, what is best practice for handling session expiration?
Just deal with it as if it is a new start?
