Good point Ted.

FYI, this is the case with the HBase<->ZK integration. The HBase region server code, acting as a client of the ZK service, was seeing >2 minute GC pauses in some cases, which caused the ZK client session to time out. Tuning has helped, but until the G1 collector is available there are some nasty cases that can still show up regardless of how much tuning you do. (That said, GC tuning on the client helps considerably; for the most part this is not an issue in a properly tuned HBase.)

Patrick

Ted Dunning wrote:
Be very cautious about misdirection here.  It is easy to focus on the ZK
server-side GCs.  In my experience, many more GC-related ZK problems are
caused by *client*-side GC.  If the client checks out for a minute, you get
disconnects and session expirations, which is good for debugging that code,
but generally indicates a really bad user experience.  Often that bad user
experience is partly covered up by your load balancer and other general
redundancy, so you may notice it first from ZK forcing you to think about
these things.
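One cheap way to see the client-side stalls Ted describes is a self-monitoring thread that sleeps for a short interval and measures how long the sleep actually took; a large overshoot usually means a stop-the-world GC pause (or severe scheduling delay) on the client JVM. A minimal sketch, not ZooKeeper-specific; the interval and threshold values are arbitrary:

```java
// Minimal stall detector: sleep a fixed interval and compare the
// wall-clock time that actually elapsed. A large overshoot means the
// JVM was stopped (GC pause) or the thread was starved.
public class StallDetector implements Runnable {
    private final long intervalMs;
    private final long thresholdMs;
    private volatile long worstOvershootMs = 0;

    public StallDetector(long intervalMs, long thresholdMs) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
    }

    public long worstOvershootMs() { return worstOvershootMs; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                return; // detector shut down
            }
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            long overshoot = elapsedMs - intervalMs;
            if (overshoot > worstOvershootMs) {
                worstOvershootMs = overshoot;
            }
            if (overshoot > thresholdMs) {
                System.err.println("Possible GC pause: slept " + intervalMs
                        + " ms, took " + elapsedMs + " ms");
            }
        }
    }
}
```

If the worst overshoot regularly approaches your ZK session timeout, the client session will expire no matter how healthy the ensemble is.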

On Mon, Jan 25, 2010 at 9:43 AM, Patrick Hunt <ph...@apache.org> wrote:

GC and disk IO (transactional log in particular) will cause significant
latency in some cases. See this for details on the types of things you
should look at:

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

I've seen cases where the JVM will pause for 2+ minutes for GC; in some
cases I've seen 4+, and I've heard of worse than that. Tuning GC (in
particular using the incremental/CMS collector) is critical for consistently
low latencies.
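For reference, on the Java 6-era JVMs discussed here, something like the following is a typical starting point. zkServer.sh passes extra arguments to the JVM via the JVMFLAGS environment variable; the heap size below is illustrative, so size it to your data set and leave headroom for the OS:

```shell
# Illustrative GC settings for the ZooKeeper server JVM (Java 6 era).
# zkServer.sh appends $JVMFLAGS to the java command line.
export JVMFLAGS="-Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
```

Pinning -Xms to -Xmx avoids heap-resize pauses; the CMS flags trade some throughput for shorter stop-the-world pauses.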

Patrick

Josh Scheid wrote:

On Fri, Jan 22, 2010 at 17:48, Mahadev Konar <maha...@yahoo-inc.com>
wrote:

 The server latency does seem huge. What OS and hardware are you running
it on?

RHEL4 2.8GHz 2-core AMD.  8GB RAM.
I will check with my admin for evidence of swap activity, but I don't
anticipate any.

This is bursty.  I'm currently seeing ~200 connections, but maxlat in
the last hour has been 185ms with 4ms avg.

 What is the usage model of zookeeper?
Distributed lock service.  Using the lock recipe.  Hosts hold a lock
for 5s to a couple of minutes with zero to dozens of waiters.

 How much memory are you allocating to the server?
That's a good question.  I'm not an expert at java deployment.  I just
use zkServer.sh defaults.
sun-jre 1.6.0_14.  It's taking 1188MB of virtual memory right now,
100MB resident.

 The debug logging will exacerbate the problem.
OK.

 A dedicated disk means the following:
ZooKeeper has snapshots and transaction logs. The datadir is the directory
that stores the transaction logs. It's highly recommended that this
directory be on a separate disk that isn't being used by any other process.
The snapshots can sit on a disk that is being used by the OS and can be
shared.
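Concretely, the split Mahadev describes is done with two settings in zoo.cfg: dataDir for snapshots and dataLogDir for the transaction log, with dataLogDir pointed at a dedicated device. The paths below are placeholders:

```
# zoo.cfg excerpt -- paths are illustrative placeholders
# Snapshots: can live on a disk shared with the OS
dataDir=/var/zookeeper/data
# Transaction log: put on a dedicated disk so fsync latency stays low
dataLogDir=/zk-txnlog/data
```

If dataLogDir is not set, the transaction log goes into dataDir alongside the snapshots.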

Yeah, I understand.  I just need to get that set up and was hoping
that my load wouldn't warrant it.

 Also, Pat ran some tests for server latencies at:
http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview

You can take a look at that as well and see what the expected performance
should be for your workload.

I will take a look at that.  Thank you for your time.

-Josh


