OK, well, swapping, especially if combined with GC, can definitely account
for very long delays.
Not sure if anyone has mentioned this before, but take a look at the swapping
section on the ZK troubleshooting page. That section, or perhaps one of
the other sections on that page, might give you additional insight.
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
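FWIW, a quick way to confirm whether a box is actually swapping is plain
Linux tooling (nothing ZK- or HBase-specific here, and vm.swappiness is a
general kernel tunable; the value below is just an example):

    # Watch the si/so (swap in/out) columns; sustained non-zero values
    # while the region server is under load mean heap pages are being
    # swapped out.
    vmstat 5

    # Swap usage at a glance.
    free -m

    # Optionally make the kernel less eager to swap (default is usually 60).
    sysctl vm.swappiness
    sudo sysctl -w vm.swappiness=0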
Good Luck,
Patrick
Peter Falk wrote:
We had a 4GB heap for the region server, on a machine with 8GB that was
also running a data node and a ZooKeeper server. We have tried the
incremental garbage collector before, but had problems with a runaway
heap size, resulting in swapping. We are running with the parallel GC
now. When the session expiry problem occurred, we noticed swapping on
the node just before. Therefore, we are a bit afraid to increase the
heap size further, or to try the incremental GC again. We are not
running in any virtualized environment.
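For reference, the GC settings being discussed are just standard HotSpot
flags set in hbase-env.sh; a rough sketch only, the exact variable and
flags depend on your HBase and JVM version:

    # hbase-env.sh -- sketch only, adjust to your own heap budget/version.
    # 4GB heap as described above, with the parallel (throughput) collector.
    export HBASE_OPTS="-Xmx4g -Xms4g -XX:+UseParallelGC"

    # The "incremental" collector tried earlier is typically enabled with
    # -Xincgc, i.e. CMS in incremental mode:
    # export HBASE_OPTS="-Xmx4g -Xms4g -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"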
Thanks for the various responses and the recommendations. I think it
would be nice to have an option to automatically restart the region
server in situations like this.
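Even an external watchdog would do; a hypothetical sketch only, not
something HBase ships (the script name, install path, and log path below
are made up; only bin/hbase-daemon.sh is the real start script):

    #!/bin/sh
    # regionserver-watchdog.sh -- hypothetical sketch, not part of HBase.
    # Restart the region server if its process disappears (e.g. after a
    # session-expiry-triggered shutdown).
    HBASE_HOME=/opt/hbase   # adjust to your install

    while true; do
      if ! pgrep -f org.apache.hadoop.hbase.regionserver.HRegionServer > /dev/null; then
        echo "`date` region server down, restarting" >> /var/log/rs-watchdog.log
        "$HBASE_HOME/bin/hbase-daemon.sh" start regionserver
      fi
      sleep 30
    done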
TIA,
Peter
On Tue, Mar 30, 2010 at 18:25, Patrick Hunt <ph...@apache.org> wrote:
Are you running in a virtualized environment by chance? (EC2, VMware,
etc.) VMs, especially oversubscribed/overloaded VMs, can result in
significant IO/memory-related performance problems.
Patrick
Peter Falk wrote:
Thanks Jean-Daniel. I was not clear about what we have already tried; we
have tried everything you recommend in the updated wiki page, including
upping the ZooKeeper session timeout. The node was heavily loaded at the
time, and it seems the cluster was simply overloaded.
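(For anyone following the thread: the timeout in question is the
zookeeper.session.timeout property in hbase-site.xml; the value below is
only illustrative, pick something that covers your worst observed GC or
swap pauses.)

    <!-- hbase-site.xml: illustrative value only -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value> <!-- milliseconds -->
    </property>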
However, would it not be possible to automatically start the region
server again and let it request new regions? It seems dangerous to let
region servers die under heavy load like this, increasing the load
further on the remaining nodes...
Sincerely,
Peter
On Mon, Mar 29, 2010 at 19:38, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
We already had an entry in the wiki for this issue, but it wasn't super
explicit about what's happening, so I completely rewrote it using the
logs from this thread. See
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
Also, I created a JIRA about putting that link directly into the "We
slept Xms, ..." message so that people can get some answers quickly.
See https://issues.apache.org/jira/browse/HBASE-2388
J-D