OK, well, swapping, especially if combined with GC, can definitely account for very long delays.

Not sure if anyone pointed this out before, but take a look at the swapping section on the ZK troubleshooting page. That section, or perhaps one of the other sections on that page, might give you additional insight.
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
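
To confirm whether a node is actually dipping into swap, something along these lines may help. It is only a minimal sketch, assuming a Linux host that exposes SwapTotal and SwapFree in /proc/meminfo; the function names parse_meminfo and swap_used_kb are made up for illustration and are not part of ZooKeeper or HBase.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)
```

On the node itself you would read the file and pass its contents in, e.g. swap_used_kb(parse_meminfo(open("/proc/meminfo").read())). Sampling that periodically and comparing it against the times of the session expirations should show whether the two line up.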

Good Luck,

Patrick

Peter Falk wrote:
We had a 4GB heap for the region server, on a machine with 8GB that was also running a data node and a ZooKeeper. We tried the incremental garbage collector before, but had problems with a runaway heap size, resulting in swapping. We are running with the parallel GC now. When the session expiration problem occurred, we noticed swapping on the node just before. Therefore, we are a bit afraid to increase the heap size further, or to try the incremental GC again. We are not running in any virtualized environment.

Thanks for the various responses and recommendations. I think it would be nice to have an option to automatically restart the region server in situations like this.

TIA,
Peter

On Tue, Mar 30, 2010 at 18:25, Patrick Hunt <ph...@apache.org> wrote:

    Are you running in a virtualized environment by chance? (EC2,
    VMware, etc...) VMs, esp. oversubscribed/overloaded VMs, can result
    in significant I/O and memory related performance problems.

    Patrick


    Peter Falk wrote:

        Thanks Jean-Daniel. I was not clear about what we have already
        tried: we have tried everything you recommend in the updated
        wiki page, including upping the ZooKeeper session timeout. The
        node was heavily loaded at the time and it seems the cluster
        was simply overloaded.

        However, would it not be possible to automatically start the
        region server again and let it request new regions? It seems
        dangerous to let region servers die under heavy load like
        this, which increases the load further on the remaining
        nodes...

        Sincerely,
        Peter

        On Mon, Mar 29, 2010 at 19:38, Jean-Daniel Cryans
        <jdcry...@apache.org> wrote:

            We already had an entry in the wiki for this issue but it
            wasn't super explicit about what's happening, so I
            completely rewrote it using the logs from this thread. See
            http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9

            Also, I created a JIRA about putting that link directly
            into the "We slept Xms, ..." message so that people can
            get some answers quickly.
            See https://issues.apache.org/jira/browse/HBASE-2388

            J-D
