I think I've come across the likely culprit - I just discovered that the machines
that were dying had OS swap turned OFF.
So, I added swap to those machines, but also reduced the memory configs for
tasks (was at 1 gig, now at 512m). Seems to be stable right now, but the
nightly longer-term jobs ...
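For reference, a rough sketch of what those two changes usually look like; the
swap-file path/size and the mapred.child.java.opts property are assumptions
about the setup, not details taken from this thread:

  # add a 4 GB swap file (path and size are only examples)
  dd if=/dev/zero of=/swapfile bs=1M count=4096
  mkswap /swapfile
  swapon /swapfile

  <!-- mapred-site.xml: shrink each task's child JVM heap from 1 gig to 512m -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>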
This could indicate swapping during GC.
On Thu, Dec 9, 2010 at 12:13 PM, Lance Riedel wrote:
> Seems reasonable, but I'm having trouble making sense of the GC logs I had
> turned on, since there was a full GC a minute before this happened on that
> server and it lasted less than a second.
>
>
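A quick way to confirm that on a box while it is misbehaving (generic commands,
not something posted in this thread) is to watch swap activity during a long GC:

  vmstat 5     # non-zero si/so columns mean pages are being swapped in/out
  free -m      # shows how much swap is configured and in use

If si/so stay non-zero while a collection is running, the collector is walking a
heap that has been partly paged out, which is how a normally sub-second GC turns
into a multi-minute pause.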
Seems reasonable, but I'm having trouble making sense of the GC logs I had
turned on, since there was a full GC a minute before this happened on that
server and it lasted less than a second.
Example:
So, here is what the last of the GC logs say for that RegionServer (04.hadoop
on 10.100. ...)
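For context, "the GC logs I had turned on" on a region server of that era usually
means JVM flags along these lines in hbase-env.sh (a common setup, not necessarily
the exact flags used here; the log path is only an example):

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"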
Lance,
Both those lines indicate the problem:
IPC Server handler 13 on 60020 took 182416ms
Client session timed out, have not heard from server in 182936ms
It's very clear that your region servers are suffering from
stop-the-world garbage collection pauses. Basically this one GC'ed
for 3 minutes.
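The usual mitigations at that point (commonly recommended settings, sketched with
illustrative values rather than anything prescribed in this thread) are to tune
the collector and to give the region server more headroom before ZooKeeper
expires its session:

  # hbase-env.sh: use CMS and start collecting earlier
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

  <!-- hbase-site.xml: ZooKeeper session timeout, in milliseconds -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>

A 3-minute pause still blows through a 2-minute timeout, though, which is why
addressing the swapping and heap size matters more than the timeout itself.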
We have a 6 node cluster, 5 with region servers. 2 of the region servers have
been stable for days, but 3 of them keep crashing. Here are the logs from around
when the crash occurs. (btw, we are shoving approximately the Twitter firehose
into HBase via Flume.) I'm an HBase newbie, but I have ...