Re: Crashing Region Servers

2010-12-09 Thread Lance Riedel
I came across what is probably the culprit: I just discovered that the machines that were dying had OS swap turned OFF. So I added swap to those machines, and also reduced the memory configs for tasks (was at 1 gig, now at 512m). Seems to be stable right now, but the nightly longer term jobs
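For anyone wanting to check the same thing on their own nodes, here is a minimal sketch that reports whether swap is configured, based on the `SwapTotal:` line of Linux's `/proc/meminfo`. The function name `swap_total_kb` and the sample text are made up for illustration; nothing here is taken from the cluster in this thread.

```python
def swap_total_kb(meminfo_text):
    """Return the SwapTotal value in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return None


# Hypothetical sample, shaped like /proc/meminfo on a box with swap off.
SAMPLE_MEMINFO = """MemTotal:        8167848 kB
MemFree:          224512 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""

total = swap_total_kb(SAMPLE_MEMINFO)
if total == 0:
    print("swap is OFF")
elif total is None:
    print("no SwapTotal line found")
else:
    print(f"swap configured: {total} kB")
```

On a real Linux machine the text would come from `open("/proc/meminfo").read()` instead of the sample string.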

Re: Crashing Region Servers

2010-12-09 Thread Ted Dunning
This could indicate swapping during GC.

On Thu, Dec 9, 2010 at 12:13 PM, Lance Riedel wrote:
> Seems reasonable, but having trouble making sense of the GC logs I had
> turned on. Basically since there was a full GC a minute before this happens
> on that server that lasts less than a second.

Re: Crashing Region Servers

2010-12-09 Thread Lance Riedel
Seems reasonable, but I'm having trouble making sense of the GC logs I had turned on, basically because there was a full GC a minute before this happens on that server, and it lasts less than a second. Example: So, here is what the last of the GC logs say for that Regionserver (04.hadoop on 10.100.

Re: Crashing Region Servers

2010-12-09 Thread Jean-Daniel Cryans
Lance,

Both those lines indicate the problem:

IPC Server handler 13 on 60020 took 182416ms
Client session timed out, have not heard from server in 182936ms

It's very clear that your region servers are suffering from stop-the-world garbage collection issues. Basically this one GC'ed for 3 m
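As a rough way to spot this pattern in a log, the sketch below pulls the millisecond durations out of lines like the two quoted above and flags the long ones. The regex, the helper names, and the 60-second threshold are assumptions chosen for illustration; they are not an HBase or ZooKeeper setting.

```python
import re

# Matches a millisecond duration such as "took 182416ms" or "in 182936ms".
DURATION_RE = re.compile(r"(?:took|in) (\d+)ms")


def pause_ms(line):
    """Return the duration in ms reported by a log line, or None if absent."""
    m = DURATION_RE.search(line)
    return int(m.group(1)) if m else None


def suspicious_pauses(lines, threshold_ms=60_000):
    """Return (ms, line) pairs whose duration meets the assumed threshold."""
    flagged = []
    for line in lines:
        ms = pause_ms(line)
        if ms is not None and ms >= threshold_ms:
            flagged.append((ms, line))
    return flagged


# The two lines quoted in this message.
LOG_LINES = [
    "IPC Server handler 13 on 60020 took 182416ms",
    "Client session timed out, have not heard from server in 182936ms",
]

for ms, line in suspicious_pauses(LOG_LINES):
    print(f"{ms / 1000:.1f}s  {line}")
```

Both durations are close to 182 seconds, which is why the handler stall and the session timeout point at the same pause.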

Crashing Region Servers

2010-12-09 Thread Lance Riedel
We have a 6-node cluster, 5 with region servers. 2 of the region servers have been stable for days, but 3 of them keep crashing. Here are the logs from around when the crash occurs. (btw, we are shoving approximately the twitter firehose into hbase via flume) I'm an hbase newbie, but I have