Long GC pause question

2010-12-28 Thread ChingShen
Hi all, I have run into a problem where a long GC pause prevents the region server's local ZooKeeper client from sending heartbeats, so the session times out. But I want to know why the HBase master sends a MSG_REGIONSERVER_STOP op to the region server to stop its services rather than reinitialize a new zookeepe

Re: Long GC pause question

2010-12-28 Thread Stack
On Tue, Dec 28, 2010 at 6:59 PM, ChingShen wrote: >  But I want to know why the HBase master sends a MSG_REGIONSERVER_STOP op to > region server to stop its services rather than reinitialize a new zookeeper > client or restart region server? > Can I see more of the regionserver log? If session expired,

Re: Long GC pause question

2010-12-28 Thread ChingShen
Hi St.Ack, Please see the attached file. There are 3 RS/DN/TT + 1 MS/NN/JT in my cluster (Hadoop-0.20.2, HBase 0.20.6). Thanks. Shen On Wed, Dec 29, 2010 at 1:34 PM, Stack wrote: > On Tue, Dec 28, 2010 at 6:59 PM, ChingShen > wrote: > > But I want to know why the HBase master sends a MS

Re: Long GC pause question

2010-12-29 Thread Stack
OK. There is nothing enlightening there. There didn't seem to be a master log in the attachment. I should have asked you to include that. I see that one server thought the filesystem had gone away. Did you pull HDFS out from under it at around this time, by any chance? St.Ack On Tue, Dec 28, 2010 at 1

Re: Long GC pause question

2011-01-06 Thread Jean-Daniel Cryans
Shen, It's a design decision: historically we have preferred to let cluster managers decide whether to restart processes that died, or to investigate why they died and then decide what to do. You can easily write tools that will restart the region servers if they die, but the f
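As an illustration of the kind of restart tool mentioned above, here is a minimal sketch in Python. It is not part of HBase; the `supervise` function, its parameters, and the restart limit are all hypothetical, and the command to run is left to the operator. It assumes the supervised process runs in the foreground so that its exit can be observed.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_seconds=5.0):
    """Run cmd in the foreground and start it again each time it exits.

    cmd is a list such as ["/path/to/start-script", "arg"]; which command
    starts a region server in the foreground is an assumption left to the
    operator. Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        # Blocks until the child process exits, then reports its exit code.
        exit_code = subprocess.call(cmd)
        print("process exited with code %d" % exit_code)
        if restarts >= max_restarts:
            # Give up so that a process dying immediately cannot loop forever.
            return restarts
        restarts += 1
        time.sleep(delay_seconds)
```

Calling `supervise(cmd, max_restarts=2)` runs the command up to three times in total. The restart cap reflects the concern in the message above: a blind restart hides the reason the process died, so a tool like this should stop and leave the investigation to whoever manages the cluster.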

Re: Long GC pause question

2011-01-07 Thread ChingShen
Hi J-D, Yes, I run an MR job on my cluster, and the long GC pause occurs when I set the MR configs as below. MR config (4-core CPU per RS/DN/TT node):
mapred.tasktracker.reduce.tasks.maximum = 3
mapred.tasktracker.map.tasks.maximum = 4
mapred.reduce.slowstart.completed.maps = 0.05

Re: Long GC pause question

2011-01-10 Thread Jean-Daniel Cryans
Your MR job is likely generating a lot of IO and possibly starving HBase while it's running (it would require some monitoring on your end to figure that out). Fewer tasks per machine will leave more breathing room; there aren't that many ways to unload overloaded machines. J-D On Fri, Jan 7, 2011 a
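To make the "breathing room" point concrete, the slot settings quoted in this thread can be compared against the 4 cores per node. The sketch below is only rough arithmetic: the `oversubscription` helper is hypothetical, and it assumes every task slot and every daemon (RS, DN, TT) can keep one core busy at the same time, which real workloads may not do. With `mapred.reduce.slowstart.completed.maps = 0.05`, reducers are scheduled once 5% of the maps have finished, so map and reduce slots can be occupied together for most of the job.

```python
def oversubscription(cores, map_slots, reduce_slots, daemons=0):
    """Return runnable processes per core, assuming one busy core per process."""
    return (map_slots + reduce_slots + daemons) / float(cores)


# Values from the thread: 4 cores, 4 map slots, 3 reduce slots per node.
tasks_only = oversubscription(cores=4, map_slots=4, reduce_slots=3)
# Adding the three co-located daemons (RS, DN, TT) is an assumption about load.
with_daemons = oversubscription(cores=4, map_slots=4, reduce_slots=3, daemons=3)

print("tasks only:   %.2f processes per core" % tasks_only)    # 1.75
print("with daemons: %.2f processes per core" % with_daemons)  # 2.50
```

Under these assumptions the node is asked to run well over one process per core, which is consistent with the advice to lower the per-machine task limits. The right values depend on measured CPU and IO load, so monitoring should guide the final numbers.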