Hi J-D,

  Yes, I ran an MR job on my cluster, and when I set the MR configs as below,
that long GC pause occurred.
  MR config (4-core CPU per RS/DN/TT node):
  mapred.tasktracker.reduce.tasks.maximum = 3
  mapred.tasktracker.map.tasks.maximum = 4
  mapred.reduce.slowstart.completed.maps = 0.05

  So I tried adjusting some configs as below and ran the same job, to see
whether the error occurs in the map phase or the reduce phase.
  MR config:
  mapred.tasktracker.reduce.tasks.maximum = 7
  mapred.tasktracker.map.tasks.maximum = 7
  mapred.reduce.slowstart.completed.maps = 1
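
  For reference, this is a rough sketch of how I set these properties in
mapred-site.xml on each TaskTracker node (standard Hadoop 0.20 XML format;
the values shown are the second set above):

  <configuration>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
      <!-- concurrent map slots per TaskTracker -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>7</value>
      <!-- concurrent reduce slots per TaskTracker -->
    </property>
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>1</value>
      <!-- reducers start only after all maps have completed -->
    </property>
  </configuration>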

  Finally, I found that the long GC pause occurs in the reduce phase, and my
CPU usage is almost always at 100% during that phase. Does that mean high CPU
usage causes the long GC pause? Besides decreasing tasks.maximum, what else
can I do to avoid the problem, and is there any criterion for choosing these
values? When I decrease tasks.maximum and run the same job, it's fine.
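
  One thing I am considering, as a guess on my part rather than something from
the docs: raising the ZooKeeper session timeout in hbase-site.xml, so that a
long GC pause does not immediately expire the region server's session. A
sketch of what I mean (120000 ms is only an example value I picked, and I am
assuming the property name is the same in 0.20.6):

  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
    <!-- example value only; the idea is to give the RS enough headroom
         to survive a long GC pause without losing its session -->
  </property>

  Would something like that be reasonable, or does it just hide the GC problem?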

Shen

On Fri, Jan 7, 2011 at 3:20 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Shen,
>
> It's a design decision: we have historically preferred to let cluster
> managers decide whether they want to restart a process that died, or
> investigate why it died and then decide what to do. You can easily
> write tools that will restart the region servers if they die, but the
> fact that they die in the first place is the real issue.
>
> Looking at your logs, I cannot tell exactly why your region server
> died (also, the master log you gave refers to the death of PC3 one
> hour later, not PC4). I do see that the zookeeper server expired the
> session almost a whole minute before the region server figured it out,
> but the RS is really quiet... Is there anything else running on that
> cluster that doesn't touch HBase but could affect it? Like MR jobs
> that don't use HBase, or something like that?
>
> J-D
>
> On Wed, Jan 5, 2011 at 11:44 PM, ChingShen <chingshenc...@gmail.com>
> wrote:
> > Hi all,
> >
> >    I ran into a problem where a long GC pause causes the region server's
> > local zookeeper client to miss its heartbeats, so the session times out.
> >  But I want to know why the HBase master sends a MSG_REGIONSERVER_STOP op
> > to the region server to stop its services rather than reinitializing a new
> > zookeeper client or restarting the region server?
> >
> >   There are 3 RS/DN/TT nodes and 1 MS/NN/JT node in my cluster
> > (Hadoop 0.20.2, HBase 0.20.6), and vm.swappiness is set to zero.
> >
> > hbase-ites-master-clusterPC1.log
> > 2011-01-06 13:10:57,003 INFO org.apache.hadoop.hbase.master.ServerManager: clusterPC4,60020,1294280765301 znode expired
> > 2011-01-06 13:10:57,004 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessServerShutdown of ites-clusterPC4,60020,1294280765301
> > 2011-01-06 13:10:57,004 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server clusterPC4,60020,1294280765301: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> > 2011-01-06 13:10:57,007 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 1 hlog(s) in hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301
> > 2011-01-06 13:10:57,007 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting hlog 1 of 1: hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301/hlog.dat.1294280765667, length=0
> > .............
> >
> >
> > hbase-ites-regionserver-clusterPC4.log:
> > 2011-01-06 12:21:15,773 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> > 2011-01-06 13:11:03,849 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x12d59208f560000 to sun.nio.ch.selectionkeyi...@402f0df1
> > java.io.IOException: TIMED OUT
> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > 2011-01-06 13:11:09,628 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> > 2011-01-06 13:11:31,491 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> > ............
> >
> > Please see the attached files.
> > Thanks.
> >
> > Shen
> >
>
