Thanks for your analysis, Ryan. The dev cluster has half as many nodes as our staging cluster, and each dev node has half as many cores as a staging node.
I agree with your conclusion. I will report back after I collect more
data - the flow uses HBase heavily toward the end.

On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <[email protected]> wrote:
> I'm seeing not much in the way of errors; timeouts, all to one machine
> ending with .80, so that is probably your failed node.
>
> Other than that, the log doesn't seem to say too much. Searching for
> strings like FATAL and Exception is the way to go here.
>
> Also things like this:
> 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
>
> Guessing, I'd probably say your nodes hit some performance wall, with
> io-wait, or networking, or something, and regionserver processes
> stopped responding but did not time out from ZooKeeper yet... so you
> would run into a situation where some nodes are unresponsive, so any
> data hosted there would be difficult to talk to. Until the
> regionserver times out its ZooKeeper node, the master doesn't know
> about the fault of the regionserver.
>
> The master web UI is probably inaccessible because the META table is
> on a regionserver that went AWOL. You should check your load and your
> Ganglia graphs. Also remember, despite having lots of disks, each
> node has gigabit ethernet, which means about 110-120 MB/sec. It's
> quite possible you are running into network limitations; remember that
> regionservers must write to 2 additional datanodes, and there will be
> overlap, thus you have to share some of that 110-120 MB/sec per-node
> figure with other nodes, not to mention that you also need to factor in
> inbound bandwidth (from client -> hbase regionserver) and outbound
> bandwidth (from datanode replica 1 -> dn replica 2).
>
> -ryan
>
> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <[email protected]> wrote:
> > Now I cannot access the master web UI. This happened after I doubled
> > the amount of data processed in our flow.
> >
> > I am attaching the master log.
> >
> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <[email protected]> wrote:
> >>
> >> This is the cause:
> >>
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> >> serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
> >> load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
> >> regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
> >> received expired from ZooKeeper, aborting
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>
> >> Why did the session expire? Typically it's GC; what do your GC logs
> >> say? Otherwise, network issues perhaps? Swapping? Other
> >> machine-related system problems?
> >>
> >> -ryan
> >>
> >>
> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <[email protected]> wrote:
> >> > I ran 0.90 RC3 in dev cluster.
> >> >
> >> > I saw the following in region server log:
> >> >
> >> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> > currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as
> >> > dead server
> >> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> >> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
> >> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> >> >
> >> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >> >         at $Proxy0.regionServerReport(Unknown Source)
> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
> >> >         ... 2 more
> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
> >> > 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
> >> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> >> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> >> >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> >> >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> >> >
> >> > ---------------
> >> >
> >> > And the following from master log:
> >> >
> >> > 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
> >> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
> >> > 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> > 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> > 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
> >> > 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
> >> > 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0
> >> >
> >> > Please advise what could be the cause.
> >> >
> >> > Thanks
> >> >
> >
>
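
Following up on Ryan's GC question above: a single stop-the-world pause
longer than the ZooKeeper session timeout (sessionTimeout=90000, i.e. 90
seconds, in the regionserver log) is enough for the session to expire and
the regionserver to abort, and much shorter pauses combined with network
delay can have the same effect. A minimal sketch of flagging long pauses,
assuming (this is an assumption, not something the logs above confirm)
that the regionserver JVM was started with -Xloggc:<file> and
-XX:+PrintGCApplicationStoppedTime so the GC log contains "Total time for
which application threads were stopped" lines:

    import re
    import sys

    # Flag long stop-the-world pauses in a HotSpot GC log.
    # Pauses approaching the 90-second ZooKeeper session timeout are fatal;
    # anything in the tens of seconds already deserves a look.
    WARN_THRESHOLD_SEC = 10.0

    PAUSE_RE = re.compile(
        r"Total time for which application threads were stopped: ([\d.]+) seconds")

    def long_pauses(gc_log_path, threshold=WARN_THRESHOLD_SEC):
        """Yield GC log lines whose pause time exceeds the threshold."""
        with open(gc_log_path) as gc_log:
            for line in gc_log:
                match = PAUSE_RE.search(line)
                if match and float(match.group(1)) >= threshold:
                    yield line.rstrip()

    if __name__ == "__main__":
        for pause in long_pauses(sys.argv[1]):
            print(pause)

Run against the GC log of the aborted regionserver, any hits in the tens
of seconds point at GC rather than the network.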

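Ryan's gigabit-ethernet point can also be made concrete with a rough
back-of-the-envelope. The numbers below are illustrative assumptions (a
~115 MB/sec usable NIC, HDFS replication factor 3, writes spread evenly
across nodes), not measurements from this cluster:

    # Crude ceiling on per-node write throughput under HDFS replication.
    # All numbers are illustrative assumptions, not cluster measurements.

    GIGABIT_MB_PER_SEC = 115.0   # ~110-120 MB/sec usable on gigabit ethernet
    REPLICATION_FACTOR = 3       # HDFS default: first replica local + 2 remote

    def rough_ingest_ceiling(nic_mb_per_sec=GIGABIT_MB_PER_SEC,
                             replication=REPLICATION_FACTOR):
        """Crude per-node write ceiling when writes are spread evenly.

        In steady state each logical byte arrives at a node's NIC roughly
        `replication` times: once from the client, and once for each remote
        replica of other nodes' blocks that happens to land here. So inbound
        bandwidth divided by the replication factor bounds the logical ingest
        rate, before counting reads, compactions, or shuffle traffic.
        """
        return nic_mb_per_sec / replication

    if __name__ == "__main__":
        print("rough per-node ingest ceiling: ~%.0f MB/sec" % rough_ingest_ceiling())

Under those assumptions each node tops out around 35-40 MB/sec of logical
write throughput, well below the aggregate disk bandwidth of a multi-disk
node, so the network is a plausible wall to hit once the data volume in
the flow doubles.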