Re: RegionServer dying every two or three days

Neil Yalowitz Thu, 19 Jan 2012 11:50:12 -0800

We have experienced many problems with our cluster on EC2.  The blunt
solution was to increase the Zookeeper timeout to 5 minutes or even more.


Even with a long timeout, however, it's not uncommon for us to see an EC2
instance to become unresponsive to pings and SSH several times during a
week.  It's been a very bad environment for clusters.


Neil

On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
<leoga...@jusbrasil.com.br>wrote:

> Hi Guys,
>
> I have tested the parameters provided by Sandy, and it solved the GC
> problems with the -XX:+UseParallelOldGC, thanks for the help Sandy.
> I'm still experiencing some difficulties, the RegionServer continues to
> shutdown, but it seems related to I/O. It starts to timeout many
> connections, new connections to/from the machine timeout too, and finally
> the RegionServer dies because of YouAreDeadException. I will collect more
> data, but i think it's an Amazon/Virtualized Environment inherent issue.
>
> Thanks for the great help provided so far.
>
> 2012/1/5 Leonardo Gamas <leoga...@jusbrasil.com.br>
>
> > I don't think so, if Amazon stopped the machine it would cause a stop of
> > minutes, not seconds, and since the DataNode, TaskTracker and Zookeeper
> > continue to work normally.
> > But it can be related to the shared environment nature of Amazon, maybe
> > some spike in I/O caused by another virtualized server in the same
> physical
> > machine.
> >
> > But the intance type i'm using:
> >
> > *Extra Large Instance*
> >
> > 15 GB memory
> > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > 1,690 GB instance storage
> > 64-bit platform
> > I/O Performance: High
> > API name: m1.xlarge
> > I was not expecting to suffer from this problems, or at least not much.
> >
> >
> > 2012/1/5 Sandy Pratt <prat...@adobe.com>
> >
> >> You think it's an Amazon problem maybe?  Like they paused or migrated
> >> your virtual machine, and it just happens to be during GC, leaving us to
> >> think the GC ran long when it didn't?  I don't have a lot of experience
> >> with Amazon so I don't know if that sort of thing is common.
> >>
> >> > -----Original Message-----
> >> > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br]
> >> > Sent: Thursday, January 05, 2012 13:15
> >> > To: user@hbase.apache.org
> >> > Subject: Re: RegionServer dying every two or three days
> >> >
> >> > I checked the CPU Utilization graphics provided by Amazon (it's not
> >> accurate,
> >> > since the sample time is about 5 minutes) and don't see any
> >> abnormality. I
> >> > will setup TSDB with Nagios to have a more reliable source of
> >> performance
> >> > data.
> >> >
> >> > The machines don't have swap space, if i run:
> >> >
> >> > $ swapon -s
> >> >
> >> > To display swap usage summary, it returns an empty list.
> >> >
> >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to tests.
> >> >
> >> > I don't have payed much attention to the value of the new size param.
> >> >
> >> > Thanks again for the help!!
> >> >
> >> > 2012/1/5 Sandy Pratt <prat...@adobe.com>
> >> >
> >> > > That size heap doesn't seem like it should cause a 36 second GC (a
> >> > > minor GC even if I remember your logs correctly), so I tend to think
> >> > > that other things are probably going on.
> >> > >
> >> > > This line here:
> >> > >
> >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840
> >> > > secs]
> >> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05
> >> > > 954388K->sys=0.01,
> >> > > real=36.96 secs]
> >> > >
> >> > > is really mysterious to me.  It seems to indicate that the process
> was
> >> > > blocked for almost 37 seconds during a minor collection.  Note the
> CPU
> >> > > times are very low but the wall time is very high.  If it was
> actually
> >> > > doing GC work, I'd expect to see user time higher than real time, as
> >> > > it is in other parallel collections (see your log snippet).  Were
> you
> >> > > really so CPU starved that it took 37 seconds to get in 50ms of
> work?
> >> > > I can't make sense of that.  I'm trying to think of something that
> >> > > would block you for that long while all your threads are stopped for
> >> > > GC, other than being in swap, but I can't come up with anything.
> >>  You're
> >> > certain you're not in swap?
> >> > >
> >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> while
> >> > > you troubleshoot?
> >> > >
> >> > > Why is your new size so small?  This generally means that relatively
> >> > > more objects are being tenured than would be with a larger new size.
> >> > > This could make collections of the old gen worse (GC time is said to
> >> > > be proportional to the number of live objects in the generation, and
> >> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> >> > > might be 1:3.  Were the new gen GCs taking too long?  This is
> probably
> >> > > orthogonal to your immediate issue, though.
> >> > >
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br]
> >> > > Sent: Thursday, January 05, 2012 5:33 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: Re: RegionServer dying every two or three days
> >> > >
> >> > >  St.Ack,
> >> > >
> >> > > I don't have made any attempt in GC tunning, yet.
> >> > > I will read the perf section as suggested.
> >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> >> > > currently used for alert only, the perfdata is not been stored, so
> >> > > it's kind of useless right now, but i was thinking in use TSDB to
> >> > > store it, any known case of integration?
> >> > > ---
> >> > >
> >> > > Sandy,
> >> > >
> >> > > Yes, my timeout is 30 seconds:
> >> > >
> >> > > <property>
> >> > >   <name>zookeeper.session.timeout</name>
> >> > >   <value>30000</value>
> >> > > </property>
> >> > >
> >> > > To our application it's a sufferable time to wait in case a
> >> > > RegionServer go offline.
> >> > >
> >> > > My heap is 4GB and my JVM params are:
> >> > >
> >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> >> > >
> >> > > I will try the -XX:+UseParallelOldGC param and post my feedback
> here.
> >> > > ---
> >> > >
> >> > > Ramkrishna,
> >> > >
> >> > > Seems the GC is the root of all evil in this case.
> >> > > ----
> >> > >
> >> > > Thank you all for the answers. I will try out these valuable advices
> >> > > given here and post my results.
> >> > >
> >> > > Leo Gamas.
> >> > >
> >> > > 2012/1/5 Ramkrishna S Vasudevan <ramkrishna.vasude...@huawei.com>
> >> > >
> >> > > > Recently we faced a similar problem and it was due to GC config.
> >> > > > Pls check your GC.
> >> > > >
> >> > > > Regards
> >> > > > Ram
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf
> Of
> >> > > > Stack
> >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: RegionServer dying every two or three days
> >> > > >
> >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> >> > > > <leoga...@jusbrasil.com.br> wrote:
> >> > > > > The third line took 36.96 seconds to execute, can this be
> causing
> >> > > > > this problem?
> >> > > > >
> >> > > >
> >> > > > Probably.  Have you made any attempt at GC tuning?
> >> > > >
> >> > > >
> >> > > > > Reading the code a little it seems that, even if it's disabled,
> if
> >> > > > > all files are target in a compaction, it's considered a major
> >> > > > > compaction. Is
> >> > > > it
> >> > > > > right?
> >> > > > >
> >> > > >
> >> > > > That is right.  They get 'upgraded' from minor to major.
> >> > > >
> >> > > > This should be fine though.  What you are avoiding setting major
> >> > > > compactions to 0 is all regions being major compacted on a
> period, a
> >> > > > heavy weight effective rewrite of all your data (unless already
> >> major
> >> > > > compacted).   It looks like you have this disabled which is good
> >> until
> >> > > > you've wrestled your cluster into submission.
> >> > > >
> >> > > >
> >> > > > > The machines don't have swap, so the swappiness parameter don't
> >> > > > > seem to apply here. Any other suggestion?
> >> > > > >
> >> > > >
> >> > > > See the perf section of the hbase manual.  It has our current
> list.
> >> > > >
> >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> >> > > >
> >> > > >
> >> > > > St.Ack
> >> > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > > 2012/1/4 Leonardo Gamas <leoga...@jusbrasil.com.br>
> >> > > > >
> >> > > > >> I will investigate this, thanks for the response.
> >> > > > >>
> >> > > > >>
> >> > > > >> 2012/1/3 Sandy Pratt <prat...@adobe.com>
> >> > > > >>
> >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> timed out, have not heard from server in 61103ms for sessionid
> >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> >> > > > >>> reconnect
> >> > > > >>>
> >> > > > >>> It looks like the process has been unresponsive for some time,
> >> > > > >>> so ZK
> >> > > > has
> >> > > > >>> terminated the session.  Did you experience a long GC pause
> >> > > > >>> right
> >> > > > before
> >> > > > >>> this?  If you don't have GC logging enabled for the RS, you
> can
> >> > > > sometimes
> >> > > > >>> tell by noticing a gap in the timestamps of the log statements
> >> > > > >>> leading
> >> > > > up
> >> > > > >>> to the crash.
> >> > > > >>>
> >> > > > >>> If it turns out to be GC, you might want to look at your
> kernel
> >> > > > >>> swappiness setting (set it to 0) and your JVM params.
> >> > > > >>>
> >> > > > >>> Sandy
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> > -----Original Message-----
> >> > > > >>> > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br]
> >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> >> > > > >>> > To: user@hbase.apache.org
> >> > > > >>> > Subject: RegionServer dying every two or three days
> >> > > > >>> >
> >> > > > >>> > Hi,
> >> > > > >>> >
> >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1
> Master +
> >> > > > >>> > 3
> >> > > > >>> Slaves),
> >> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra
> Large
> >> > > > Instance
> >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> Zookeeper.
> >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> >> > > > >>> > Datanode,
> >> > > > >>> TaskTracker,
> >> > > > >>> > RegionServer and Zookeeper.
> >> > > > >>> >
> >> > > > >>> > From time to time, every two or three days, one of the
> >> > > > >>> > RegionServers processes goes down, but the other processes
> >> > > > >>> > (DataNode, TaskTracker,
> >> > > > >>> > Zookeeper) continue normally.
> >> > > > >>> >
> >> > > > >>> > Reading the logs:
> >> > > > >>> >
> >> > > > >>> > The connection with Zookeeper timed out:
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> > timed
> >> > > > out,
> >> > > > >>> have
> >> > > > >>> > not heard from server in 61103ms for sessionid
> >> > > > >>> > 0x23462a4cf93a8fc,
> >> > > > >>> closing
> >> > > > >>> > socket connection and attempting reconnect
> >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> > timed
> >> > > > out,
> >> > > > >>> have
> >> > > > >>> > not heard from server in 61205ms for sessionid
> >> > > > >>> > 0x346c561a55953e,
> >> > > > closing
> >> > > > >>> > socket connection and attempting reconnect
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > And the Handlers start to fail:
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> Responder,
> >> > > > >>> > call
> >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf)
> >> > > > >>> > from
> >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler
> 81
> >> > > > >>> > on
> >> > > > 60020
> >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> >> > > > 13
> >> > > > >>> > 3)
> >> > > > >>> >         at
> >> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > > > >>> > 1341)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> >> > > > >>> > ns
> >> > > > >>> > e(HB
> >> > > > >>> > aseServer.java:727)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> >> > > > >>> > as
> >> > > > >>> > eSe
> >> > > > >>> > rver.java:792)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 083)
> >> > > > >>> >
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> Responder,
> >> > > > >>> > call
> >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430)
> >> > > > >>> > from
> >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler
> 62
> >> > > > >>> > on
> >> > > > 60020
> >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> >> > > > 13
> >> > > > >>> > 3)
> >> > > > >>> >         at
> >> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > > > >>> > 1341)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> >> > > > >>> > ns
> >> > > > >>> > e(HB
> >> > > > >>> > aseServer.java:727)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> >> > > > >>> > as
> >> > > > >>> > eSe
> >> > > > >>> > rver.java:792)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 083)
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > > > connection
> >> > > > >>> to
> >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket
> connection
> >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > > > initiating
> >> > > > >>> session
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> >> > > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc
> has
> >> > > > >>> > expired, closing
> >> > > > socket
> >> > > > >>> > connection
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > > > connection
> >> > > > >>> to
> >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket
> connection
> >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > > > initiating
> >> > > > >>> session
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> >> > > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e has
> >> > > > >>> > expired, closing
> >> > > > socket
> >> > > > >>> > connection
> >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> >> > > > >>> > region server
> >> > > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> >> > maxHeap=4083):
> >> > > > >>> > Unhandled
> >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> >> > Server
> >> > > > >>> > REPORT rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead
> server
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > > > >>> > rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> >> > > > as
> >> > > > >>> > dead server
> >> > > > >>> >         at
> >> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > > > >>> > Method)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> >> > > > to
> >> > > > r
> >> > > > >>> > AccessorImpl.java:39)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> >> > > > Co
> >> > > > n
> >> > > > >>> > structorAccessorImpl.java:27)
> >> > > > >>> >         at
> >> > > > >>>
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> >> > > > >>> > ot
> >> > > > >>> > eExce
> >> > > > >>> > ption.java:95)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> >> > > > >>> > mo
> >> > > > >>> > te
> >> > > > >>> > Exception.java:79)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> >> > > > >>> > rv
> >> > > > >>> > erRep
> >> > > > >>> > ort(HRegionServer.java:735)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> >> > > > .j
> >> > > > >>> > ava:596)
> >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > > > >>> > rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> >> > > > as
> >> > > > >>> > dead server
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> >> > > > >>> > rM
> >> > > > >>> > ana
> >> > > > >>> > ger.java:204)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> >> > > > >>> > t(
> >> > > > >>> > Serv
> >> > > > >>> > erManager.java:262)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> >> > > > >>> > te
> >> > > > >>> > r.jav
> >> > > > >>> > a:669)
> >> > > > >>> >         at
> sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> >> > > > Source)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> >> > > > >>> > od
> >> > > > >>> > Acces
> >> > > > >>> > sorImpl.java:25)
> >> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 039)
> >> > > > >>> >
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> >> > > > >>> > av
> >> > > > >>> > a:257
> >> > > > >>> > )
> >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> >> > > > >>> > rv
> >> > > > >>> > erRep
> >> > > > >>> > ort(HRegionServer.java:729)
> >> > > > >>> >         ... 2 more
> >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
> >> > metrics:
> >> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> >> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> >> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> >> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> >> > > > >>> > blockCacheHitCachingRatio=98
> >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> >> > > > >>> > Unhandled
> >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> >> > Server
> >> > > > >>> > REPORT rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead
> server
> >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
> >> > > > >>> > 60020
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > Then i restart the RegionServer and everything is back to
> >> normal.
> >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i
> don't
> >> > > > >>> > see any abnormality in the same time window.
> >> > > > >>> > I think it was caused by the lost of connection to
> zookeeper.
> >> > > > >>> > Is it
> >> > > > >>> advisable to
> >> > > > >>> > run zookeeper in the same machines?
> >> > > > >>> > if the RegionServer lost it's connection to Zookeeper,
> there's
> >> > > > >>> > a way
> >> > > > (a
> >> > > > >>> > configuration perhaps) to re-join the cluster, and not only
> >> die?
> >> > > > >>> >
> >> > > > >>> > Any idea what is causing this?? Or to prevent it from
> >> happening?
> >> > > > >>> >
> >> > > > >>> > Any help is appreciated.
> >> > > > >>> >
> >> > > > >>> > Best Regards,
> >> > > > >>> >
> >> > > > >>> > --
> >> > > > >>> >
> >> > > > >>> > *Leonardo Gamas*
> >> > > > >>> > Software Engineer
> >> > > > >>> > +557134943514
> >> > > > >>> > +557581347440
> >> > > > >>> > leoga...@jusbrasil.com.br
> >> > > > >>> > www.jusbrasil.com.br
> >> > > > >>>
> >> > > > >>
> >> > > > >>
> >> > > > >>
> >> > > > >> --
> >> > > > >>
> >> > > > >> *Leonardo Gamas*
> >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> (75)
> >> > > > >> 8134-7440 leoga...@jusbrasil.com.br www.jusbrasil.com.br
> >> > > > >>
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > >
> >> > > > > *Leonardo Gamas*
> >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> >> > > > > 8134-7440 leoga...@jusbrasil.com.br www.jusbrasil.com.br
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > *Leonardo Gamas*
> >> > > Software Engineer
> >> > > +557134943514
> >> > > +557581347440
> >> > > leoga...@jusbrasil.com.br
> >> > > www.jusbrasil.com.br
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > *Leonardo Gamas*
> >> > Software Engineer
> >> > T +55 (71) 3494-3514
> >> > C +55 (75) 8134-7440
> >> > leoga...@jusbrasil.com.br
> >> > www.jusbrasil.com.br
> >>
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> >
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leoga...@jusbrasil.com.br
> >
> > www.jusbrasil.com.br
> >
> >
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leoga...@jusbrasil.com.br
> www.jusbrasil.com.br
>

Re: RegionServer dying every two or three days

Reply via email to