Thanks, Neil, for sharing your experience with AWS! Could you tell us what instance type you are using? We are using m1.xlarge, which has 4 virtual cores, but I normally see recommendations for machines with 8 cores, such as c1.xlarge, m2.4xlarge, etc. In principle these 8-core machines shouldn't suffer as much from I/O problems, since they don't share the physical server with other tenants. Is there any information from Amazon or another source that confirms that, or is it based on empirical analysis?
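One thing I plan to do to test the noisy-neighbor theory empirically, rather than relying only on instance-size folklore, is to watch the hypervisor steal time and disk wait on the RegionServer while the timeouts are happening. A rough sketch of the checks, assuming a Linux guest and that the sysstat package is available on the AMI (the 5-minute CloudWatch samples are too coarse to catch these spikes):

$ vmstat 5           # 'st' column = % of CPU time stolen by the hypervisor
$ iostat -x 5        # await / %util on the instance-store volumes (needs sysstat)
$ sar -u 5 12        # %steal and %iowait over a one-minute window (needs sysstat)

If %steal and %iowait stay near zero during one of the unresponsive windows, the problem is more likely inside the JVM (GC) than in the shared hardware.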
2012/1/19 Neil Yalowitz <neilyalow...@gmail.com> > We have experienced many problems with our cluster on EC2. The blunt > solution was to increase the Zookeeper timeout to 5 minutes or even more. > > Even with a long timeout, however, it's not uncommon for us to see an EC2 > instance to become unresponsive to pings and SSH several times during a > week. It's been a very bad environment for clusters. > > > Neil > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas > <leoga...@jusbrasil.com.br>wrote: > > > Hi Guys, > > > > I have tested the parameters provided by Sandy, and it solved the GC > > problems with the -XX:+UseParallelOldGC, thanks for the help Sandy. > > I'm still experiencing some difficulties, the RegionServer continues to > > shutdown, but it seems related to I/O. It starts to timeout many > > connections, new connections to/from the machine timeout too, and finally > > the RegionServer dies because of YouAreDeadException. I will collect more > > data, but i think it's an Amazon/Virtualized Environment inherent issue. > > > > Thanks for the great help provided so far. > > > > 2012/1/5 Leonardo Gamas <leoga...@jusbrasil.com.br> > > > > > I don't think so, if Amazon stopped the machine it would cause a stop > of > > > minutes, not seconds, and since the DataNode, TaskTracker and Zookeeper > > > continue to work normally. > > > But it can be related to the shared environment nature of Amazon, maybe > > > some spike in I/O caused by another virtualized server in the same > > physical > > > machine. > > > > > > But the intance type i'm using: > > > > > > *Extra Large Instance* > > > > > > 15 GB memory > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each) > > > 1,690 GB instance storage > > > 64-bit platform > > > I/O Performance: High > > > API name: m1.xlarge > > > I was not expecting to suffer from this problems, or at least not much. > > > > > > > > > 2012/1/5 Sandy Pratt <prat...@adobe.com> > > > > > >> You think it's an Amazon problem maybe? Like they paused or migrated > > >> your virtual machine, and it just happens to be during GC, leaving us > to > > >> think the GC ran long when it didn't? I don't have a lot of > experience > > >> with Amazon so I don't know if that sort of thing is common. > > >> > > >> > -----Original Message----- > > >> > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br] > > >> > Sent: Thursday, January 05, 2012 13:15 > > >> > To: user@hbase.apache.org > > >> > Subject: Re: RegionServer dying every two or three days > > >> > > > >> > I checked the CPU Utilization graphics provided by Amazon (it's not > > >> accurate, > > >> > since the sample time is about 5 minutes) and don't see any > > >> abnormality. I > > >> > will setup TSDB with Nagios to have a more reliable source of > > >> performance > > >> > data. > > >> > > > >> > The machines don't have swap space, if i run: > > >> > > > >> > $ swapon -s > > >> > > > >> > To display swap usage summary, it returns an empty list. > > >> > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to > tests. > > >> > > > >> > I don't have payed much attention to the value of the new size > param. > > >> > > > >> > Thanks again for the help!! > > >> > > > >> > 2012/1/5 Sandy Pratt <prat...@adobe.com> > > >> > > > >> > > That size heap doesn't seem like it should cause a 36 second GC (a > > >> > > minor GC even if I remember your logs correctly), so I tend to > think > > >> > > that other things are probably going on. 
> > >> > > > > >> > > This line here: > > >> > > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), > 0.0361840 > > >> > > secs] > > >> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 > > >> > > 954388K->sys=0.01, > > >> > > real=36.96 secs] > > >> > > > > >> > > is really mysterious to me. It seems to indicate that the process > > was > > >> > > blocked for almost 37 seconds during a minor collection. Note the > > CPU > > >> > > times are very low but the wall time is very high. If it was > > actually > > >> > > doing GC work, I'd expect to see user time higher than real time, > as > > >> > > it is in other parallel collections (see your log snippet). Were > > you > > >> > > really so CPU starved that it took 37 seconds to get in 50ms of > > work? > > >> > > I can't make sense of that. I'm trying to think of something that > > >> > > would block you for that long while all your threads are stopped > for > > >> > > GC, other than being in swap, but I can't come up with anything. > > >> You're > > >> > certain you're not in swap? > > >> > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts > > while > > >> > > you troubleshoot? > > >> > > > > >> > > Why is your new size so small? This generally means that > relatively > > >> > > more objects are being tenured than would be with a larger new > size. > > >> > > This could make collections of the old gen worse (GC time is said > to > > >> > > be proportional to the number of live objects in the generation, > and > > >> > > CMS does indeed cause STW pauses). A typical new to tenured ratio > > >> > > might be 1:3. Were the new gen GCs taking too long? This is > > probably > > >> > > orthogonal to your immediate issue, though. > > >> > > > > >> > > > > >> > > > > >> > > -----Original Message----- > > >> > > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br] > > >> > > Sent: Thursday, January 05, 2012 5:33 AM > > >> > > To: user@hbase.apache.org > > >> > > Subject: Re: RegionServer dying every two or three days > > >> > > > > >> > > St.Ack, > > >> > > > > >> > > I don't have made any attempt in GC tunning, yet. > > >> > > I will read the perf section as suggested. > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's > > >> > > currently used for alert only, the perfdata is not been stored, so > > >> > > it's kind of useless right now, but i was thinking in use TSDB to > > >> > > store it, any known case of integration? > > >> > > --- > > >> > > > > >> > > Sandy, > > >> > > > > >> > > Yes, my timeout is 30 seconds: > > >> > > > > >> > > <property> > > >> > > <name>zookeeper.session.timeout</name> > > >> > > <value>30000</value> > > >> > > </property> > > >> > > > > >> > > To our application it's a sufferable time to wait in case a > > >> > > RegionServer go offline. > > >> > > > > >> > > My heap is 4GB and my JVM params are: > > >> > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log > > >> > > > > >> > > I will try the -XX:+UseParallelOldGC param and post my feedback > > here. > > >> > > --- > > >> > > > > >> > > Ramkrishna, > > >> > > > > >> > > Seems the GC is the root of all evil in this case. > > >> > > ---- > > >> > > > > >> > > Thank you all for the answers. 
I will try out these valuable > advices > > >> > > given here and post my results. > > >> > > > > >> > > Leo Gamas. > > >> > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <ramkrishna.vasude...@huawei.com> > > >> > > > > >> > > > Recently we faced a similar problem and it was due to GC config. > > >> > > > Pls check your GC. > > >> > > > > > >> > > > Regards > > >> > > > Ram > > >> > > > > > >> > > > -----Original Message----- > > >> > > > From: saint....@gmail.com [mailto:saint....@gmail.com] On > Behalf > > Of > > >> > > > Stack > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM > > >> > > > To: user@hbase.apache.org > > >> > > > Subject: Re: RegionServer dying every two or three days > > >> > > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas > > >> > > > <leoga...@jusbrasil.com.br> wrote: > > >> > > > > The third line took 36.96 seconds to execute, can this be > > causing > > >> > > > > this problem? > > >> > > > > > > >> > > > > > >> > > > Probably. Have you made any attempt at GC tuning? > > >> > > > > > >> > > > > > >> > > > > Reading the code a little it seems that, even if it's > disabled, > > if > > >> > > > > all files are target in a compaction, it's considered a major > > >> > > > > compaction. Is > > >> > > > it > > >> > > > > right? > > >> > > > > > > >> > > > > > >> > > > That is right. They get 'upgraded' from minor to major. > > >> > > > > > >> > > > This should be fine though. What you are avoiding setting major > > >> > > > compactions to 0 is all regions being major compacted on a > > period, a > > >> > > > heavy weight effective rewrite of all your data (unless already > > >> major > > >> > > > compacted). It looks like you have this disabled which is good > > >> until > > >> > > > you've wrestled your cluster into submission. > > >> > > > > > >> > > > > > >> > > > > The machines don't have swap, so the swappiness parameter > don't > > >> > > > > seem to apply here. Any other suggestion? > > >> > > > > > > >> > > > > > >> > > > See the perf section of the hbase manual. It has our current > > list. > > >> > > > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb? > > >> > > > > > >> > > > > > >> > > > St.Ack > > >> > > > > > >> > > > > Thanks. > > >> > > > > > > >> > > > > 2012/1/4 Leonardo Gamas <leoga...@jusbrasil.com.br> > > >> > > > > > > >> > > > >> I will investigate this, thanks for the response. > > >> > > > >> > > >> > > > >> > > >> > > > >> 2012/1/3 Sandy Pratt <prat...@adobe.com> > > >> > > > >> > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session > > >> > > > >>> timed out, have not heard from server in 61103ms for > sessionid > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting > > >> > > > >>> reconnect > > >> > > > >>> > > >> > > > >>> It looks like the process has been unresponsive for some > time, > > >> > > > >>> so ZK > > >> > > > has > > >> > > > >>> terminated the session. Did you experience a long GC pause > > >> > > > >>> right > > >> > > > before > > >> > > > >>> this? If you don't have GC logging enabled for the RS, you > > can > > >> > > > sometimes > > >> > > > >>> tell by noticing a gap in the timestamps of the log > statements > > >> > > > >>> leading > > >> > > > up > > >> > > > >>> to the crash. > > >> > > > >>> > > >> > > > >>> If it turns out to be GC, you might want to look at your > > kernel > > >> > > > >>> swappiness setting (set it to 0) and your JVM params. 
> > >> > > > >>> > > >> > > > >>> Sandy > > >> > > > >>> > > >> > > > >>> > > >> > > > >>> > -----Original Message----- > > >> > > > >>> > From: Leonardo Gamas [mailto:leoga...@jusbrasil.com.br] > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44 > > >> > > > >>> > To: user@hbase.apache.org > > >> > > > >>> > Subject: RegionServer dying every two or three days > > >> > > > >>> > > > >> > > > >>> > Hi, > > >> > > > >>> > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 > > Master + > > >> > > > >>> > 3 > > >> > > > >>> Slaves), > > >> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra > > Large > > >> > > > Instance > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and > > Zookeeper. > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running > > >> > > > >>> > Datanode, > > >> > > > >>> TaskTracker, > > >> > > > >>> > RegionServer and Zookeeper. > > >> > > > >>> > > > >> > > > >>> > From time to time, every two or three days, one of the > > >> > > > >>> > RegionServers processes goes down, but the other processes > > >> > > > >>> > (DataNode, TaskTracker, > > >> > > > >>> > Zookeeper) continue normally. > > >> > > > >>> > > > >> > > > >>> > Reading the logs: > > >> > > > >>> > > > >> > > > >>> > The connection with Zookeeper timed out: > > >> > > > >>> > > > >> > > > >>> > --------------------------- > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client > session > > >> > > > >>> > timed > > >> > > > out, > > >> > > > >>> have > > >> > > > >>> > not heard from server in 61103ms for sessionid > > >> > > > >>> > 0x23462a4cf93a8fc, > > >> > > > >>> closing > > >> > > > >>> > socket connection and attempting reconnect > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client > session > > >> > > > >>> > timed > > >> > > > out, > > >> > > > >>> have > > >> > > > >>> > not heard from server in 61205ms for sessionid > > >> > > > >>> > 0x346c561a55953e, > > >> > > > closing > > >> > > > >>> > socket connection and attempting reconnect > > >> > > > >>> > --------------------------- > > >> > > > >>> > > > >> > > > >>> > And the Handlers start to fail: > > >> > > > >>> > > > >> > > > >>> > --------------------------- > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server > > Responder, > > >> > > > >>> > call > > >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf > ) > > >> > > > >>> > from > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler > > 81 > > >> > > > >>> > on > > >> > > > 60020 > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java: > > >> > > > 13 > > >> > > > >>> > 3) > > >> > > > >>> > at > > >> > > > >>> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > > >> > > > >>> > 1341) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo > > >> > > > >>> > ns > > >> > > > >>> > e(HB > > >> > > > >>> > aseServer.java:727) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB > > >> > > > >>> > as > > >> > > > >>> > eSe > > >> > > > >>> > rver.java:792) > > >> > > > >>> 
> at > > >> > > > >>> > > > >> > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java > > >> > > > :1 > > >> > > > >>> > 083) > > >> > > > >>> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server > > Responder, > > >> > > > >>> > call > > >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430 > ) > > >> > > > >>> > from > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler > > 62 > > >> > > > >>> > on > > >> > > > 60020 > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java: > > >> > > > 13 > > >> > > > >>> > 3) > > >> > > > >>> > at > > >> > > > >>> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java: > > >> > > > >>> > 1341) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo > > >> > > > >>> > ns > > >> > > > >>> > e(HB > > >> > > > >>> > aseServer.java:727) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB > > >> > > > >>> > as > > >> > > > >>> > eSe > > >> > > > >>> > rver.java:792) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java > > >> > > > :1 > > >> > > > >>> > 083) > > >> > > > >>> > --------------------------- > > >> > > > >>> > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( : > > >> > > > >>> > > > >> > > > >>> > --------------------------- > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening > socket > > >> > > > connection > > >> > > > >>> to > > >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket > > connection > > >> > > > >>> > established to > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > > >> > > > initiating > > >> > > > >>> session > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to > > >> > > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc > > has > > >> > > > >>> > expired, closing > > >> > > > socket > > >> > > > >>> > connection > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening > socket > > >> > > > connection > > >> > > > >>> to > > >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181 > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket > > connection > > >> > > > >>> > established to > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, > > >> > > > initiating > > >> > > > >>> session > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to > > >> > > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e > has > > >> > > > >>> > expired, closing > > >> > > > socket > > >> > > > >>> > connection > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: > ABORTING > > >> > > > >>> > region server > > >> > > > >>> > > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, > > >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584, > > >> > maxHeap=4083): > > >> > > > >>> > Unhandled > > >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: > > >> > 
Server > > >> > > > >>> > REPORT rejected; currently processing > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead > > server > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > > >> > > > >>> > rejected; currently processing > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > > >> > > > as > > >> > > > >>> > dead server > > >> > > > >>> > at > > >> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > > >> > > > >>> > Method) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc > > >> > > > to > > >> > > > r > > >> > > > >>> > AccessorImpl.java:39) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating > > >> > > > Co > > >> > > > n > > >> > > > >>> > structorAccessorImpl.java:27) > > >> > > > >>> > at > > >> > > > >>> > > java.lang.reflect.Constructor.newInstance(Constructor.java:513) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem > > >> > > > >>> > ot > > >> > > > >>> > eExce > > >> > > > >>> > ption.java:95) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re > > >> > > > >>> > mo > > >> > > > >>> > te > > >> > > > >>> > Exception.java:79) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe > > >> > > > >>> > rv > > >> > > > >>> > erRep > > >> > > > >>> > ort(HRegionServer.java:735) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer > > >> > > > .j > > >> > > > >>> > ava:596) > > >> > > > >>> > at java.lang.Thread.run(Thread.java:662) > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT > > >> > > > >>> > rejected; currently processing > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 > > >> > > > as > > >> > > > >>> > dead server > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve > > >> > > > >>> > rM > > >> > > > >>> > ana > > >> > > > >>> > ger.java:204) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor > > >> > > > >>> > t( > > >> > > > >>> > Serv > > >> > > > >>> > erManager.java:262) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas > > >> > > > >>> > te > > >> > > > >>> > r.jav > > >> > > > >>> > a:669) > > >> > > > >>> > at > > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown > > >> > > > Source) > > >> > > > >>> > at > > >> > > > >>> > > > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth > > >> > > > >>> > od > > >> > > > >>> > Acces > > >> > > > >>> > sorImpl.java:25) > > >> > > > >>> > at > java.lang.reflect.Method.invoke(Method.java:597) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > > >> > > > >>> > at > > >> > > > >>> > > > >> > > > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java > > >> > > > :1 > > >> > > > >>> > 039) > > >> > > > >>> > > > >> > > > >>> > at > > >> > > > >>> > > > >> > 
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j > > >> > > > >>> > av > > >> > > > >>> > a:257 > > >> > > > >>> > ) > > >> > > > >>> > at $Proxy6.regionServerReport(Unknown Source) > > >> > > > >>> > at > > >> > > > >>> > > > >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe > > >> > > > >>> > rv > > >> > > > >>> > erRep > > >> > > > >>> > ort(HRegionServer.java:729) > > >> > > > >>> > ... 2 more > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of > > >> > metrics: > > >> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970, > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796, > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552, > > >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648, > > >> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335, > > >> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96, > > >> > > > >>> > blockCacheHitCachingRatio=98 > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: > STOPPED: > > >> > > > >>> > Unhandled > > >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: > > >> > Server > > >> > > > >>> > REPORT rejected; currently processing > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead > > server > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on > > >> > > > >>> > 60020 > > >> > > > >>> > --------------------------- > > >> > > > >>> > > > >> > > > >>> > Then i restart the RegionServer and everything is back to > > >> normal. > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i > > don't > > >> > > > >>> > see any abnormality in the same time window. > > >> > > > >>> > I think it was caused by the lost of connection to > > zookeeper. > > >> > > > >>> > Is it > > >> > > > >>> advisable to > > >> > > > >>> > run zookeeper in the same machines? > > >> > > > >>> > if the RegionServer lost it's connection to Zookeeper, > > there's > > >> > > > >>> > a way > > >> > > > (a > > >> > > > >>> > configuration perhaps) to re-join the cluster, and not > only > > >> die? > > >> > > > >>> > > > >> > > > >>> > Any idea what is causing this?? Or to prevent it from > > >> happening? > > >> > > > >>> > > > >> > > > >>> > Any help is appreciated. 
> > >> > > > >>> > > > >> > > > >>> > Best Regards, > > >> > > > >>> > > > >> > > > >>> > -- > > >> > > > >>> > > > >> > > > >>> > *Leonardo Gamas* > > >> > > > >>> > Software Engineer > > >> > > > >>> > +557134943514 > > >> > > > >>> > +557581347440 > > >> > > > >>> > leoga...@jusbrasil.com.br > > >> > > > >>> > www.jusbrasil.com.br > > >> > > > >>> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> -- > > >> > > > >> > > >> > > > >> *Leonardo Gamas* > > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C > > (75) > > >> > > > >> 8134-7440 leoga...@jusbrasil.com.br www.jusbrasil.com.br > > >> > > > >> > > >> > > > >> > > >> > > > > > > >> > > > > > > >> > > > > -- > > >> > > > > > > >> > > > > *Leonardo Gamas* > > >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C > (75) > > >> > > > > 8134-7440 leoga...@jusbrasil.com.br www.jusbrasil.com.br > > >> > > > > > >> > > > > > >> > > > > >> > > > > >> > > -- > > >> > > > > >> > > *Leonardo Gamas* > > >> > > Software Engineer > > >> > > +557134943514 > > >> > > +557581347440 > > >> > > leoga...@jusbrasil.com.br > > >> > > www.jusbrasil.com.br > > >> > > > > >> > > > >> > > > >> > > > >> > -- > > >> > > > >> > *Leonardo Gamas* > > >> > Software Engineer > > >> > T +55 (71) 3494-3514 > > >> > C +55 (75) 8134-7440 > > >> > leoga...@jusbrasil.com.br > > >> > www.jusbrasil.com.br > > >> > > > > > > > > > > > > -- > > > > > > *Leonardo Gamas* > > > > > > Software Engineer > > > T +55 (71) 3494-3514 > > > C +55 (75) 8134-7440 > > > leoga...@jusbrasil.com.br > > > > > > www.jusbrasil.com.br > > > > > > > > > > > > -- > > > > *Leonardo Gamas* > > Software Engineer > > T +55 (71) 3494-3514 > > C +55 (75) 8134-7440 > > leoga...@jusbrasil.com.br > > www.jusbrasil.com.br > > > -- *Leonardo Gamas* Software Engineer T +55 (71) 3494-3514 C +55 (75) 8134-7440 leoga...@jusbrasil.com.br www.jusbrasil.com.br