FYI: it looks like increasing the number of ZooKeeper quorum members can resolve the following error:

org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region at org.apache.hadoop.hbase.
Now I am running a ZooKeeper quorum member on each node I have. However, I am still having issues with losing regionservers.
Is there a way to browse the znodes in ZooKeeper?

thanks
zhenyu

On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <[email protected]> wrote:

> JG,
>
> Thanks a lot for the tips.
> I set the heap to 4GB and the GC options to -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC.
>
> I checked the logs on my master and RS and found the following errors.
> Basically, the master got an exception while scanning ROOT, then the ROOT
> region was offlined and unset. Thus the regionserver gets
> NotServingRegionException errors.
>
> In the master:
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021, regionname: -ROOT-,,0, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
> java.io.IOException: Call to /x.x.x.x:60021 failed on local exception: java.io.EOFException
>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy1.openScanner(Unknown Source)
>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>     at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
>     at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
> 2009-10-28 19:00:30,591 INFO
> org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
> 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy1.openScanner(Unknown Source)
>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
>     at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>     at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META.
> region(s) scanned
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager: Removing server's info YYYY,60021,1256755470570
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlined ROOT server: x.x.x.x:60021
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1, startKey: <>}
> 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager: META region removed from onlineMetaRegions
>
> On the regionserver:
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: test,,1256755871065
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: region test,,1256755871065/796855017 available; sequence id is 10013291
> 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region test,,1256755871065
> 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x249c76021d0001 after 0ms
> 2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
>     at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at
> java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60021, call get([...@21fefd80, row=1053508149, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=email_ip_activity, columns=ALL}) from x.x.x.x:54669: error: org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
>
> On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <[email protected]> wrote:
>
>> These client error messages are not particularly descriptive as to the root
>> cause (they are fatal errors, or close to it).
>>
>> What is going on in your regionservers when these errors happen? Check
>> the master and RS logs.
>>
>> Also, you definitely do not want 19 ZooKeeper nodes. Reduce that to 3 or
>> 5 max.
>>
>> What is the hardware you are using for these nodes, and what settings do
>> you have for heap/GC?
>>
>> JG
>>
>> Zhenyu Zhong wrote:
>>
>>> Stack,
>>>
>>> Thank you very much for your comments.
>>> I am running a cluster with 20 nodes. I set up 19 of them as both
>>> regionservers and ZooKeeper quorum members.
>>> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
>>> I started with an empty table and am trying to load 200 million records
>>> into that empty table.
>>> There is a key in each record. Logically, in my MR program, during setup
>>> I open an HTable; in my mapper, I fetch the record from the HTable via
>>> the key in the record, then make some changes to the columns and write
>>> that row back to the HTable through TableOutputFormat by passing a Put.
>>> There are no reduce tasks involved. (Though it is unnecessary to fetch a
>>> row from an empty table, I did that intentionally.)
>>>
>>> Additionally, when I reduced the number of regionservers and the number of
>>> ZooKeeper quorum members,
>>> I had different errors:
>>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
>>>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
>>>     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
>>>     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> Many thanks in advance.
>>> zhenyu
>>>
>>> On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected]> wrote:
>>>
>>>> What's your cluster topology? How many nodes are involved? When you see the
>>>> below message, how many regions are in your table? How are you loading your
>>>> table?
>>>> Thanks,
>>>> St.Ack
>>>>
>>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
>>>>
>>>>> Nitay,
>>>>>
>>>>> I appreciate it very much.
>>>>>
>>>>> As Ryan suggested, I increased the ZooKeeper session timeout to
>>>>> 40 seconds, with the GC options -XX:ParallelGCThreads=8
>>>>> -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB. I also
>>>>> set vm.swappiness=0.
>>>>>
>>>>> However, it still ran into problems. Please see the following errors.
>>>>>
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server x.x.x.x:60021 for region YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 attempts.
>>>>> Exceptions:
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting
>>>>> up proxy to /x.x.x.x:60021 after attempts=1
>>>>>
>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>>>>>     at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>>>>>
>>>>> The input file is about 10GB, around 200 million rows of data.
>>>>> This load doesn't seem too large. However, these kinds of errors keep
>>>>> popping up.
>>>>>
>>>>> Do regionservers need to be deployed on dedicated machines?
>>>>> Does ZooKeeper need to be deployed on dedicated machines as well?
>>>>>
>>>>> Best,
>>>>> zhenyu
>>>>>
>>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:
>>>>>
>>>>>> Hi Zhenyu,
>>>>>>
>>>>>> Sorry for the delay. I started working on this a while back, before I
>>>>>> left my job for another company. Since then I haven't had much time to
>>>>>> work on HBase unfortunately :(. I'll try to dig up what I had and see
>>>>>> what shape it's in and update you.
>>>>>>
>>>>>> Cheers,
>>>>>> -n
>>>>>>
>>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
>>>>>>
>>>>>>> Sorry, I must have mistyped, I meant to say "40 seconds". You can
>>>>>>> still see multi-second pauses at times, so you need to give yourself
>>>>>>> a bigger buffer.
>>>>>>>
>>>>>>> The parallel threads argument should not be necessary, but you do
>>>>>>> need the UseConcMarkSweepGC flag as well.
>>>>>>>
>>>>>>> Let us know how it goes!
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ryan,
>>>>>>>> I very much appreciate your feedback.
>>>>>>>> I have set the zookeeper.session.timeout to seconds, which is way
>>>>>>>> higher than 40ms.
>>>>>>>> At the same time, -Xms is set to 4GB, which should be
>>>>>>>> sufficient.
>>>>>>>> I also tried GC options like
>>>>>>>>
>>>>>>>> -XX:ParallelGCThreads=8
>>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>>
>>>>>>>> I even set vm.swappiness=0.
>>>>>>>>
>>>>>>>> However, I still came across the problem that a RegionServer
>>>>>>>> shut itself down.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> zhong
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Set the ZK timeout to something like 40ms, and give the GC enough
>>>>>>>>> Xmx so you never risk entering the much dreaded
>>>>>>>>> concurrent-mode-failure whereby the entire heap must be GCed.
>>>>>>>>>
>>>>>>>>> Consider testing Java 7 and the G1 GC.
>>>>>>>>>
>>>>>>>>> We could get a JNI thread to do this, but no one has done so yet. I
>>>>>>>>> am personally hoping for G1 and in the meantime overprovision our
>>>>>>>>> Xmx to avoid the concurrent mode failures.
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Ryan,
>>>>>>>>>>
>>>>>>>>>> Thank you very much.
>>>>>>>>>> May I ask whether there are any ways to get around this problem to
>>>>>>>>>> make HBase more stable?
>>>>>>>>>>
>>>>>>>>>> best,
>>>>>>>>>> zhong
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> There isn't any working code yet. Just an idea, and a prototype.
>>>>>>>>>>>
>>>>>>>>>>> There is some sense that if we can get the G1 GC that we could
>>>>>>>>>>> get rid of all long pauses, and avoid the need for this.
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am very interested in the solution that Joey proposed and
>>>>>>>>>>>> would like to give it a try.
>>>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper with
>>>>>>>>>>>> JNI integration?
>>>>>>>>>>>>
>>>>>>>>>>>> I would appreciate it very much.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks
>>>>>>>>>>>> zhong
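
A note on JG's advice above to reduce the ZooKeeper ensemble from 19 nodes to 3 or 5: ZooKeeper stays available only while a strict majority of the ensemble is reachable, and each write has to be acknowledged by such a majority, so a larger ensemble means more servers involved in every write. The sketch below works through that arithmetic for the ensemble sizes mentioned in this thread. It is only an illustration: the class QuorumMath and its method names are hypothetical and are not part of ZooKeeper or HBase.

```java
// Illustrative sketch: majority-quorum arithmetic for a ZooKeeper ensemble.
// QuorumMath and its methods are hypothetical names, not ZooKeeper or HBase API.
final class QuorumMath {

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensemble size must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can fail while a majority is still reachable. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Ensemble sizes discussed in the thread: 3 or 5 (suggested) and 19 (original setup).
        int[] sizes = {3, 5, 19};
        for (int n : sizes) {
            System.out.printf("ensemble=%d majority=%d toleratedFailures=%d%n",
                    n, majority(n), toleratedFailures(n));
        }
    }
}
```

For 3, 5 and 19 servers this gives majorities of 2, 3 and 10, tolerating 1, 2 and 9 failures. An even-sized ensemble tolerates no more failures than the odd size just below it (4 servers still tolerate only 1), which is one reason odd sizes are the usual choice.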
