Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)

Zhenyu Zhong Thu, 29 Oct 2009 15:01:23 -0700

The only other thing that might have been started is another MR job working on
a different dataset at the same time as this test was running, so some nodes
might have been under heavy load.
I am wondering whether that could cause the connection timeout.
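
To make the question concrete, here is a rough sketch (not HBase or ZooKeeper code) of the failure mode: a ZooKeeper session expires when the server hears nothing from the client for longer than the session timeout, so a process stalled by GC, swapping, or a starved CPU can lose its session even if the network is fine. The function name and the sample timelines are hypothetical.

```python
def session_expired(heartbeat_times_s, session_timeout_s):
    """Return True if any gap between consecutive heartbeats exceeds the timeout.

    heartbeat_times_s: ascending timestamps (seconds) at which the server
    heard from the client. Purely illustrative; not a ZooKeeper API.
    """
    gaps = (later - earlier
            for earlier, later in zip(heartbeat_times_s, heartbeat_times_s[1:]))
    return any(gap > session_timeout_s for gap in gaps)


# Hypothetical timelines: steady pings vs. one 45-second stall on a busy node.
steady = [0, 10, 20, 30, 40]
stalled = [0, 10, 55, 65]

print(session_expired(steady, 40))   # False
print(session_expired(stalled, 40))  # True
```

With a 40-second timeout, only the timeline with the 45-second gap loses its session.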


thanks
zhenyu



On Thu, Oct 29, 2009 at 5:32 PM, stack <[email protected]> wrote:

> On Thu, Oct 29, 2009 at 2:23 PM, Zhenyu Zhong <[email protected]
> >wrote:
>
> > I have 19 quorum members now.
> >
> That's too many.  Have 3 or maybe 5.  See the zk site for the rationale.
>
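
For reference, the usual arithmetic behind that advice, as a sketch: ZooKeeper needs a strict majority of the ensemble to agree on every write, so a larger ensemble means more acknowledgements per write. The helper names below are made up for illustration.

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be down while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


# Columns: ensemble size, acks needed per write, failures tolerated.
for n in (3, 5, 19):
    print(n, majority(n), tolerated_failures(n))
# 3 2 1
# 5 3 2
# 19 10 9
```

So 19 members survive 9 failures, but every write waits on 10 servers instead of 2 or 3.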
>
>
> > When I tested loading data into two column families of one table in HBase
> > using two separate MR jobs, I lost my regionserver and the test failed.
> >
> > Does HBase allow this kind of table update?
> >
> > The errors I got when I lost my regionserver are:
> > 2009-10-29 21:09:34,705 INFO org.apache.hadoop.hbase.regionserver.HLog:
> > Roll /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256849620029,
> > entries=271911, calcsize=63754142, filesize=33975611.
> > New hlog /hbase/.logs/YYYY,60021,1256849619429/hlog.dat.1256850574705
> > 2009-10-29 21:09:50,322 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> >
>
>
> You have read the 'Getting Started' and made the necessary changes to
> filedescriptors and xceiver count?
>
> You will see the above message after a regionserver has restarted and tries to
> go back to the master (what hbase is this? I think you said it is 0.20.x).
>
>
>
>
> > java.io.IOException: TIMED OUT
> >        at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > 2009-10-29 21:09:50,873 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> > state: Disconnected, type: None, path:
> > null
> >
>
> This is a timeout against zk.  You've lost your session.  The RS will go
> down.  The connection to zk is basic to hbase.  Something is up.  In the
> past others have reported things like incorrect BIOS settings for disks that
> have made the disks run slow, or just something up with the networking.  Can
> you check all is healthy?  You seem to be having too many issues for such a
> small load on such a large cluster.
>
> St.Ack
>
>
>
> >
> >
> >
> >
> > On Thu, Oct 29, 2009 at 2:51 PM, stack <[email protected]> wrote:
> >
> > > On Thu, Oct 29, 2009 at 11:46 AM, Zhenyu Zhong <
> [email protected]
> > > >wrote:
> > >
> > > > FYI
> > > > > It looks like increasing the number of Zookeeper Quorums can solve the
> > > > > following error message:
> > > > > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
> > > > > trying to locate root region at
> > > > > org.apache.hadoop.hbase.
> > > >
> > > > You mean quorum members?  How many do you have now?
> > >
> > >
> > >
> > > > > Now I am running a Zookeeper quorum member on each node I have.
> > > > > However, I am still having issues with losing regionservers.
> > > >
> > > > What's in the logs?
> > >
> > >
> > >
> > >
> > > > Is there a way to browse the Znode in zookeeper?
> > > >
> > > >
> > > Type 'zk' in the hbase shell.
> > >
> > > > You can get to the zk shell from the hbase shell.  You do things like:
> > >
> > > > zk "ls /"
> > >
> > > (Yes, quotes needed).
> > >
> > > St.Ack
> > >
> > >
> > >
> > > > thanks
> > > > zhenyu
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Oct 28, 2009 at 3:40 PM, Zhenyu Zhong <
> [email protected]
> > > > >wrote:
> > > >
> > > > > JG,
> > > > >
> > > > >
> > > > > Thanks a lot for the tips.
> > > > > I set the HEAP to 4GB and GC options as -XX:ParallelGCThreads=8
> > > > >  -XX:+UseConcMarkSweepGC.
> > > > >
> > > > > > I checked the logs on my Master and RS and found the following errors.
> > > > > > Basically, the master got an exception while scanning ROOT, then the
> > > > > > ROOT region was offlined and unset.  Thus the regionserver got
> > > > > > NotServingRegionException errors.
> > > > >
> > > > > In the master:
> > > > > > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021,
> > > > > > regionname: -ROOT-,,0, startKey: <>}
> > > > > > 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > Scan ROOT region
> > > > > > java.io.IOException: Call to /x.x.x.x:60021 failed on local exception:
> > > > > > java.io.EOFException
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
> > > > > >         at $Proxy1.openScanner(Unknown Source)
> > > > > >         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
> > > > > >         at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
> > > > > >         at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
> > > > > >         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
> > > > > >         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> > > > > > Caused by: java.io.EOFException
> > > > > >         at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
> > > > > > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021,
> > > > > > regionname: .META.,,1, startKey: <>}
> > > > > > 2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> > > > > > startKey: <>}
> > > > > > java.net.ConnectException: Connection refused
> > > > > >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> > > > > >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> > > > > >         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> > > > > >         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
> > > > > >         at $Proxy1.openScanner(Unknown Source)
> > > > > >         at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
> > > > > >         at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
> > > > > >         at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
> > > > > >         at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
> > > > > >         at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> > > > > > 2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > All 1 .META. region(s) scanned
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Removing server's info YYYY,60021,1256755470570
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> > > > > > Offlined ROOT server: x.x.x.x:60021
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> > > > > > -ROOT- region unset (but not set to be reassigned)
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> > > > > > ROOT inserted into regionsInTransition
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> > > > > > Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1,
> > > > > > startKey: <>}
> > > > > > 2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
> > > > > > META region removed from onlineMetaRegions
> > > > >
> > > > >
> > > > >
> > > > > On the regionserver:
> > > > > > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > > MSG_REGION_OPEN: test,,1256755871065
> > > > > > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > > Worker: MSG_REGION_OPEN: test,,1256755871065
> > > > > > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > > > > > region test,,1256755871065/796855017 available; sequence id is 10013291
> > > > > > 2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > > > > > Starting compaction on region test,,1256755871065
> > > > > > 2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > > > > response for sessionid:0x249c76021d0001 after 0ms
> > > > > > 2009-10-28 18:51:19,341 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
> > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
> > > > > >         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> > > > > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > > >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
> > > > > >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> > > > > > 2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> > > > > > handler 0 on 60021, call get([...@21fefd80, row=1053508149, maxVersions=1,
> > > > > > timeRange=[0,9223372036854775807), families={(family=email_ip_activity,
> > > > > > columns=ALL}) from x.x.x.x:54669: error:
> > > > > > org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <[email protected]>
> > > > wrote:
> > > > >
> > > > > >> These client error messages are not particularly descriptive as to the
> > > > > >> root cause (they are fatal errors, or close to it).
> > > > > >>
> > > > > >> What is going on in your regionservers when these errors happen?  Check
> > > > > >> the master and RS logs.
> > > > > >>
> > > > > >> Also, you definitely do not want 19 zookeeper nodes.  Reduce that to 3
> > > > > >> or 5 max.
> > > > > >>
> > > > > >> What is the hardware you are using for these nodes, and what settings
> > > > > >> do you have for heap/GC?
> > > > >>
> > > > >> JG
> > > > >>
> > > > >>
> > > > >> Zhenyu Zhong wrote:
> > > > >>
> > > > >>> Stack,
> > > > >>>
> > > > >>> Thank you very much for your comments.
> > > > > >>> I am running a cluster with 20 nodes. I set 19 of them up as both
> > > > > >>> regionservers and zookeeper quorum members.
> > > > > >>> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
> > > > > >>> I started with an empty table and tried to load 200 million records
> > > > > >>> into it.
> > > > > >>> There is a key in each record. Logically, in my MR program, I open an
> > > > > >>> HTable during setup; in my mapper, I fetch the row from the HTable via
> > > > > >>> the key in the record, make some changes to the columns, and write that
> > > > > >>> row back to the HTable through TableOutputFormat by passing a Put. There
> > > > > >>> are no reduce tasks involved here.  (Though it is unnecessary to fetch a
> > > > > >>> row from an empty table, I did that intentionally.)
> > > > > >>>
> > > > > >>> Additionally, when I reduced the number of regionservers and the number
> > > > > >>> of zookeeper quorum members, I got different errors:
> > > > > >>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
> > > > > >>> trying to locate root region
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
> > > > > >>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
> > > > > >>>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
> > > > > >>>         at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
> > > > > >>>         at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
> > > > > >>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
> > > > > >>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> > > > > >>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > >>>
> > > > >>> Many thanks in advance.
> > > > >>> zhenyu
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected]>
> wrote:
> > > > >>>
> > > > >>>  Whats your cluster topology?  How many nodes involved?  When you
> > see
> > > > the
> > > > >>>> below message, how many regions in your table?  How are you
> > loading
> > > > your
> > > > >>>> table?
> > > > >>>> Thanks,
> > > > >>>> St.Ack
> > > > >>>>
> > > > >>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <
> > > > [email protected]
> > > > >>>>
> > > > >>>>> wrote:
> > > > > >>>>> Nitay,
> > > > > >>>>>
> > > > > >>>>> I really appreciate it.
> > > > > >>>>>
> > > > > >>>>> As Ryan suggested, I increased the zookeeper session timeout to
> > > > > >>>>> 40 seconds, with the GC options -XX:ParallelGCThreads=8
> > > > > >>>>> -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB.  I also
> > > > > >>>>> set vm.swappiness=0.
> > > > > >>>>>
> > > > > >>>>> However, it still ran into problems. Please find the errors below.
> > > > >>>>>
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > > > > >>>>> contact region server x.x.x.x:60021 for region
> > > > > >>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after
> > > > > >>>>> 10 attempts.
> > > > > >>>>> Exceptions:
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > > > > >>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
> > > > > >>>>>
> > > > > >>>>>       at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
> > > > > >>>>>       at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> The input file is about 10GB, around 200 million rows of data.
> > > > > >>>>> This load doesn't seem too large. However, these kinds of errors keep
> > > > > >>>>> popping up.
> > > > > >>>>>
> > > > > >>>>> Do regionservers need to be deployed on dedicated machines?
> > > > > >>>>> Does Zookeeper need to be deployed on dedicated machines as well?
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>> zhenyu
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]>
> wrote:
> > > > >>>>>
> > > > >>>>>  Hi Zhenyu,
> > > > >>>>>>
> > > > >>>>>> Sorry for the delay. I started working on this a while back,
> > > before
> > > > I
> > > > >>>>>>
> > > > >>>>> left
> > > > >>>>>
> > > > >>>>>> my job for another company. Since then I haven't had much time
> > to
> > > > work
> > > > >>>>>>
> > > > >>>>> on
> > > > >>>>
> > > > >>>>> HBase unfortunately :(. I'll try to dig up what I had and see
> > what
> > > > >>>>>>
> > > > >>>>> shape
> > > > >>>>
> > > > >>>>> it's in and update you.
> > > > >>>>>>
> > > > >>>>>> Cheers,
> > > > >>>>>> -n
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
> > > > >>>>>>
> > > > >>>>>>  Sorry I must have mistyped, I meant to say "40 seconds".  You
> > can
> > > > >>>>>>
> > > > >>>>>>> still see multi-second pauses at times, so you need to give
> > > > yourself
> > > > >>>>>>> a
> > > > >>>>>>> bigger buffer.
> > > > >>>>>>>
> > > > >>>>>>> The parallel threads argument should not be necessary, but
> you
> > do
> > > > >>>>>>> need
> > > > >>>>>>> the UseConcMarkSweepGC flag as well.
> > > > >>>>>>>
> > > > >>>>>>> Let us know how it goes!
> > > > >>>>>>> -ryan
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <
> > > > >>>>>>>
> > > > >>>>>> [email protected]>
> > > > >>>>
> > > > >>>>>  wrote:
> > > > >>>>>>>
> > > > >>>>>>>  Ryan,
> > > > >>>>>>>> I am very appreciated for your feedbacks.
> > > > >>>>>>>> I have set the zookeeper.session.timeout to seconds which is
> > way
> > > > >>>>>>>>
> > > > >>>>>>> higher
> > > > >>>>
> > > > >>>>>  than
> > > > >>>>>>>> 40ms.
> > > > >>>>>>>> In the same time, the -Xms is set to 4GB, which should be
> > > > >>>>>>>> sufficient.
> > > > >>>>>>>> I also tried GC options like
> > > > >>>>>>>>
> > > > >>>>>>>>  -XX:ParallelGCThreads=8
> > > > >>>>>>>> -XX:+UseConcMarkSweepGC
> > > > >>>>>>>>
> > > > >>>>>>>> I even set the vm.swappiness=0
> > > > >>>>>>>>
> > > > >>>>>>>> However, I still came across the problem that a RegionServer
> > > > >>>>>>>> shutdown
> > > > >>>>>>>> itself.
> > > > >>>>>>>>
> > > > >>>>>>>> Best,
> > > > >>>>>>>> zhong
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <
> > > [email protected]>
> > > > >>>>>>>>
> > > > >>>>>>> wrote:
> > > > >>>>>
> > > > >>>>>>   Set the ZK timeout to something like 40ms, and give the GC
> > > enough
> > > > >>>>>>>>
> > > > >>>>>>> Xmx
> > > > >>>>
> > > > >>>>>  so you never risk entering the much dreaded
> > > concurrent-mode-failure
> > > > >>>>>>>>> whereby the entire heap must be GCed.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Consider testing Java 7 and the G1 GC.
> > > > >>>>>>>>>
> > > > >>>>>>>>> We could get a JNI thread to do this, but no one has done
> so
> > > yet.
> > > > I
> > > > >>>>>>>>>
> > > > >>>>>>>> am
> > > > >>>>
> > > > >>>>>  personally hoping for G1 and in the meantime overprovision our
> > Xmx
> > > > >>>>>>>>>
> > > > >>>>>>>> to
> > > > >>>>
> > > > >>>>>  avoid the concurrent mode failures.
> > > > >>>>>>>>>
> > > > >>>>>>>>> -ryan
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <
> > > > >>>>>>>>>
> > > > >>>>>>>> [email protected]>
> > > > >>>>>
> > > > >>>>>>  wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>  Ryan,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thank you very much.
> > > > >>>>>>>>>> May I ask whether there are any ways to get around this
> > > problem
> > > > to
> > > > >>>>>>>>>>
> > > > >>>>>>>>> make
> > > > >>>>>
> > > > >>>>>>  HBase more stable?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> best,
> > > > >>>>>>>>>> zhong
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <
> > > > [email protected]>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>  There isnt any working code yet. Just an idea, and a
> > > prototype.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> There is some sense that if we can get the G1 GC that we
> > > could
> > > > >>>>>>>>>>> get
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> rid
> > > > >>>>>
> > > > >>>>>>  of all long pauses, and avoid the need for this.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> -ryan
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <
> > > > >>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>  Hi,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I am very interesting to the solution that Joey proposed
> > and
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> would
> > > > >>>>
> > > > >>>>>   like
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> have a try.
> > > > >>>>>>>>>>>> Does anyone have any ideas on how to deploy this
> > zk_wrapper
> > > in
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> JNI
> > > > >>>>
> > > > >>>>>   integration?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I would be very appreciated.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> thanks
> > > > >>>>>>>>>>>> zhong
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>
> > > > >
> > > >
> > >
> >
>
