Hi all, I have moved this to JIRA:
https://issues.apache.org/jira/browse/HBASE-13942

Please post all your opinions there.

Thanks

On Mon, Jun 22, 2015 at 10:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I was out of the country this past week where access to gmail was
> difficult.
>
> Looking at client stack trace, it seems the hang corresponded to the
> following:
> at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:145)
> at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
>
> Will continue digging through the stack traces / logs.
>
> Cheers
>
> On Wed, Jun 17, 2015 at 10:35 PM, mukund murrali <mukundmurra...@gmail.com>
> wrote:
>
> > Even with 1.1.0 the issue persists. Client side blocking wait still happens
> > during first region split. Tried in distributed set up with 1.0.0 as
> > suggested by you and had the same results.
> >
> > Client jstack - http://pastebin.com/Ptw0JhdG
> >
> > RS Hosting Table Log - http://pastebin.com/ZSD4YUE5
> >
> > One point to note is The RS having hbase:meta showed no logs of split but
> > the master had info about it. Why is it so? hbase:meta moved to master?
> >
> > Master Log: http://pastebin.com/f2suyNr1
> >
> > One more interesting finding is in thread stack of RS Hosting table from
> > the time client hangs, there is a hconnection in waiting state. Subsequent
> > thread dumps also had hconnection in waiting state. Is there any deadlock?
> > See if it can be of any use for analyzing.
> >
> > Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB
> >
> > Also AM.ZK.Worker threads waiting in Master. The pastebin of HMaster during
> > client hang and region split is
> >
> > http://pastebin.com/3pgVYpYW
> >
> > Thanks
> >
> > On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Looking at the revision history for ClientSmallReversedScanner.java which
> > > appeared in the stack trace, there have been several bug fixes on top of
> > > the hbase release you're using.
> > >
> > > Can you try hbase 1.1.0 to see if the problem can be reproduced (in cluster
> > > deployment) ?
> > >
> > > Thanks
> > >
> > > On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <mukundmurra...@gmail.com>
> > > wrote:
> > >
> > > > Kindly look into this for full trace of RS.
> > > > http://pastebin.com/VS17vVd8
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > Can you pastebin the complete stack trace for the region server ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <mukundmurra...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > We are using HBase-1.0.0. Just before the client stalled, in RS there were
> > > > > > few handler threads that were blocked for MVCC(thread stack below) check.
> > > > > > Not sure if it could cause a problem. I don't see anything unusual in RS
> > > > > > threads. Also the same client can connect to regionserver after restart. At
> > > > > > that instant what causing the problem is what we are confused.
> > > > > >
> > > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > at java.lang.Object.wait(Native Method)
> > > > > > at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > > > - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > > > at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > > > at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > > > at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > > > at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > > > at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > > > at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > > > at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > > > at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > > > at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > > > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > > > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > > > at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > > > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > > > at java.lang.Thread.run(Thread.java:745)
> > > > > >
> > > > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hb...@gmail.com> wrote:
> > > > > >>
> > > > > >> Can you see at this time, what the threads at RS doing? Handlers mainly..
> > > > > >> which version oh hbase?
> > > > > >>
> > > > > >>> On Tuesday, June 9, 2015, mukund murrali <mukundmurra...@gmail.com> wrote:
> > > > > >>> Hi
> > > > > >>>
> > > > > >>> I wrote a sample program with default client configurations and created a
> > > > > >>> single connection. I spawn client threads > hbase.hconnection.threads.max
> > > > > >>> from my client application and each thread insert data to hbase cluster.
> > > > > >>> Once a region split happens, all the hconnection threads(core pool and max
> > > > > >>> pool size were kept at 256) stalled at BoundedCompletionService.take()
> > > > > >>> indefinitely. Even after the split completed it never resumed.
> > > > > >>>
> > > > > >>> So does it mean I have to create more instances of connection object for a
> > > > > >>> cluster in such scenarios (which is really not needed) ? There was no
> > > > > >>> exception (I expected a RejectedExecution) also in client side. So changing
> > > > > >>> the hbase.hconnection.threads.max, hbase.hconnection.threads.core can
> > > > > >>> create such problem?
> > > > > >>>
> > > > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <
> > > > > >>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>
> > > > > >>>> Not very sure on what could be the problem when the meta update happened.
> > > > > >>>> I would think that when the region split happened, there was some issue on
> > > > > >>>> the meta update (as you said in the later mail). The splitted regions would
> > > > > >>>> not have been updated properly in the META. So any client updates/reads
> > > > > >>>> happening to this region would have stalled and hence your client
> > > > > >>>> application also stalled.
> > > > > >>>>
> > > > > >>>> As I said the logs would be important here to know what happened. This
> > > > > >>>> could be one of a case and could be identified with the logs.
> > > > > >>>>
> > > > > >>>> Regards
> > > > > >>>> Ram
> > > > > >>>>
> > > > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <mukundmurra...@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Sorry for misleading by specifying it as meta split. It was meta update
> > > > > >>>>> during a user region split. This had caused the stallation probably. We
> > > > > >>>>> have right now reverting client configs. Till now we didn't face the issue
> > > > > >>>>> again. Those changes causing some kindof exceptions or timeout was what we
> > > > > >>>>> expected, but clients stalling indefinitely is what worrying us.
> > > > > >>>>>
> > > > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <vladrodio...@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> I would suggest reverting client config changes back to defaults. At least
> > > > > >>>>>> we will know if the issue is somehow related to client config changes.
> > > > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan" <
> > > > > >>>>>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hbase:meta getting split? It may b some user region, can u check that? If
> > > > > >>>>>>> ur meta was splitting then there is something wrong.
> > > > > >>>>>>> Can u attach the log snippets.
> > > > > >>>>>>>
> > > > > >>>>>>> Sent from phone. Excuse typos.
> > > > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali" <mukundmurra...@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hi
> > > > > >>>>>>>>
> > > > > >>>>>>>> In our case there at that instance when the client thread stalled, there
> > > > > >>>>>>>> was a hbase:meta region split happening. So what went wrong? If there is a
> > > > > >>>>>>>> split why should hconnection thread stall? Since we changed the client
> > > > > >>>>>>>> configuration caused this? I am once again specifying our client related
> > > > > >>>>>>>> changes we did
> > > > > >>>>>>>>
> > > > > >>>>>>>> hbase.client.retries.number => 5
> > > > > >>>>>>>> zookeeper.recovery.retry => 0
> > > > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > > > >>>>>>>> hbase.rpc.timeout => 30000.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Is zk timeout too low?
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan <
> > > > > >>>>>>>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> When you started your client server was the META table assigned. May be
> > > > > >>>>>>>>> some thing happened around that time and the client app was just waiting on
> > > > > >>>>>>>>> the meta table to be assigned. It would have retried - Can you check the
> > > > > >>>>>>>>> logs.?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> So the best part here is the stand alone client was able to be successful -
> > > > > >>>>>>>>> which means the new clients were able to talk successfully with the
> > > > > >>>>>>>>> server. And hence the restart of your client has solved your problem. It
> > > > > >>>>>>>>> may be difficult to trouble shoot the exact issue with the limited info -
> > > > > >>>>>>>>> but see if your client app regularly gets stalled and then it is better to
> > > > > >>>>>>>>> trouble shoot your app and the way it accesses the server.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR <praneesh.san...@gmail.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> The client connection was in stalled state. But there was only one
> > > > > >>>>>>>>>> hconnection thread found in our thread dump, which was waiting indefinitely
> > > > > >>>>>>>>>> in BoundedCompletionService.take call. Meanwhile we ran a standalone test
> > > > > >>>>>>>>>> program which was successful.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Once we restarted the client server, the problem got resolved.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The basic doubt is, when the hconnection thread stalled, why the HBase
> > > > > >>>>>>>>>> client failed to create any more hconnections(max pool size was 10). In
> > > > > >>>>>>>>>> case of problem with table/meta regions how come the test program
> > > > > >>>>>>>>>> succeeded.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Regards,
> > > > > >>>>>>>>>> Praneesh
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan <
> > > > > >>>>>>>>>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Can you tell us more. Is your client not working at all and it is stalled ?
> > > > > >>>>>>>>>>> Are you seeing some results but you find it slow than you expected?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> What type of workload are you running? All the tables are healthy? Are
> > > > > >>>>>>>>>>> you able to read or write to them individually using the hbase shell?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR <praneesh.san...@gmail.com>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Hi Ram,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days with low load,
> > > > > >>>>>>>>>>>> once we enabled it for high load we immediately faced this issue.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>> Praneesh.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan <
> > > > > >>>>>>>>>>>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Is your cluster in working condition. Can you see if the META has been
> > > > > >>>>>>>>>>>>> assigned properly? If the META table is not initialized and opened then
> > > > > >>>>>>>>>>>>> your client thread will hang.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Regards
> > > > > >>>>>>>>>>>>> Ram
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR <praneesh.san...@gmail.com>
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> We are using Hbase-1.0.0. We also facing the same issue that client
> > > > > >>>>>>>>>>>>>> connection thread is waiting at
> > > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Any help is appreciated.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>>>> Praneesh
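
For anyone reading this thread later: the hang reported above at BoundedCompletionService.take() looks like the ordinary behaviour of a blocking take() when no submitted task ever completes. The sketch below only illustrates that difference between an unbounded take() and a bounded poll(). It uses the JDK's ExecutorCompletionService as a stand-in. It is not HBase's BoundedCompletionService, and it is not the fix tracked in HBASE-13942. The class name CompletionWaitSketch and the method awaitWithTimeout are made up for this illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; nothing here comes from the HBase code base.
public class CompletionWaitSketch {

    /**
     * Submits one task and waits at most timeoutMillis for it to complete.
     * Returns the task's result, or null if nothing completed in time.
     */
    public static String awaitWithTimeout(Callable<String> task, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            ExecutorCompletionService<String> cs = new ExecutorCompletionService<>(pool);
            cs.submit(task);
            // cs.take() would block here with no upper bound until a task completes.
            // poll(timeout, unit) gives up and returns null once the timeout elapses.
            Future<String> done = cs.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return done == null ? null : done.get();
        } finally {
            pool.shutdownNow(); // interrupt the task if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes immediately, standing in for a lookup that succeeds.
        String fast = awaitWithTimeout(() -> "located", 1000);
        // A task that outlives the timeout, standing in for a lookup that stalls.
        String slow = awaitWithTimeout(() -> { Thread.sleep(60_000); return "late"; }, 200);
        System.out.println("fast=" + fast + ", slow=" + slow);
    }
}
```

With a bounded wait the caller gets a null back and can retry or fail, instead of parking forever as the hconnection threads in the jstack output above appear to do. Whether the HBase client should use a bounded wait at that call site is a question for the JIRA issue.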