Re: HConnection thread waiting on blocking queue indefinitely

Ted Yu Sun, 21 Jun 2015 22:23:52 -0700

I was out of the country this past week, where access to Gmail was difficult.


Looking at the client stack trace, it seems the hang corresponds to the
following frames:
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:145)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
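
For anyone who wants to check from inside the client JVM whether other
threads are parked in the same frames, a minimal sketch using only the JDK
follows. The class name StackFrameFinder and its two helper methods are
made up for illustration and are not part of HBase; the sketch only
inspects the JVM it runs in, so it would have to be called from the client
application itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: scans live threads for a given frame.
public class StackFrameFinder {

    /** Returns true if any frame in the stack belongs to the given class and method. */
    public static boolean hasFrame(StackTraceElement[] stack, String className, String methodName) {
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().equals(className)
                    && frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    /** Names (with state) of live threads in this JVM whose stack contains the given frame. */
    public static List<String> threadsInFrame(String className, String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (hasFrame(e.getValue(), className, methodName)) {
                names.add(e.getKey().getName() + " [" + e.getKey().getState() + "]");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Class and method names are taken from the client stack trace in this thread.
        List<String> stuck = threadsInFrame(
                "org.apache.hadoop.hbase.client.ClientSmallReversedScanner", "next");
        System.out.println("threads in ClientSmallReversedScanner.next: " + stuck);
    }
}
```

Calling threadsInFrame periodically and logging a non-empty result is one
way to notice the stall without taking a full jstack each time.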

Will continue digging through the stack traces / logs.
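
On the concern raised below that the client blocks forever instead of
failing: the difference between an unbounded take() and a bounded poll()
on a completion service can be sketched with plain java.util.concurrent.
This is not HBase code and does not use BoundedCompletionService; it uses
the JDK's ExecutorCompletionService, and the class BoundedWait and the
method takeOrTimeout are hypothetical names for illustration only.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not part of HBase.
public class BoundedWait {

    /** Waits up to timeoutMs for the next completed task; returns fallback if none completes. */
    public static <T> T takeOrTimeout(CompletionService<T> cs, long timeoutMs, T fallback)
            throws InterruptedException, ExecutionException {
        // poll(timeout) returns null on timeout, whereas take() would block indefinitely.
        Future<T> done = cs.poll(timeoutMs, TimeUnit.MILLISECONDS);
        return done == null ? fallback : done.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        CountDownLatch never = new CountDownLatch(1);
        // Simulates a call that never returns on its own.
        cs.submit(() -> { never.await(); return "late"; });
        System.out.println(takeOrTimeout(cs, 200, "timed out"));
        pool.shutdownNow(); // interrupts the stuck task so the JVM can exit
    }
}
```

With take() in place of poll(), the main thread in this sketch would wait
forever, which matches the symptom reported in the thread dumps.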

Cheers

On Wed, Jun 17, 2015 at 10:35 PM, mukund murrali <mukundmurra...@gmail.com>
wrote:

> Even with 1.1.0 the issue persists. Client side blocking wait still happens
> during first region split. Tried in distributed set up with 1.0.0 as
> suggested by you and had the same results.
>
> Client jstack - http://pastebin.com/Ptw0JhdG
>
> RS Hosting Table Log - http://pastebin.com/ZSD4YUE5
>
> One point to note is that the RS hosting hbase:meta showed no logs of the
> split, but the master had info about it. Why is that? Did hbase:meta move
> to the master?
>
> Master Log: http://pastebin.com/f2suyNr1
>
> One more interesting finding is in thread stack of RS Hosting table from
> the time client hangs, there is a hconnection in waiting state. Subsequent
> thread dumps also had hconnection in waiting state. Is there any deadlock?
> See if it can be of any use for analyzing.
>
> Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB
>
> Also, AM.ZK.Worker threads are waiting in the Master. The pastebin of
> HMaster during the client hang and region split is
>
> http://pastebin.com/3pgVYpYW
>
> Thanks
>
> On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Looking at the revision history for ClientSmallReversedScanner.java which
> > appeared in the stack trace, there have been several bug fixes on top of
> > the hbase release you're using.
> >
> > Can you try hbase 1.1.0 to see if the problem can be reproduced (in a
> > cluster deployment)?
> >
> > Thanks
> >
> > On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <
> mukundmurra...@gmail.com>
> > wrote:
> >
> > > Kindly look into this for full trace of RS.
> > > http://pastebin.com/VS17vVd8
> > >
> > > Thanks
> > >
> > > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > Can you pastebin the complete stack trace for the region server ?
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <
> > mukundmurra...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > We are using HBase-1.0.0. Just before the client stalled, there were
> > > > > > a few handler threads in the RS that were blocked on the MVCC check
> > > > > > (thread stack below). Not sure if it could cause a problem. I don't
> > > > > > see anything unusual in the RS threads. Also, the same client can
> > > > > > connect to the regionserver after a restart. What is causing the
> > > > > > problem at that instant is what confuses us.
> > > > >
> > > > >
> > > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > >        at java.lang.Object.wait(Native Method)
> > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > > >        - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > > >        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > > >        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > > >        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > > >        at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hb...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> Can you see what the threads at the RS are doing at this time?
> > > > > >> Handlers mainly. Which version of hbase?
> > > > >>
> > > > >>> On Tuesday, June 9, 2015, mukund murrali <
> mukundmurra...@gmail.com
> > >
> > > > wrote:
> > > > >>> Hi
> > > > >>>
> > > > > >>> I wrote a sample program with default client configurations and
> > > > > >>> created a single connection. My client application spawns more
> > > > > >>> client threads than hbase.hconnection.threads.max, and each thread
> > > > > >>> inserts data into the hbase cluster. Once a region split happens,
> > > > > >>> all the hconnection threads (core pool and max pool size were kept
> > > > > >>> at 256) stalled at BoundedCompletionService.take() indefinitely.
> > > > > >>> Even after the split completed, they never resumed.
> > > > > >>>
> > > > > >>> So does it mean I have to create more instances of the connection
> > > > > >>> object for a cluster in such scenarios (which is really not
> > > > > >>> needed)? There was also no exception on the client side (I expected
> > > > > >>> a RejectedExecution). So can changing hbase.hconnection.threads.max
> > > > > >>> and hbase.hconnection.threads.core create such a problem?
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <
> > > > >>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > >>>
> > > > > >>>> Not very sure what the problem could be when the meta update
> > > > > >>>> happened. I would think that when the region split happened, there
> > > > > >>>> was some issue with the meta update (as you said in the later
> > > > > >>>> mail). The split regions would not have been updated properly in
> > > > > >>>> META, so any client updates/reads going to this region would have
> > > > > >>>> stalled, and hence your client application stalled as well.
> > > > > >>>>
> > > > > >>>> As I said, the logs would be important here to know what happened.
> > > > > >>>> This could be a one-off case and could be identified from the logs.
> > > > >>>>
> > > > >>>> Regards
> > > > >>>> Ram
> > > > >>>>
> > > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <
> > > > >> mukundmurra...@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > > >>>>> Sorry for misleading by specifying it as a meta split. It was a
> > > > > >>>>> meta update during a user region split. This probably caused the
> > > > > >>>>> stall. We are now reverting the client configs. Till now we
> > > > > >>>>> haven't faced the issue again. We expected those changes to cause
> > > > > >>>>> some kind of exception or timeout, but clients stalling
> > > > > >>>>> indefinitely is what is worrying us.
> > > > >>>>>
> > > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <
> vladrodio...@gmail.com
> > >
> > > > >> wrote:
> > > > >>>>>
> > > > > >>>>>> I would suggest reverting client config changes back to defaults.
> > > > > >>>>>> At least we will know if the issue is somehow related to client
> > > > > >>>>>> config changes.
> > > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan" <
> > > > >>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > > >>>>>>
> > > > > >>>>>>> Is hbase:meta getting split? It may be some user region; can you
> > > > > >>>>>>> check that? If your meta was splitting then there is something
> > > > > >>>>>>> wrong. Can you attach the log snippets?
> > > > >>>>>>>
> > > > >>>>>>> Sent from phone. Excuse typos.
> > > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali" <
> > > > >> mukundmurra...@gmail.com
> > > > >>>>>> <javascript:;>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi
> > > > >>>>>>>>
> > > > > >>>>>>>> In our case, at the instant when the client thread stalled,
> > > > > >>>>>>>> there was an hbase:meta region split happening. So what went
> > > > > >>>>>>>> wrong? If there is a split, why should the hconnection thread
> > > > > >>>>>>>> stall? Did our client configuration changes cause this? Once
> > > > > >>>>>>>> again, these are the client-related changes we made:
> > > > > >>>>>>>>
> > > > > >>>>>>>> hbase.client.retries.number => 5
> > > > > >>>>>>>> zookeeper.recovery.retry => 0
> > > > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > > > >>>>>>>> hbase.rpc.timeout => 30000
> > > > >>>>>>>>
> > > > >>>>>>>> Is zk timeout too low?
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan <
> > > > >>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > > >>>>>>>>
> > > > > >>>>>>>>> When you started your client server, was the META table
> > > > > >>>>>>>>> assigned? Maybe something happened around that time and the
> > > > > >>>>>>>>> client app was just waiting on the meta table to be assigned.
> > > > > >>>>>>>>> It would have retried. Can you check the logs?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> So the best part here is that the standalone client was
> > > > > >>>>>>>>> successful, which means the new clients were able to talk to
> > > > > >>>>>>>>> the server. Hence the restart of your client solved your
> > > > > >>>>>>>>> problem. It may be difficult to troubleshoot the exact issue
> > > > > >>>>>>>>> with the limited info, but if your client app regularly gets
> > > > > >>>>>>>>> stalled, then it is better to troubleshoot your app and the
> > > > > >>>>>>>>> way it accesses the server.
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR <
> > > > >>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > > >>>>>>>>>> The client connection was in a stalled state. But only one
> > > > > >>>>>>>>>> hconnection thread was found in our thread dump, and it was
> > > > > >>>>>>>>>> waiting indefinitely in the BoundedCompletionService.take
> > > > > >>>>>>>>>> call. Meanwhile we ran a standalone test program, which was
> > > > > >>>>>>>>>> successful.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Once we restarted the client server, the problem got resolved.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The basic doubt is: when the hconnection thread stalled, why
> > > > > >>>>>>>>>> did the HBase client fail to create any more hconnections
> > > > > >>>>>>>>>> (max pool size was 10)? And if there was a problem with the
> > > > > >>>>>>>>>> table/meta regions, how come the test program succeeded?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Regards,
> > > > >>>>>>>>>> Praneesh
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan <
> > > > >>>>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Can you tell us more? Is your client not working at all,
> > > > > >>>>>>>>>>> i.e. is it stalled? Or are you seeing some results but
> > > > > >>>>>>>>>>> finding it slower than you expected?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> What type of workload are you running? Are all the tables
> > > > > >>>>>>>>>>> healthy? Are you able to read or write to them individually
> > > > > >>>>>>>>>>> using the hbase shell?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR <
> > > > >>>>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Ram,
> > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days
> > > > > >>>>>>>>>>>> with low load; once we enabled it for high load we
> > > > > >>>>>>>>>>>> immediately faced this issue.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>> Praneesh.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan <
> > > > >>>>>>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Is your cluster in working condition? Can you see if META
> > > > > >>>>>>>>>>>>> has been assigned properly? If the META table is not
> > > > > >>>>>>>>>>>>> initialized and opened, then your client thread will hang.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Regards
> > > > >>>>>>>>>>>>> Ram
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR <
> > > > >>>>>>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > > >>>>>>>>>>>>> <javascript:;>>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> We are using Hbase-1.0.0. We are also facing the same
> > > > > >>>>>>>>>>>>>> issue: the client connection thread is waiting at
> > > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Any help is appreciated.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>>>> Praneesh
> > > > >>
> > > >
> > >
> >
>
