Fwd: HConnection thread waiting on blocking queue indefinitely

mukund murrali Wed, 17 Jun 2015 22:35:53 -0700

Even with 1.1.0 the issue persists. The client-side blocking wait still
happens during the first region split. We also tried a distributed setup
with 1.0.0, as you suggested, and got the same results.


Client jstack - http://pastebin.com/Ptw0JhdG

RS Hosting Table Log - http://pastebin.com/ZSD4YUE5

One point to note: the RS hosting hbase:meta showed no logs of the split,
but the master had info about it. Why is that? Has hbase:meta moved to the
master?

Master Log: http://pastebin.com/f2suyNr1

One more interesting finding: in the thread stack of the RS hosting the
table, from the time the client hangs, there is an hconnection thread in
waiting state. Subsequent thread dumps also show the hconnection in waiting
state. Is there a deadlock? See if this is of any use for the analysis.

Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB

Also, AM.ZK.Worker threads are waiting in the Master. The pastebin of the
HMaster during the client hang and region split is:

http://pastebin.com/3pgVYpYW
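
For reference, the kind of indefinite wait seen in the client jstack can be
reproduced with plain JDK classes. The following is only a minimal sketch,
not HBase code: it assumes a task whose result never arrives (a stand-in for
an RPC that never completes) and uses the JDK's ExecutorCompletionService
rather than HBase's BoundedCompletionService. The class and method names
(TakeVersusPoll, pollTimesOut) are made up for illustration. A take() on such
a service parks the caller until some result shows up, whereas a poll with a
timeout returns null and lets the caller react.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from HBase.
class TakeVersusPoll {

    /**
     * Submits one task that never finishes on its own (a stand-in for an
     * RPC whose reply never arrives) and waits for it with a bounded poll.
     * Returns true if the wait timed out, i.e. no result became available.
     */
    static boolean pollTimesOut(long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CompletionService<String> completions = new ExecutorCompletionService<>(pool);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            completions.submit(() -> {
                neverReleased.await(); // blocks until the thread is interrupted
                return "done";
            });
            // completions.take() at this point would park the caller
            // indefinitely, because no task ever completes.
            Future<String> result = completions.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return result == null; // null means the timeout elapsed
        } finally {
            pool.shutdownNow(); // interrupts the stuck task so the JVM can exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("poll timed out: " + pollTimesOut(200));
    }
}
```

Whether the HBase client can use a bounded wait at this point depends on its
internals; the sketch only shows why a thread parked in take() gives no
exception and no timeout.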

Thanks

On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Looking at the revision history for ClientSmallReversedScanner.java which
> appeared in the stack trace, there have been several bug fixes on top of
> the hbase release you're using.
>
> Can you try hbase 1.1.0 to see if the problem can be reproduced (in cluster
> deployment) ?
>
> Thanks
>
> On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <mukundmurra...@gmail.com>
> wrote:
>
> > Kindly look into this for full trace of RS.
> > http://pastebin.com/VS17vVd8
> >
> > Thanks
> >
> > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Can you pastebin the complete stack trace for the region server ?
> > >
> > > Thanks
> > >
> > >
> > >
> > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <
> mukundmurra...@gmail.com>
> > > wrote:
> > > >
> > > > > We are using HBase-1.0.0. Just before the client stalled, a few
> > > > > handler threads in the RS were blocked on the MVCC check (thread
> > > > > stack below). Not sure if that could cause a problem. I don't see
> > > > > anything unusual in the RS threads. Also, the same client can connect
> > > > > to the regionserver after a restart. What is causing the problem at
> > > > > that instant is what confuses us.
> > > >
> > > >
> > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > >        at java.lang.Object.wait(Native Method)
> > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > >        - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > >        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > >        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >        at java.lang.Thread.run(Thread.java:745)
> > > >
> > > >
> > > >
> > > >
> > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hb...@gmail.com>
> > > wrote:
> > > >>
> > > > >> Can you see what the threads at the RS are doing at this time?
> > > > >> Handlers mainly. Which version of HBase?
> > > >>
> > > >>> On Tuesday, June 9, 2015, mukund murrali <mukundmurra...@gmail.com
> >
> > > wrote:
> > > >>> Hi
> > > >>>
> > > > >>> I wrote a sample program with default client configurations and
> > > > >>> created a single connection. I spawn more client threads than
> > > > >>> hbase.hconnection.threads.max from my client application, and each
> > > > >>> thread inserts data into the HBase cluster. Once a region split
> > > > >>> happens, all the hconnection threads (core pool and max pool size
> > > > >>> were kept at 256) stalled at BoundedCompletionService.take()
> > > > >>> indefinitely. Even after the split completed they never resumed.
> > > > >>>
> > > > >>> So does it mean I have to create more instances of the connection
> > > > >>> object for a cluster in such scenarios (which is really not needed)?
> > > > >>> There was also no exception on the client side (I expected a
> > > > >>> RejectedExecution). So can changing hbase.hconnection.threads.max
> > > > >>> and hbase.hconnection.threads.core create such a problem?
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <
> > > >>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > >>>
> > > > >>>> Not very sure what the problem could be when the meta update
> > > > >>>> happened. I would think that when the region split happened, there
> > > > >>>> was some issue with the meta update (as you said in the later mail).
> > > > >>>> The split regions would not have been updated properly in META, so
> > > > >>>> any client updates/reads going to this region would have stalled,
> > > > >>>> and hence your client application also stalled.
> > > > >>>>
> > > > >>>> As I said, the logs would be important here to know what happened.
> > > > >>>> This could be one such case and could be identified from the logs.
> > > >>>>
> > > >>>> Regards
> > > >>>> Ram
> > > >>>>
> > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <
> > > >> mukundmurra...@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > > >>>>> Sorry for misleading you by calling it a meta split. It was a meta
> > > > >>>>> update during a user region split. This probably caused the stall.
> > > > >>>>> We have now reverted the client configs, and so far we have not
> > > > >>>>> faced the issue again. We expected those changes to cause some kind
> > > > >>>>> of exception or timeout, but clients stalling indefinitely is what
> > > > >>>>> worries us.
> > > >>>>>
> > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <vladrodio...@gmail.com
> >
> > > >> wrote:
> > > >>>>>
> > > > >>>>>> I would suggest reverting client config changes back to defaults.
> > > > >>>>>> At least we will know if the issue is somehow related to client
> > > > >>>>>> config changes.
> > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan" <
> > > >>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > >>>>>>
> > > > >>>>>>> hbase:meta getting split? It may be some user region; can you
> > > > >>>>>>> check that? If your meta was splitting then there is something
> > > > >>>>>>> wrong. Can you attach the log snippets?
> > > >>>>>>>
> > > >>>>>>> Sent from phone. Excuse typos.
> > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali" <
> > > >> mukundmurra...@gmail.com
> > > >>>>>> <javascript:;>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi
> > > >>>>>>>>
> > > > >>>>>>>> In our case, at the instant the client thread stalled, there was
> > > > >>>>>>>> an hbase:meta region split happening. So what went wrong? If
> > > > >>>>>>>> there is a split, why should the hconnection thread stall? Could
> > > > >>>>>>>> changing the client configuration have caused this? I am once
> > > > >>>>>>>> again listing the client-related changes we made:
> > > >>>>>>>>
> > > >>>>>>>> hbase.client.retries.number => 5
> > > >>>>>>>> zookeeper.recovery.retry => 0
> > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > >>>>>>>> hbase.rpc.timeout => 30000.
> > > >>>>>>>>
> > > >>>>>>>> Is zk timeout too low?
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan <
> > > >>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > >>>>>>>>
> > > > >>>>>>>>> When you started your client server, was the META table
> > > > >>>>>>>>> assigned? Maybe something happened around that time and the
> > > > >>>>>>>>> client app was just waiting for the meta table to be assigned.
> > > > >>>>>>>>> It would have retried. Can you check the logs?
> > > > >>>>>>>>>
> > > > >>>>>>>>> The good part is that the standalone client succeeded, which
> > > > >>>>>>>>> means new clients were able to talk to the server, and hence
> > > > >>>>>>>>> restarting your client solved your problem. It may be difficult
> > > > >>>>>>>>> to troubleshoot the exact issue with the limited info, but if
> > > > >>>>>>>>> your client app stalls regularly, it is better to troubleshoot
> > > > >>>>>>>>> your app and the way it accesses the server.
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR <
> > > >>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > > >>>>>>>>>> The client connection was in a stalled state. But only one
> > > > >>>>>>>>>> hconnection thread was found in our thread dump, and it was
> > > > >>>>>>>>>> waiting indefinitely in the BoundedCompletionService.take
> > > > >>>>>>>>>> call. Meanwhile we ran a standalone test program, which was
> > > > >>>>>>>>>> successful.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Once we restarted the client server, the problem got resolved.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> The basic doubt is: when the hconnection thread stalled, why
> > > > >>>>>>>>>> did the HBase client fail to create any more hconnections (max
> > > > >>>>>>>>>> pool size was 10)? And if the problem was with the table/meta
> > > > >>>>>>>>>> regions, how come the test program succeeded?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Praneesh
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan <
> > > >>>>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > >>>>>>>>>>
> > > > >>>>>>>>>>> Can you tell us more? Is your client not working at all, i.e.
> > > > >>>>>>>>>>> stalled? Or are you seeing some results but finding it slower
> > > > >>>>>>>>>>> than you expected?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> What type of workload are you running? Are all the tables
> > > > >>>>>>>>>>> healthy? Are you able to read or write to them individually
> > > > >>>>>>>>>>> using the hbase shell?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR <
> > > >>>>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Ram,
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days
> > > > >>>>>>>>>>>> with low load. Once we enabled it for high load, we
> > > > >>>>>>>>>>>> immediately faced this issue.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>> Praneesh.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan <
> > > >>>>>>>>>>>> ramkrishna.s.vasude...@gmail.com <javascript:;>> wrote:
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Is your cluster in working condition? Can you see if the
> > > > >>>>>>>>>>>>> META has been assigned properly? If the META table is not
> > > > >>>>>>>>>>>>> initialized and opened then your client thread will hang.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards
> > > >>>>>>>>>>>>> Ram
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR <
> > > >>>>>>>>>>>> praneesh.san...@gmail.com <javascript:;>
> > > >>>>>>>>>>>>> <javascript:;>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> We are using HBase-1.0.0. We are also facing the same
> > > > >>>>>>>>>>>>>> issue: the client connection thread is waiting at
> > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Any help is appreciated.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>> Praneesh
> > > >>
> > >
> >
>
