Re: HConnection thread waiting on blocking queue indefinitely

mukund murrali Sun, 21 Jun 2015 22:51:01 -0700

Hi All

I have moved this to a JIRA issue:


https://issues.apache.org/jira/browse/HBASE-13942

Please post all your opinions there.

Thanks

On Mon, Jun 22, 2015 at 10:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I was out of the country this past week where access to gmail was
> difficult.
>
> Looking at client stack trace, it seems the hang corresponded to the
> following:
>         at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:145)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
>
> Will continue digging through the stack traces / logs.
>
> Cheers
>
> On Wed, Jun 17, 2015 at 10:35 PM, mukund murrali <mukundmurra...@gmail.com>
> wrote:
>
> > Even with 1.1.0 the issue persists. The client-side blocking wait still
> > happens during the first region split. Tried in a distributed setup with
> > 1.0.0 as suggested by you and had the same results.
> >
> > Client jstack - http://pastebin.com/Ptw0JhdG
> >
> > RS Hosting Table Log - http://pastebin.com/ZSD4YUE5
> >
> > One point to note: the RS hosting hbase:meta showed no logs of the split,
> > but the master had info about it. Why is that? Has hbase:meta moved to the
> > master?
> >
> > Master Log: http://pastebin.com/f2suyNr1
> >
> > One more interesting finding: in the thread stack of the RS hosting the
> > table, from the time the client hangs, there is an hconnection thread in
> > waiting state. Subsequent thread dumps also had the hconnection thread in
> > waiting state. Is there a deadlock? See if it can be of any use for
> > analysis.
> >
> > Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB
> >
> > Also, AM.ZK.Worker threads are waiting in the Master. The pastebin of the
> > HMaster during the client hang and region split is
> >
> > http://pastebin.com/3pgVYpYW
> >
> > Thanks
> >
> > On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Looking at the revision history for ClientSmallReversedScanner.java, which
> > > appeared in the stack trace, there have been several bug fixes on top of
> > > the hbase release you're using.
> > >
> > > Can you try hbase 1.1.0 to see if the problem can be reproduced (in a
> > > cluster deployment)?
> > >
> > > Thanks
> > >
> > > On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <mukundmurra...@gmail.com>
> > > wrote:
> > >
> > > > Kindly look into this for full trace of RS.
> > > > http://pastebin.com/VS17vVd8
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > Can you pastebin the complete stack trace for the region server ?
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >
> > > > > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <mukundmurra...@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > We are using HBase-1.0.0. Just before the client stalled, a few
> > > > > > > handler threads in the RS were blocked on the MVCC check (thread
> > > > > > > stack below). Not sure if that could cause a problem. I don't see
> > > > > > > anything unusual in the RS threads. Also, the same client can
> > > > > > > connect to the regionserver after a restart. What was causing the
> > > > > > > problem at that instant is what confuses us.
> > > > > >
> > > > > >
> > > > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > >        at java.lang.Object.wait(Native Method)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > > > >        - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > > > >        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > > > >        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > > > >        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > > > >        at java.lang.Thread.run(Thread.java:745)
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hb...@gmail.com>
> > > > > > >> wrote:
> > > > > >>
> > > > > > >> Can you see what the threads at the RS are doing at this time?
> > > > > > >> Handlers mainly. Which version of hbase?
> > > > > >>
> > > > > > >>> On Tuesday, June 9, 2015, mukund murrali <mukundmurra...@gmail.com>
> > > > > > >>> wrote:
> > > > > >>> Hi
> > > > > >>>
> > > > > > >>> I wrote a sample program with default client configurations and
> > > > > > >>> created a single connection. I spawn more client threads than
> > > > > > >>> hbase.hconnection.threads.max from my client application, and each
> > > > > > >>> thread inserts data into the hbase cluster. Once a region split
> > > > > > >>> happens, all the hconnection threads (core pool and max pool size
> > > > > > >>> were kept at 256) stalled at BoundedCompletionService.take()
> > > > > > >>> indefinitely. Even after the split completed, they never resumed.
> > > > > > >>>
> > > > > > >>> So does it mean I have to create more instances of the connection
> > > > > > >>> object for a cluster in such scenarios (which should really not be
> > > > > > >>> needed)? There was also no exception on the client side (I
> > > > > > >>> expected a RejectedExecution). So can changing
> > > > > > >>> hbase.hconnection.threads.max and hbase.hconnection.threads.core
> > > > > > >>> create such a problem?
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <
> > > > > >>> ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>
> > > > > > >>>> Not very sure what the problem could be when the meta update
> > > > > > >>>> happened. I would think that when the region split happened,
> > > > > > >>>> there was some issue with the meta update (as you said in the
> > > > > > >>>> later mail). The split regions would not have been updated
> > > > > > >>>> properly in META, so any client updates/reads going to this
> > > > > > >>>> region would have stalled, and hence your client application
> > > > > > >>>> also stalled.
> > > > > > >>>>
> > > > > > >>>> As I said, the logs would be important here to know what
> > > > > > >>>> happened. This could be one such case and could be identified
> > > > > > >>>> from the logs.
> > > > > >>>> Regards
> > > > > >>>> Ram
> > > > > >>>>
> > > > > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <mukundmurra...@gmail.com>
> > > > > > >>>> wrote:
> > > > > >>>>
> > > > > > >>>>> Sorry for the confusion in describing it as a meta split. It was
> > > > > > >>>>> a meta update during a user region split. This probably caused
> > > > > > >>>>> the stall. We have now reverted the client configs, and so far
> > > > > > >>>>> we have not faced the issue again. We expected those changes to
> > > > > > >>>>> cause some kind of exception or timeout, but clients stalling
> > > > > > >>>>> indefinitely is what worries us.
> > > > > >>>>>
> > > > > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <vladrodio...@gmail.com>
> > > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > > >>>>>> I would suggest reverting client config changes back to
> > > > > > >>>>>> defaults. At least we will know if the issue is somehow related
> > > > > > >>>>>> to client config changes.
> > > > > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan"
> > > > > > >>>>>> <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>
> > > > > > >>>>>>> Is hbase:meta getting split? It may be some user region, can
> > > > > > >>>>>>> you check that? If your meta was splitting then there is
> > > > > > >>>>>>> something wrong.
> > > > > > >>>>>>> Can you attach the log snippets?
> > > > > >>>>>>>
> > > > > >>>>>>> Sent from phone. Excuse typos.
> > > > > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali"
> > > > > > >>>>>>> <mukundmurra...@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hi
> > > > > >>>>>>>>
> > > > > > >>>>>>>> In our case, at the instant the client thread stalled, an
> > > > > > >>>>>>>> hbase:meta region split was happening. So what went wrong? If
> > > > > > >>>>>>>> there is a split, why should the hconnection thread stall? Did
> > > > > > >>>>>>>> our client configuration changes cause this? Once again, these
> > > > > > >>>>>>>> are the client-related changes we made:
> > > > > >>>>>>>>
> > > > > >>>>>>>> hbase.client.retries.number => 5
> > > > > >>>>>>>> zookeeper.recovery.retry => 0
> > > > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > > > >>>>>>>> hbase.rpc.timeout => 30000.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Is zk timeout too low?
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan
> > > > > > >>>>>>>> <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > > >>>>>>>>> When you started your client server, was the META table
> > > > > > >>>>>>>>> assigned? Maybe something happened around that time and the
> > > > > > >>>>>>>>> client app was just waiting on the meta table to be assigned.
> > > > > > >>>>>>>>> It would have retried. Can you check the logs?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> The good part here is that the standalone client was
> > > > > > >>>>>>>>> successful, which means new clients were able to talk to the
> > > > > > >>>>>>>>> server, and hence restarting your client solved your problem.
> > > > > > >>>>>>>>> It may be difficult to troubleshoot the exact issue with the
> > > > > > >>>>>>>>> limited info, but if your client app stalls regularly, it is
> > > > > > >>>>>>>>> better to troubleshoot your app and the way it accesses the
> > > > > > >>>>>>>>> server.
> > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR
> > > > > > >>>>>>>>> <praneesh.san...@gmail.com> wrote:
> > > > > >>>>>>>>>
> > > > > > >>>>>>>>>> The client connection was in a stalled state. But only one
> > > > > > >>>>>>>>>> hconnection thread was found in our thread dump, and it was
> > > > > > >>>>>>>>>> waiting indefinitely in the BoundedCompletionService.take
> > > > > > >>>>>>>>>> call. Meanwhile we ran a standalone test program, which was
> > > > > > >>>>>>>>>> successful.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Once we restarted the client server, the problem got
> > > > > > >>>>>>>>>> resolved.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> The basic doubt is: when the hconnection thread stalled, why
> > > > > > >>>>>>>>>> did the HBase client fail to create any more hconnections
> > > > > > >>>>>>>>>> (max pool size was 10)? And if there was a problem with the
> > > > > > >>>>>>>>>> table/meta regions, how come the test program succeeded?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Regards,
> > > > > >>>>>>>>>> Praneesh
> > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan
> > > > > > >>>>>>>>>> <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>> Can you tell us more? Is your client not working at all,
> > > > > > >>>>>>>>>>> i.e. stalled? Or are you seeing some results but finding it
> > > > > > >>>>>>>>>>> slower than you expected?
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> What type of workload are you running? Are all the tables
> > > > > > >>>>>>>>>>> healthy? Are you able to read or write to them individually
> > > > > > >>>>>>>>>>> using the hbase shell?
> > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR
> > > > > > >>>>>>>>>>> <praneesh.san...@gmail.com> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Hi Ram,
> > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days
> > > > > > >>>>>>>>>>>> with low load; once we enabled it for high load we
> > > > > > >>>>>>>>>>>> immediately faced this issue.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>> Praneesh.
> > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan
> > > > > > >>>>>>>>>>>> <ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Is your cluster in working condition? Can you see if META
> > > > > > >>>>>>>>>>>>> has been assigned properly? If the META table is not
> > > > > > >>>>>>>>>>>>> initialized and opened then your client thread will hang.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Regards
> > > > > >>>>>>>>>>>>> Ram
> > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR
> > > > > > >>>>>>>>>>>>> <praneesh.san...@gmail.com> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> We are using HBase-1.0.0. We are also facing the same
> > > > > > >>>>>>>>>>>>>> issue: the client connection thread is waiting at
> > > > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Any help is appreciated.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>>>> Praneesh
> > > > > >>
> > > > >
> > > >
> > >
> >
>
