Re: single RegionServer stuck, causing cluster to hang

2014-08-24 Thread Johannes Schaback
Great. Thank you. Checking out now. I hope I get everything assembled. I'll keep you posted. On Sun, Aug 24, 2014 at 9:57 AM, Stack wrote: > I put up patches in the issue Johannes. Hopefully the reproduced > stack overflow is the same as yours. See HBASE-11813. > St.Ack > > > On Sat, Aug 23, 2014 at 9:

Re: single RegionServer stuck, causing cluster to hang

2014-08-24 Thread Stack
I put up patches in the issue Johannes. Hopefully the reproduced stack overflow is the same as yours. See HBASE-11813. St.Ack On Sat, Aug 23, 2014 at 9:25 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > We use all plain gets and puts (sometimes batched). > > We have hbase.client.ke

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
We use all plain gets and puts (sometimes batched). We have hbase.client.keyvalue.maxsize increased to 536870912 bytes on the client. That is the only thing I can see. I am about to send you a zip file with the respective classes to your email address directly. I had probably better not post the code

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Hi Qiang, no, we don't use coprocessors. Thanks, Johannes On Sun, Aug 24, 2014 at 6:04 AM, Qiang Tian wrote: > Hi Johannes, > Do you use endpoint / coprocessor stuff? > thanks. > > > On Sun, Aug 24, 2014 at 11:51 AM, Qiang Tian wrote: > > > Hi Stack, > > I think you are right. the multiple q

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Hi Johannes, Do you use endpoint / coprocessor stuff? thanks. On Sun, Aug 24, 2014 at 11:51 AM, Qiang Tian wrote: > Hi Stack, > I think you are right. the multiple queue change was introduced by > HBASE-11355(0.98.4). if there is only 1 queue, the stuck will not happen. > some handlers are gon

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Hi Stack, I think you are right. The multiple queue change was introduced by HBASE-11355 (0.98.4). If there is only 1 queue, the hang will not happen. Some handlers are gone, but some are still left to service requests (all handlers gone looks like a rare case)... so the problem might have been there for s

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
On Sat, Aug 23, 2014 at 4:06 PM, Stack wrote: > ... > If you were looking for something to try, set > hbase.ipc.server.callqueue.handler.factor > to 0. Multiple queues is what is new here. It should not make a difference > but... > > Hmm. Ignore above I'd say. I can't see how it would trigger

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
I am having trouble reproducing the stack overflow. Some particular response is triggering it (the code here has been around a while). Any particulars on how your client is accessing hbase? Anything unusual? If you were looking for something to try, set hbase.ipc.server.callqueue.handler.factor t

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Thank you. From the proposed resolution I imagine that the RS would then die in case of a handler error. So the question remains what error originally occurred in the handler. The log of the entire lifecycle of the RS (http://schabby.de/wp-content/uploads/2014/08/filtered.txt) d

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Andrew Purtell
On Sat, Aug 23, 2014 at 12:11 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.ha

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Ted Yu
Can you show the complete stack trace for the StackOverflowError (using pastebin)? Thanks On Aug 23, 2014, at 12:11 PM, Johannes Schaback wrote: > Hi, > > we had to reduce load on the cluster yesterday night which reduced the > frequency of the phenomenon. That is why I could not get a jsta

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Hi, we had to reduce load on the cluster last night, which reduced the frequency of the phenomenon. That is why I could not get a jstack dump yet: it has not occurred for a couple of hours. We will now get the load back up hoping to trigger it again. Yes, I cut out the properties from the

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
Anything in your .out that could help explain our losing handlers if you can't find anything in the logs? You did the 'snipp' in the below, right Johannes? RS Configuration: === [snipp] no fancy stuff, all default, except absolute necessary

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Did you set hbase.ipc.server.callqueue.handler.factor? It looks like there are 3 queues, and the handlers on queue 1 are all gone, as Stack mentioned. A jstack and the region server log (via pastebin) would help. On Sat, Aug 23, 2014 at 7:02 AM, Stack wrote: > On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < >
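
[Editor's sketch] One way to check from a thread dump whether a call queue has lost all of its handlers is to group the surviving handler threads by the queue index in their names. This is a minimal sketch, not an HBase tool: the function names and the sample dump are hypothetical, the third sample line is made up for illustration (only its thread name matches one quoted in this thread), and the queue count of 3 is taken from the message above.

```python
import re

# Handler thread-name pattern as quoted in this thread's dumps and stack traces.
HANDLER_RE = re.compile(r"defaultRpcServer\.handler=(\d+),queue=(\d+),port=\d+")

def live_handlers_by_queue(dump_text, queue_count):
    """Map each queue index to the sorted handler ids still present in the dump."""
    by_queue = {q: set() for q in range(queue_count)}
    for match in HANDLER_RE.finditer(dump_text):
        handler, queue = int(match.group(1)), int(match.group(2))
        by_queue.setdefault(queue, set()).add(handler)
    return {q: sorted(handlers) for q, handlers in by_queue.items()}

def starved_queues(dump_text, queue_count):
    """Return the queues that have no live handler thread left."""
    grouped = live_handlers_by_queue(dump_text, queue_count)
    return [q for q, handlers in grouped.items() if not handlers]

# Hypothetical sample: the "Thread 75" line is invented for illustration.
sample_dump = """\
Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020):
Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020):
Thread 75 (defaultRpcServer.handler=5,queue=2,port=60020):
"""

if __name__ == "__main__":
    print(live_handlers_by_queue(sample_dump, 3))
    print("queues without handlers:", starved_queues(sample_dump, 3))
```

A queue that shows up in the second list still accepts calls but has nothing left to drain them, which would match the symptom described in this thread.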

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > ... > I grep'ed "defaultRpcServer.handler=" on the log from that particular RS. > The > RS started at 15:35. After that, the handlers > > 6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23 > > make an appear

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Johannes Schaback
I haven't managed to pull a jstack of a stuck node yet (I will do that first thing in the morning). But... I just killed and restarted a RS and called /dump right away to see whether the defaultRpcServer.handler instances are present. And yes, they are. From 0 to 29, even in consecutive order. I ki

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Are we losing handler threads, the workers that take from the pool we are blocked on? The attached thread dump has ten with non-sequential numbers: Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020): Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020): Thread 91 (defaultRpcServer.h
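
[Editor's sketch] The check described above (which handler numbers are missing from a dump) can be automated. The following is a rough sketch under stated assumptions: the function name is hypothetical, the expected handler count of 30 comes from the "0 to 29" observation elsewhere in this thread, and the sample contains only the two complete thread lines quoted above.

```python
import re

# Matches handler thread names such as
# "defaultRpcServer.handler=27,queue=0,port=60020".
HANDLER_RE = re.compile(r"defaultRpcServer\.handler=(\d+),queue=\d+,port=\d+")

def missing_handlers(dump_text, expected_count):
    """Return handler ids in range(expected_count) that do not appear in the dump."""
    seen = {int(match.group(1)) for match in HANDLER_RE.finditer(dump_text)}
    return sorted(set(range(expected_count)) - seen)

# Only the two complete lines from the quoted dump; the real dump lists more.
sample_dump = """\
Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020):
Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020):
"""

if __name__ == "__main__":
    gone = missing_handlers(sample_dump, expected_count=30)
    print(f"{len(gone)} handler(s) missing: {gone}")
```

Running this against dumps taken a few minutes apart would show whether the set of missing handlers grows over time.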

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
nvm. misread. Trying to figure out why the scheduling queue is filled to the brim such that no more calls can be added/dispatched... St.Ack On Fri, Aug 22, 2014 at 12:45 PM, Stack wrote: > Are you replicating? > St.Ack > > > On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < > johannes.schab...

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Are you replicating? St.Ack On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Dear HBase-Pros, > > we face a serious issue with our HBase production cluster for two days now. > Every couple minutes, a random RegionServer gets stuck and does not pro

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Do you have a few thread dumps from a 'deaf' instance? St.Ack On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Dear HBase-Pros, > > we face a serious issue with our HBase production cluster for two days now. > Every couple minutes, a random Regio

single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Johannes Schaback
Dear HBase-Pros, we have been facing a serious issue with our HBase production cluster for two days now. Every couple of minutes, a random RegionServer gets stuck and does not process any requests. In addition, this causes the other RegionServers to freeze within a minute, which brings down the entire cluster. Sto