regionserver stuck

2015-12-18 Thread wangkai
Hi, all: our HBase cluster often exhibits a strange phenomenon: we can't query or execute a bulkload when one machine goes down, as if the whole HBase cluster had crashed. So we took a look at the jstack of the regionserver, and we found some threads were blocked, waiting for a lock. Here is the jstack of
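
For anyone hitting the same symptom: jstack marks such threads BLOCKED and prints a "waiting to lock <0x...>" line for each one, so they are easy to pull out of a dump. A minimal sketch, with <rs-pid> as a placeholder for the regionserver process id:

    # dump the regionserver JVM, then list the blocked threads
    jstack -l <rs-pid> > rs.jstack
    grep -B 3 'waiting to lock' rs.jstack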

Re: single RegionServer stuck, causing cluster to hang

2014-08-24 Thread Johannes Schaback
Great, thank you. Checking it out now. I hope I get everything assembled; I'll keep you posted. On Sun, Aug 24, 2014 at 9:57 AM, Stack wrote: > I put up patches in the issue, Johannes. Hopefully the reproduced > stack overflow is the same as yours. See HBASE-11813. > St.Ack > > > On Sat, Aug 23, 2014 at 9:

Re: single RegionServer stuck, causing cluster to hang

2014-08-24 Thread Stack
I put up patches in the issue, Johannes. Hopefully the reproduced stack overflow is the same as yours. See HBASE-11813. St.Ack On Sat, Aug 23, 2014 at 9:25 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > We use plain gets and puts (sometimes batched). > > We have hbase.client.ke

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
We use plain gets and puts (sometimes batched). We have hbase.client.keyvalue.maxsize increased to 536870912 bytes on the client. That is the only thing I can see. I am about to send you a zip file with the respective classes to your email address directly. I'd probably better not post the code
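
For context, hbase.client.keyvalue.maxsize is a client-side limit, so it can be raised either in the client's hbase-site.xml or programmatically. A minimal sketch of the setting described above (not the poster's actual code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientConf {
      public static Configuration make() {
        Configuration conf = HBaseConfiguration.create();
        // 536870912 bytes = 512 MB, the value mentioned in this thread
        conf.set("hbase.client.keyvalue.maxsize", "536870912");
        return conf;
      }
    }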

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Hi Qiang, no, we don't use coprocessors. Thanks, Johannes On Sun, Aug 24, 2014 at 6:04 AM, Qiang Tian wrote: > Hi Johannes, > Do you use endpoint / coprocessor stuff? > thanks. > > > On Sun, Aug 24, 2014 at 11:51 AM, Qiang Tian wrote: > > > Hi Stack, > > I think you are right. The multiple-q

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Hi Johannes, do you use endpoint / coprocessor stuff? Thanks. On Sun, Aug 24, 2014 at 11:51 AM, Qiang Tian wrote: > Hi Stack, > I think you are right. The multiple-queue change was introduced by > HBASE-11355 (0.98.4). If there is only 1 queue, the hang will not happen. > Some handlers are gon

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Hi Stack, I think you are right. The multiple-queue change was introduced by HBASE-11355 (0.98.4). If there is only 1 queue, the hang will not happen. Some handlers are gone but some are still left to service the requests (all handlers gone looks like a rare case)... so the problem might have been there for s
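
A rough sketch of the behaviour HBASE-11355 introduced, simplified from the description in this thread rather than taken from the actual RpcExecutor code: the number of call queues is derived from the handler count and the factor, and each handler binds to one queue, so losing every handler of one queue strands the calls dispatched to it.

    // Assumed derivation: queues = max(1, round(handlerCount * factor)).
    int handlerCount = 30;   // defaultRpcServer.handler=0..29, as seen below
    float factor = 0.1f;     // hbase.ipc.server.callqueue.handler.factor
    int numQueues = Math.max(1, Math.round(handlerCount * factor)); // -> 3
    // With factor = 0 there is a single shared queue, so a few dead
    // handlers cannot strand an entire queue.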

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
On Sat, Aug 23, 2014 at 4:06 PM, Stack wrote: > ... > If you were looking for something to try, set > hbase.ipc.server.callqueue.handler.factor > to 0. Multiple queues are what is new here. It should not make a difference > but... > > Hmm. Ignore the above, I'd say. I can't see how it would trigger

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
I am having trouble reproducing the stack overflow. Some particular response is triggering it (the code here has been around a while). Any particulars on how your client is accessing HBase? Anything unusual? If you were looking for something to try, set hbase.ipc.server.callqueue.handler.factor t
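
The property Stack suggests is a server-side setting; assuming it goes in hbase-site.xml on each regionserver (with a restart to pick it up), the override would look like:

    <property>
      <name>hbase.ipc.server.callqueue.handler.factor</name>
      <value>0</value>
    </property>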

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Thank you. From the proposed resolution I imagine that the RS would then die in case of a handler error. So the question remains what error originally occurred in the handler in the first place. The log of the entire lifecycle of the RS (http://schabby.de/wp-content/uploads/2014/08/filtered.txt) d

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Andrew Purtell
On Sat, Aug 23, 2014 at 12:11 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.ha
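
For readers landing here later: the trace shows a CellScanner whose advance() is recursing; HBASE-11813 tracked the fix. A hypothetical stand-alone sketch of that failure shape (not HBase's actual code):

    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    interface Scanner { boolean advance(); }

    class RecursiveAdvanceSketch {
      // One stack frame per empty row: a long run of empty rows
      // overflows the stack, as in the trace above.
      static Scanner scannerOver(Iterator<List<String>> rows) {
        return new Scanner() {
          Iterator<String> cells = Collections.emptyIterator();
          public boolean advance() {
            if (cells.hasNext()) { cells.next(); return true; }
            if (!rows.hasNext()) return false;
            cells = rows.next().iterator();
            return advance(); // recursion instead of a loop -> StackOverflowError
          }
        };
      }
    }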

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Ted Yu
Can you show the complete stack trace for the StackOverflowError (using pastebin)? Thanks On Aug 23, 2014, at 12:11 PM, Johannes Schaback wrote: > Hi, > > we had to reduce the load on the cluster last night, which reduced the > frequency of the phenomenon. That is why I could not get a jsta

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Johannes Schaback
Hi, we had to reduce the load on the cluster last night, which reduced the frequency of the phenomenon. That is why I could not get a jstack dump yet; it has not occurred for a couple of hours. We will now bring the load back up, hoping to trigger it again. Yes, I cut out the properties from the

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Stack
Anything in your .out file that could help explain our losing handlers, if you can't find anything in the logs? You did the 'snipp' in the below, right Johannes? RS Configuration: === [snipp] no fancy stuff, all default, except the absolutely necessary

Re: single RegionServer stuck, causing cluster to hang

2014-08-23 Thread Qiang Tian
Did you set hbase.ipc.server.callqueue.handler.factor? It looks like there are 3 queues, and the handlers on queue 1 are all gone, as Stack mentioned. A jstack and a pastebin of the region server log would help. On Sat, Aug 23, 2014 at 7:02 AM, Stack wrote: > On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < >

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > ... > I grep'ed for "defaultRpcServer.handler=" in the log from that particular RS. > The RS started at 15:35. After that, the handlers > > 6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23 > > make an appear
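
To repeat Johannes's check, assuming the regionserver log is at hand (path illustrative), the distinct handler numbers that ever logged anything can be listed with:

    grep -o 'defaultRpcServer.handler=[0-9]*' regionserver.log | sort -u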

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Johannes Schaback
I haven't managed to pull a jstack of a stuck node yet (I will do that first thing in the morning). But... I just killed and restarted a RS and called /dump right away to see whether the defaultRpcServer.handler instances are present. And yes, they are, from 0 to 29, even in consecutive order. I ki
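
The /dump servlet mentioned here is served by the regionserver's embedded info server; assuming the default regionserver info port for this era of HBase (60030), it can be fetched with:

    curl http://<rs-host>:60030/dump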

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Are we losing handler threads, the workers that take from the pool we are blocked on? The attached thread dump has ten with non-sequential numbers: Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020): Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020): Thread 91 (defaultRpcServer.h

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
nvm, misread. Trying to figure out why the scheduling queue is filled to the brim such that no more calls can be added/dispatched... St.Ack On Fri, Aug 22, 2014 at 12:45 PM, Stack wrote: > Are you replicating? > St.Ack > > > On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < > johannes.schab...

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Are you replicating? St.Ack On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Dear HBase-Pros, > > we have been facing a serious issue with our HBase production cluster for two days now. > Every couple of minutes, a random RegionServer gets stuck and does not pro

Re: single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Stack
Do you have a few thread dumps from a 'deaf' instance? St.Ack On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback < johannes.schab...@visual-meta.com> wrote: > Dear HBase-Pros, > > we have been facing a serious issue with our HBase production cluster for two days now. > Every couple of minutes, a random Regio
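
A handful of dumps taken a few seconds apart is usually enough to tell a genuine hang from slow progress; a minimal sketch, with <rs-pid> again a placeholder:

    # capture three jstack dumps, 10 seconds apart, from the stuck RS
    for i in 1 2 3; do jstack -l <rs-pid> > rs-dump.$i; sleep 10; done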

single RegionServer stuck, causing cluster to hang

2014-08-22 Thread Johannes Schaback
Dear HBase-Pros, we have been facing a serious issue with our HBase production cluster for two days now. Every couple of minutes, a random RegionServer gets stuck and does not process any requests. In addition, this causes the other RegionServers to freeze within a minute, which brings down the entire cluster. Sto

Re: RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7

2014-04-29 Thread Asaf Mesika
We had this issue again in production. We had to shut down the region server. Restarting didn't help since this RS was bombarded with write requests and coprocessor exec requests, which made it open regions at a rate of about 1 region every 2 minutes. Do you think it's related to this JIRA: https://

Re: RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7

2014-02-18 Thread Stack
On Mon, Feb 17, 2014 at 1:59 AM, Asaf Mesika wrote: > Hi, > > Apparently this just happened on a staging machine as well. The common ground > between them is a failed disk (1 out of 8). > > It seems like a bug if HBase can't recover from a failed disk. Could it be > that short-circuit reads are causing it? > >

Re: RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7

2014-02-18 Thread Stack
On Mon, Feb 10, 2014 at 12:25 AM, Asaf Mesika wrote: > Hi, > > We have HBase 0.94.7 deployed in production with 54 Region Servers (Hadoop > 1). > A couple of days ago, we had an incident which made our system unusable for > several hours. > HBase started emitting WARN exceptions indefinitely, thus

Re: RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7

2014-02-17 Thread Asaf Mesika
Hi, apparently this just happened on a staging machine as well. The common ground between them is a failed disk (1 out of 8). It seems like a bug if HBase can't recover from a failed disk. Could it be that short-circuit reads are causing it? A couple of interesting exceptions: 1. The following repeated several

RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7

2014-02-10 Thread Asaf Mesika
Hi, we have HBase 0.94.7 deployed in production with 54 Region Servers (Hadoop 1). A couple of days ago, we had an incident which made our system unusable for several hours. HBase started emitting WARN exceptions indefinitely, thus failing any writes to it. Until we stopped this RS, the issue wasn't re