We are facing the same problem. Is it fixed in HBase 0.94.8 or 0.97.6? If it is fixed, we will migrate. Can someone confirm this?

Thanks,
Sandeep.
> From: nkey...@gmail.com
> Date: Thu, 13 Jun 2013 09:00:46 +0200
> Subject: Re: Handling regionserver crashes in production cluster
> To: user@hbase.apache.org
>
> Hum... So even a simple get shows the issue?
> It would be a (surprising) critical bug. Could you please try the 95.1 or
> the 94.8? Or write a unit test?
>
> Thanks,
>
> Nicolas
>
>
> On Thu, Jun 13, 2013 at 5:43 AM, kiran <kiran.sarvabho...@gmail.com> wrote:
>
> > It's a simple kill...
> > The scan uses a start row and a stop row:
> > Scan scan = new Scan(Bytes.toBytes("adidas"), Bytes.toBytes("adidas1"));
> >
> >
> > Our cluster size is 15. The load average I see in the master is 78%... It
> > is not that overloaded, but writes are happening in the cluster...
> >
> > Thanks
> > Kiran
> >
> >
> >
> > On Wed, Jun 12, 2013 at 10:49 PM, Nicolas Liochon <nkey...@gmail.com>
> > wrote:
> >
> > > Yeah, it should not block the other regions.
> > >
> > > For the region server, was it a kill -9 or a simple kill (the former
> > > triggers a recovery, the latter will close the region before stopping
> > > the process)?
> > >
> > > How do you select the scan scope? With stop/start rows?
> > > Can you share the client code you're using?
> > > What's the cluster size? Was it already very loaded before you killed the
> > > region server?
> > >
> > > Nicolas
> > >
> > >
> > >
> > > On Wed, Jun 12, 2013 at 6:11 PM, kiran <kiran.sarvabho...@gmail.com>
> > > wrote:
> > >
> > > > Yes, we killed the region server, but the datanode is still running on
> > > > the node...
> > > >
> > > > Sample test scenario: assume I have a table with pre-splits a up to z
> > > > (about 26 regions). I purposefully brought down the region server
> > > > hosting the regions with prefixes c and d. Then I used the client API
> > > > to scan data from regions with prefixes other than c and d. The
> > > > response was very slow and sometimes did not come at all.
> > > >
> > > > My doubt was: if only the regions with prefixes c and d are being
> > > > relocated or in transition, why is it affecting the regions with other
> > > > prefixes? But once the region transition is over, the response is very
> > > > fast, as expected.
> > > >
> > > >
> > > >
> > > > On Wed, Jun 12, 2013 at 8:50 PM, rajesh babu chintaguntla <
> > > > chrajeshbab...@gmail.com> wrote:
> > > >
> > > > > You can set the property below to a higher value to close more
> > > > > regions at a time.
> > > > >
> > > > > <property>
> > > > >   <name>hbase.regionserver.executor.closeregion.threads</name>
> > > > >   <value>3</value>
> > > > > </property>
> > > > >
> > > > >
> > > > > On Wed, Jun 12, 2013 at 7:38 PM, Nicolas Liochon <nkey...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > What was your test exactly? You killed -9 a region server but kept
> > > > > > the datanode alive?
> > > > > > Could you detail the queries you were doing?
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 12, 2013 at 2:10 PM, kiran <kiran.sarvabho...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > It is not possible for us to migrate to a new version immediately.
> > > > > > >
> > > > > > > @Anoop we purposefully brought down one regionserver, then we
> > > > > > > observed that the website was taking too much time to respond. We
> > > > > > > observed the pattern for about 5 min, until the regions were
> > > > > > > relocated.
> > > > > > > Also, we issued queries on our website taking care that the
> > > > > > > queries didn't fall under the regions on the regionserver we
> > > > > > > brought down.
> > > > > > >
> > > > > > > Is there any configuration workaround to mitigate it?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Kiran
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 6, 2013 at 8:27 PM, Jean-Marc Spaggiari <
> > > > > > > jean-m...@spaggiari.org> wrote:
> > > > > > >
> > > > > > > > Hi Kiran,
> > > > > > > >
> > > > > > > > Also, any chance for you to migrate to 0.94.8? There have been
> > > > > > > > hundreds of fixes since 0.94.1...
> > > > > > > >
> > > > > > > > JM
> > > > > > > >
> > > > > > > > 2013/6/6 Anoop John <anoop.hb...@gmail.com>:
> > > > > > > > > How many total RS in the cluster? You mean u cannot do any
> > > > > > > > > operation on other regions in the live cluster? It should not
> > > > > > > > > happen. Is it happening that the client ops are targeted at
> > > > > > > > > the regions which were in the dead RS (and in transition
> > > > > > > > > now)? Can u have a closer look and see? If not, pls check
> > > > > > > > > where the RS threads are getting blocked.
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Wed, Jun 5, 2013 at 10:50 PM, kiran <
> > > > > > > > > kiran.sarvabho...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Dear All,
> > > > > > > > >>
> > > > > > > > >> We have a production cluster that runs on hbase 0.94.1. The issue
> > > > > > > > >> we are facing is that whenever one regionserver goes down, the
> > > > > > > > >> cluster becomes unresponsive until all the regions are allocated
> > > > > > > > >> to other regionserver(s). The transition takes about 3-5 mins,
> > > > > > > > >> and during this time we are unable to do any client operation on
> > > > > > > > >> the cluster.
> > > > > > > > >>
> > > > > > > > >> Is there any way we can make the transition run in the
> > > > > > > > >> background?
> > > > > > > > >>
> > > > > > > > >> Also, it is acceptable for us if client operations such as scan
> > > > > > > > >> or get do not work on the rowkeys of regions in transition. But
> > > > > > > > >> they are not working on the entire cluster until all the regions
> > > > > > > > >> are moved out of transition. We can't afford 3-5 minutes of
> > > > > > > > >> downtime.
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Thank you
> > > > > > > > >> Kiran Sarvabhotla
> > > > > > > > >>
> > > > > > > > >> -----Even a correct decision is wrong when it is taken late
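
For anyone reproducing Kiran's test, it may help to confirm on the client side that a scan's row range really stays out of the regions hosted by the killed region server. Below is a minimal sketch in plain Java (no HBase dependency). It assumes 26 regions pre-split on the single letters a to z, as in Kiran's description, and that both the start row and the stop row are non-empty. `ScanRegionCheck` and `regionsTouched` are hypothetical names, not part of the HBase client API, and on a real cluster the region boundaries should be read from the cluster rather than assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the HBase client API.
class ScanRegionCheck {

    // Unsigned lexicographic comparison of two row keys (bytes compared as 0..255),
    // which is the ordering HBase applies to row keys.
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the labels of the assumed regions that the scan [startRow, stopRow)
    // overlaps. Assumed layout: the region labelled x holds rows >= x and < the next
    // letter; region "a" has no lower bound and region "z" has no upper bound.
    static List<String> regionsTouched(String startRow, String stopRow) {
        byte[] start = bytes(startRow);
        byte[] stop = bytes(stopRow);
        List<String> touched = new ArrayList<>();
        if (compareRows(start, stop) >= 0) {
            return touched; // empty or inverted range selects no rows
        }
        for (char c = 'a'; c <= 'z'; c++) {
            boolean startsBeforeRegionEnd =
                c == 'z' || compareRows(start, bytes(String.valueOf((char) (c + 1)))) < 0;
            boolean stopsAfterRegionStart =
                c == 'a' || compareRows(stop, bytes(String.valueOf(c))) > 0;
            if (startsBeforeRegionEnd && stopsAfterRegionStart) {
                touched.add(String.valueOf(c));
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        System.out.println(regionsTouched("adidas", "adidas1")); // prints [a]
        System.out.println(regionsTouched("cat", "dog"));        // prints [c, d]
    }
}
```

Under these assumptions, the scan from "adidas" to "adidas1" only needs the region labelled a, so it should not depend on the regions with prefixes c and d that were in transition.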