Thanks Stack and Ted for your help. After checking the code, I think the cause is that the RS sent a split request to the master (with the parent region and the two daughter regions) and then crashed.
The master set the two daughter regions to the SPLIT_NEW state and put them in regionsInTransition, which is held only in the master's memory. In 0.98.11 and earlier, serverOffline does not handle a region that is in the SPLIT_NEW state, so the regions stayed in transition and we had to restart the master. As Ted said, HBASE-12958 has fixed it.

As for the "set_quota" command, it was introduced after 1.1, so I will upgrade my cluster.

Thanks guys for your help.

2016-02-25 11:41 GMT+08:00 Stack <st...@duboce.net>:

> On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <heng.chen.1...@gmail.com>
> wrote:
>
> > The story is I run one MR job on my production cluster (0.98.6), it needs
> > to scan one table during map procedure.
> >
> > Because of the heavy load from the job, all my RS crashed due to OOM.
> >
>
> Really big rows? If so, can you narrow your scan or ask for partial rows
> (IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where
> scanning does 'chunking'?
>
> > After i restart all RS, i found one problem.
> >
> > All regions were reopened on one RS,
> >
>
> ... the others took a while to check in? Thats usual reason one RS gets a
> bunch of regions.
>
> > and balancer could not run because of
> > two regions were in transition. The cluster got in stuck a long time
> > until i restarted master.
> >
> > 1. why this happened?
> >
>
> Would need logs. I see you posted some later. Good to go to the server
> that was doing the split and look in log around the time of split fail.
>
> > 2. If cluster has a lots of regions, after all RS crash, how to restart
> > the cluster. If restart RS one by one, it means OOM may happen because one
> > RS has to hold all regions and it will cost a long time.
> >
>
> Best to restart cluster in this case (after figuring why others took a
> while to check in... look at their logs around startup time to see why they
> dally)
>
> > 3. Is it possible to make each table with some requests quotas, it means
> > when one table is requested heavily, it has no impact to other tables on
> > cluster.
> >
>
> Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
> knows.
>
> St.Ack
>
> > Thanks
> >