On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <heng.chen.1...@gmail.com> wrote:
> The story is I run one MR job on my production cluster (0.98.6), it needs
> to scan one table during map procedure.
>
> Because of the heavy load from the job, all my RS crashed due to OOM.
>

Really big rows? If so, can you narrow your scan or ask for partial rows
(IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where
scanning does 'chunking'?

> After i restart all RS, i found one problem.
>
> All regions were reopened on one RS,

... the others took a while to check in? That's the usual reason one RS
gets a bunch of regions.

> and balancer could not run because of
> two regions were in transition. The cluster got in stuck a long time
> until i restarted master.
>
> 1. why this happened?
>

Would need logs. I see you posted some later. Good to go to the server that
was doing the split and look in the log around the time of the split
failure.

> 2. If cluster has a lots of regions, after all RS crash, how to restart
> the cluster. If restart RS one by one, it means OOM may happen because one
> RS has to hold all regions and it will cost a long time.
>

Best to restart the cluster in this case (after figuring out why the others
took a while to check in... look at their logs around startup time to see
why they dally).

> 3. Is it possible to make each table with some requests quotas, it means
> when one table is requested heavily, it has no impact to other tables on
> cluster.
>

Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
knows.

St.Ack

> Thanks