Re: Some problems in one accident on my production cluster

Stack Wed, 24 Feb 2016 19:43:15 -0800

On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <heng.chen.1...@gmail.com> wrote:


> The story is that I ran one MR job on my production cluster (0.98.6); it
> needs to scan one table during the map phase.
>
> Because of the heavy load from the job, all my RSs crashed due to OOM.
>
>
Really big rows? If so, can you narrow your scan or ask for partial rows
(IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where
scanning does 'chunking'?
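
For illustration, a minimal sketch of a narrowed, bounded scan on the 0.98 client. This assumes that "partial rows" means Scan.setBatch, which caps the number of cells per Result so that a wide row comes back in pieces. The class name, the family name passed in and all the numbers are hypothetical placeholders to tune, not recommendations.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScan {
    // Builds a Scan restricted to one column family with bounded Result sizes.
    // The family name is supplied by the caller; all limits are illustrative.
    public static Scan build(String family) {
        Scan scan = new Scan();
        // Narrow the scan: read only the family the mappers actually need.
        scan.addFamily(Bytes.toBytes(family));
        // At most 100 cells per Result, so a wide row is returned in pieces.
        scan.setBatch(100);
        // Fetch 10 Results per RPC instead of a large client-side cache.
        scan.setCaching(10);
        // Ask for roughly 2 MB at most per RPC.
        scan.setMaxResultSize(2L * 1024 * 1024);
        // Full-table MR scans usually should not churn the block cache.
        scan.setCacheBlocks(false);
        return scan;
    }
}
```

A Scan built this way can be handed to TableMapReduceUtil.initTableMapperJob when setting up the job. Note that with setBatch the mapper may see the same row key across several Results, so it must not assume one Result per row.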


> After I restarted all RSs, I found one problem.
>
> All regions were reopened on one RS,



... the others took a while to check in? That's the usual reason one RS gets
a bunch of regions.



> and the balancer could not run because
> two regions were in transition. The cluster was stuck for a long time
> until I restarted the master.
>
> 1. Why did this happen?
>
Would need logs. I see you posted some later. Good to go to the server
that was doing the split and look in its log around the time the split failed.


> 2. If the cluster has a lot of regions and all RSs crash, how should the
> cluster be restarted? If RSs are restarted one by one, OOM may happen because
> one RS has to hold all regions, and it will take a long time.
>
>
Best to restart the whole cluster in this case (after figuring out why the
others took a while to check in... look at their logs around startup time to
see why they dallied).


> 3. Is it possible to give each table a request quota, so that
> when one table is requested heavily, it has no impact on other tables in the
> cluster?
>
>
Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
knows.

St.Ack



>
> Thanks
>
