Re: Some problems in one accident on my production cluster

Heng Chen Wed, 24 Feb 2016 22:31:39 -0800

Thanks, Stack and Ted, for your help.

After checking the code, I think the cause is that the RS sent a split request
for the parent region and its two daughter regions, and then the RS crashed.


The master set the two daughter regions to the SPLIT_NEW state and put them
in regionsInTransition, which is held only in the master's memory.

In 0.98.11-, serverOffline does not handle the case where a region is in the
SPLIT_NEW state, so we had to restart the master.

As Ted said, HBASE-12958 fixed it.
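
To make the sequence above concrete, here is a toy Python model of it. This
is not HBase code: ToyMaster, on_split_request, server_offline, the
handle_split_new flag, and balancer_can_run are all hypothetical names, and
the cleanup in the "handled" branch is only an assumption about what handling
SPLIT_NEW could look like, not a description of the HBASE-12958 patch.

```python
from enum import Enum


class RegionState(Enum):
    SPLITTING = "SPLITTING"
    SPLIT_NEW = "SPLIT_NEW"


class ToyMaster:
    """Toy stand-in for the master's in-memory bookkeeping (not HBase code)."""

    def __init__(self, handle_split_new):
        # Hypothetical switch: does server_offline know about SPLIT_NEW?
        self.handle_split_new = handle_split_new
        # region name -> (state, hosting server); exists only in memory
        self.regions_in_transition = {}

    def on_split_request(self, server, parent, daughter_a, daughter_b):
        # The RS reports a split: parent plus two daughters enter transition.
        self.regions_in_transition[parent] = (RegionState.SPLITTING, server)
        for daughter in (daughter_a, daughter_b):
            self.regions_in_transition[daughter] = (RegionState.SPLIT_NEW, server)

    def server_offline(self, dead_server):
        # Walk a copy of the items so entries can be deleted while iterating.
        for region, (state, server) in list(self.regions_in_transition.items()):
            if server != dead_server:
                continue
            if state is RegionState.SPLIT_NEW:
                if self.handle_split_new:
                    # Assumed cleanup: forget daughters of the unfinished split.
                    del self.regions_in_transition[region]
                # Otherwise the entry is left behind in memory.
            else:
                # Simplification: any other state is cleared for reassignment.
                del self.regions_in_transition[region]

    def balancer_can_run(self):
        # The balancer is modeled as refusing to run while anything is in transition.
        return not self.regions_in_transition
```

With handle_split_new=False, a crash of the splitting RS leaves the two
daughters in regions_in_transition, which matches the two stuck regions
reported in this thread. Building a fresh ToyMaster (the analogue of a master
restart) starts again from an empty map.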

As for the "set_quota" command, it was introduced after 1.1, so I will
upgrade my cluster.

Thanks guys for your help.



2016-02-25 11:41 GMT+08:00 Stack <st...@duboce.net>:

> On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <heng.chen.1...@gmail.com>
> wrote:
>
> > The story is I ran one MR job on my production cluster (0.98.6). It needs
> > to scan one table during the map phase.
> >
> > Because of the heavy load from the job,  all my RS crashed due to OOM.
> >
> >
> Really big rows? If so, can you narrow your scan or ask for partial rows
> (IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where
> scanning does 'chunking'?
>
>
> > After I restarted all RS, I found one problem.
> >
> > All regions were reopened on one RS,
>
>
>
> ... the others took a while to check in? That's the usual reason one RS
> gets a bunch of regions.
>
>
>
> > and the balancer could not run because two regions were in transition.
> > The cluster was stuck for a long time until I restarted the master.
> >
> > 1.  Why did this happen?
> >
> Would need logs. I see you posted some later. Good to go to the server
> that was doing the split and look in log around the time of split fail.
>
>
> > 2.  If the cluster has a lot of regions, how should I restart it after
> > all RS crash? If I restart RS one by one, OOM may happen because one RS
> > has to hold all regions, and it will take a long time.
> >
> >
> Best to restart cluster in this case (after figuring why others took a
> while to check in... look at their logs around startup time to see why they
> dally)
>
>
> > 3.  Is it possible to give each table a request quota, so that when one
> > table is requested heavily, it has no impact on other tables in the
> > cluster?
> >
> >
> Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
> knows.
>
> St.Ack
>
>
>
> >
> > Thanks
> >
>
