bq. RegionStates: THIS SHOULD NOT HAPPEN: unexpected { ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW
Looks like the above wouldn't have happened if you were using 0.98.11+. See HBASE-12958.

On Wed, Feb 24, 2016 at 6:39 PM, Heng Chen <heng.chen.1...@gmail.com> wrote:

> I picked up some logs from master.log about one region,
> "ad283942aff2bba6c0b94ff98a904d1a":
>
> 2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491] master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:25:40,472 WARN [MASTER_SERVER_OPERATIONS-dx-common-hmaster1-online:60000-0] master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:34:24,769 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:39:24,768 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:44:24,768 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:45:37,749 DEBUG [FifoRpcScheduler.handler1-thread-10] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:49:24,769 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:54:24,768 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:59:24,768 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 17:04:24,769 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 17:09:24,768 DEBUG [dx-common-hmaster1-online,60000,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
>
> 2016-02-25 10:05 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:
>
> > bq. two regions were in transition
> >
> > Can you pastebin the related server logs w.r.t. these two regions so that
> > we can get more clues?
> >
> > For #2, please see http://hbase.apache.org/book.html#big.cluster.config
> >
> > For #3, please see
> > http://hbase.apache.org/book.html#_running_multiple_workloads_on_a_single_cluster
> >
> > On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <heng.chen.1...@gmail.com> wrote:
> >
> > > The story is that I ran one MR job on my production cluster (0.98.6);
> > > it needed to scan one table during the map phase.
> > > Because of the heavy load from the job, all my RSes crashed due to OOM.
> > >
> > > After I restarted all the RSes, I found a problem.
> > >
> > > All regions were reopened on one RS, and the balancer could not run
> > > because two regions were in transition. The cluster was stuck for a long
> > > time until I restarted the master.
> > >
> > > 1. Why did this happen?
> > >
> > > 2. If the cluster has a lot of regions, how should the cluster be
> > > restarted after all the RSes crash? If the RSes are restarted one by one,
> > > OOM may happen again, because one RS has to hold all the regions, and it
> > > will take a long time.
> > >
> > > 3. Is it possible to give each table a request quota, so that when one
> > > table is requested heavily it has no impact on the other tables in the
> > > cluster?
> > >
> > > Thanks
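(For anyone searching the archives with the same symptoms: the usual operational workarounds look roughly like the sketch below. This is a sketch, not a prescription; flag and command availability vary by HBase version, so verify each one against the reference guide for your release before running it.)

```shell
# Inspect regions stuck in transition (RIT) and other inconsistencies:
hbase hbck

# On 0.98/1.x, hbck can force-reassign regions whose state the master has
# lost track of (read its report carefully before applying any -fix option):
hbase hbck -fixAssignments

# Once the RITs are cleared, re-enable and trigger the balancer from the
# hbase shell:
echo "balance_switch true" | hbase shell
echo "balancer" | hbase shell

# Re: question 3 -- per-table request throttling was added in HBase 1.1
# (HBASE-11598), so it is not available on 0.98; in the hbase shell there:
#   set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '1000req/sec'
```

Re: question 2, the `bin/rolling-restart.sh` and `bin/graceful_stop.sh` scripts shipped with HBase handle staged restarts, moving regions off a server before stopping it so that no single RS is left holding everything.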