Re: Some problems in one accident on my production cluster
Thanks Stack and Ted for your help.

After checking the code, I think the cause is this: the RS sent a split request (the parent region plus two daughter regions) to the master and then crashed. The master set the two daughter regions to the SPLITTING_NEW state and put them in regionsInTransition, which is held only in the master's memory. In versions before 0.98.11, serverOffline does not handle a region in the SPLITTING_NEW state, so the two entries were never cleared and we had to restart the master. As Ted said, HBASE-12958 has fixed it.

As for the "set_quota" command, it was only introduced in 1.1, so I will upgrade my cluster.

Thanks guys for your help.

2016-02-25 11:41 GMT+08:00 Stack:

> On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen wrote:
>
> > The story is I run one MR job on my production cluster (0.98.6), it needs
> > to scan one table during map procedure.
> >
> > Because of the heavy load from the job, all my RS crashed due to OOM.
>
> Really big rows? If so, can you narrow your scan or ask for partial rows
> (IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where
> scanning does 'chunking'?
>
> > After i restart all RS, i found one problem.
> >
> > All regions were reopened on one RS,
>
> ... the others took a while to check in? That's the usual reason one RS gets a
> bunch of regions.
>
> > and balancer could not run because of
> > two regions were in transition. The cluster got in stuck a long time
> > until i restarted master.
> >
> > 1. why this happened?
>
> Would need logs. I see you posted some later. Good to go to the server
> that was doing the split and look in log around the time of split fail.
>
> > 2. If cluster has a lots of regions, after all RS crash, how to restart
> > the cluster. If restart RS one by one, it means OOM may happen because one
> > RS has to hold all regions and it will cost a long time.
>
> Best to restart cluster in this case (after figuring why others took a
> while to check in... look at their logs around startup time to see why they
> dally)
>
> > 3. Is it possible to make each table with some requests quotas, it means
> > when one table is requested heavily, it has no impact to other tables on
> > cluster.
>
> Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
> knows.
>
> St.Ack
>
> > Thanks
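The failure mode described in this message (daughter regions left in SPLITTING_NEW after the RS dies, which blocks the balancer) can be illustrated with a small toy model. This is only a sketch: the class, method, and flag names are hypothetical and are not HBase's real code, and the `handle_splitting_new` flag merely stands in for "the dead-server cleanup knows about this state"; it does not describe the actual HBASE-12958 patch.

```python
from dataclasses import dataclass, field

# Toy model only: all names here are hypothetical, not HBase's real classes.
SPLITTING_NEW = "SPLITTING_NEW"


@dataclass
class ToyMaster:
    # region name -> (state, hosting server); kept only in "master memory"
    regions_in_transition: dict = field(default_factory=dict)
    # False mimics the behaviour described in the message for versions before 0.98.11
    handle_splitting_new: bool = False

    def on_split_request(self, daughters, server):
        """An RS reports a split: record both daughters as SPLITTING_NEW."""
        for daughter in daughters:
            self.regions_in_transition[daughter] = (SPLITTING_NEW, server)

    def server_offline(self, server):
        """Clean up in-transition regions that were hosted on a dead server."""
        for region, (state, host) in list(self.regions_in_transition.items()):
            if host != server:
                continue
            if state == SPLITTING_NEW and not self.handle_splitting_new:
                continue  # state is not handled, so the entry is left behind
            del self.regions_in_transition[region]

    def balancer_can_run(self):
        """The balancer is skipped while any region is in transition."""
        return not self.regions_in_transition


if __name__ == "__main__":
    daughters = ["ad283942aff2bba6c0b94ff98a904d1a", "ab07d6fbcef39be032ba11ca6ba252ef"]
    for handled in (False, True):
        master = ToyMaster(handle_splitting_new=handled)
        master.on_split_request(daughters, "regionserver1")
        master.server_offline("regionserver1")
        print(handled, master.balancer_can_run())
```

In the unhandled case the map can only be emptied by discarding the in-memory state, which in this toy model corresponds to restarting the master.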
Re: Some problems in one accident on my production cluster
On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen wrote:

> The story is I run one MR job on my production cluster (0.98.6), it needs
> to scan one table during map procedure.
>
> Because of the heavy load from the job, all my RS crashed due to OOM.

Really big rows? If so, can you narrow your scan or ask for partial rows (IIRC, you can do this in 0.98.x) or move up on to hbase 1.1+ where scanning does 'chunking'?

> After i restart all RS, i found one problem.
>
> All regions were reopened on one RS,

... the others took a while to check in? That's the usual reason one RS gets a bunch of regions.

> and balancer could not run because of
> two regions were in transition. The cluster got in stuck a long time
> until i restarted master.
>
> 1. why this happened?

Would need logs. I see you posted some later. Good to go to the server that was doing the split and look in log around the time of split fail.

> 2. If cluster has a lots of regions, after all RS crash, how to restart
> the cluster. If restart RS one by one, it means OOM may happen because one
> RS has to hold all regions and it will cost a long time.

Best to restart cluster in this case (after figuring why others took a while to check in... look at their logs around startup time to see why they dally)

> 3. Is it possible to make each table with some requests quotas, it means
> when one table is requested heavily, it has no impact to other tables on
> cluster.

Not sure what the state of this is in 0.98. Maybe someone closer to 0.98 knows.

St.Ack

> Thanks
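To see why "really big rows" matter for the OOM question above, here is a rough back-of-the-envelope sketch. It assumes that the scan's caching setting (`Scan.setCaching`) counts Results per scanner RPC, that a Result is a whole row unless a batch limit (`Scan.setBatch`, one reading of "partial rows") caps the cells per Result, and it ignores any other size limits. The function names and all numbers are hypothetical, not measurements from this cluster.

```python
def max_response_cells(caching, cells_per_row, batch=None):
    """Upper bound on the number of cells in one scanner response.

    Assumption: `caching` counts Results per RPC, and a Result is a whole row
    unless `batch` caps the number of cells per Result.
    """
    cells_per_result = cells_per_row if batch is None else min(cells_per_row, batch)
    return caching * cells_per_result


def max_response_bytes(caching, cells_per_row, avg_cell_bytes, batch=None):
    """Same bound expressed in bytes, for an assumed average cell size."""
    return max_response_cells(caching, cells_per_row, batch) * avg_cell_bytes


if __name__ == "__main__":
    # Hypothetical wide table: 50,000 cells per row, about 200 bytes per cell.
    whole_rows = max_response_bytes(caching=1000, cells_per_row=50_000, avg_cell_bytes=200)
    batched = max_response_bytes(caching=1000, cells_per_row=50_000, avg_cell_bytes=200, batch=100)
    print(f"whole rows: {whole_rows / 1e9:.1f} GB per response")
    print(f"batch=100:  {batched / 1e6:.1f} MB per response")
```

Under these assumptions, lowering the caching value, selecting fewer columns, or capping cells per Result each shrink what a region server has to hold for one response.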
Re: Some problems in one accident on my production cluster
bq. RegionStates: THIS SHOULD NOT HAPPEN: unexpected {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW

Looks like the above wouldn't have happened if you were using 0.98.11+. See HBASE-12958.

On Wed, Feb 24, 2016 at 6:39 PM, Heng Chen wrote:

> I picked up some logs from master.log about one region
> "ad283942aff2bba6c0b94ff98a904d1a"
>
> 2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491] master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:25:40,472 WARN [MASTER_SERVER_OPERATIONS-dx-common-hmaster1-online:6-0] master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:34:24,769 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:39:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
Re: Some problems in one accident on my production cluster
Thanks @ted, your suggestions for 2 and 3 are what I need!

2016-02-25 10:39 GMT+08:00 Heng Chen:

> I picked up some logs from master.log about one region
> "ad283942aff2bba6c0b94ff98a904d1a"
>
> 2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491] master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:25:40,472 WARN [MASTER_SERVER_OPERATIONS-dx-common-hmaster1-online:6-0] master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
> 2016-02-24 16:34:24,769 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
> 2016-02-24 16:39:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
Re: Some problems in one accident on my production cluster
I picked up some logs from master.log about one region "ad283942aff2bba6c0b94ff98a904d1a"

2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491] master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
2016-02-24 16:25:40,472 WARN [MASTER_SERVER_OPERATIONS-dx-common-hmaster1-online:6-0] master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}
2016-02-24 16:34:24,769 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:39:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:44:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:45:37,749 DEBUG [FifoRpcScheduler.handler1-thread-10] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:49:24,769 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:54:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 16:59:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 17:04:24,769 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...
2016-02-24 17:09:24,768 DEBUG [dx-common-hmaster1-online,6,1433937470611-BalancerChore] master.HMaster: Not running balancer because 2 region(s) in transition: {ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610, server=dx-common-regionserver1-online,60020,1456302268068}, ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef state=SPLITTING_NEW...

2016-02-25 10:05 GMT+08:00 Ted Yu:

> bq. two regions were in transition
>
> Can you pastebin related server logs w.r.t. these two regions so that we
> can have more clues?
>
> For #2, please see http://hbase.apache.org/book.html#big.cluster.config
>
> For #3, please see
> http://hbase.apache.org/book.html#_running_multiple_workloads_on_a_single_cluster
>
> On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen wrote:
>
> > The story is I run one MR job on my production cluster (0.98.6), it needs
> > to scan one table during map procedure.
> >
> > Because of the heavy load from the job, all my RS crashed due to OOM.
> >
> > Af
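For anyone who wants to pull the same information out of a larger master.log, the "Not running balancer" lines above can be summarized with a short script. This is a minimal sketch that assumes one log entry per line in the format shown in this message; the function name `stuck_regions` is hypothetical, and the regular expressions were written against this excerpt only.

```python
import re
from datetime import datetime

# Matches the "Not running balancer" lines shown in the log excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*"
    r"Not running balancer because (?P<n>\d+) region\(s\) in transition: (?P<rest>.*)$"
)
# Each in-transition entry looks like "{<32 hex chars> state=<STATE>".
ENTRY_RE = re.compile(r"\{(?P<region>[0-9a-f]{32})\s+state=(?P<state>[A-Z_]+)")


def stuck_regions(lines):
    """Return {region: (state, first_seen, last_seen)} from balancer-skip lines."""
    seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a balancer-skip line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        for entry in ENTRY_RE.finditer(match.group("rest")):
            region, state = entry.group("region"), entry.group("state")
            first = seen[region][1] if region in seen else ts
            seen[region] = (state, first, ts)
    return seen


if __name__ == "__main__":
    import sys

    # Usage: python stuck_regions.py < master.log
    for region, (state, first, last) in stuck_regions(sys.stdin).items():
        print(f"{region} {state} blocked balancer from {first} to {last} ({last - first})")
```

A region that stays in the same state across many consecutive balancer runs, as both regions do here, is a strong hint that it is stuck rather than merely in flight.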
Re: Some problems in one accident on my production cluster
bq. two regions were in transition

Can you pastebin related server logs w.r.t. these two regions so that we can have more clues?

For #2, please see http://hbase.apache.org/book.html#big.cluster.config

For #3, please see http://hbase.apache.org/book.html#_running_multiple_workloads_on_a_single_cluster

On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen wrote:

> The story is I run one MR job on my production cluster (0.98.6), it needs
> to scan one table during map procedure.
>
> Because of the heavy load from the job, all my RS crashed due to OOM.
>
> After i restart all RS, i found one problem.
>
> All regions were reopened on one RS, and balancer could not run because of
> two regions were in transition. The cluster got in stuck a long time
> until i restarted master.
>
> 1. why this happened?
>
> 2. If cluster has a lots of regions, after all RS crash, how to restart
> the cluster. If restart RS one by one, it means OOM may happen because one
> RS has to hold all regions and it will cost a long time.
>
> 3. Is it possible to make each table with some requests quotas, it means
> when one table is requested heavily, it has no impact to other tables on
> cluster.
>
> Thanks