Load balancer falls into pathological state if one server under average - slop;
endless churn
---------------------------------------------------------------------------------------------
Key: HBASE-2617
URL: https://issues.apache.org/jira/browse/HBASE-2617
Project: HBase
Issue Type: Bug
Reporter: stack
Priority: Blocker
Fix For: 0.20.5
I'm looking at a 0.20.4 cluster of 80 fast machines. It runs fine for a while
and then falls over into crazy balancing churn (My view on logs is sporadic but
have a log before me where this is happening). Things I see that seem to be of
0.20.4 particularly:
+ We don't reach an equilibrium or at least it takes so long, its as though it
wasn't every going to happen
+ Master log filled w/ open, close of one or two regions usually the same ones
over and over (The Regions that are candidates to close are provided by the RS.
They are ordered by hash of their name. We return the top ten from this Set
every time. So, we always close the same regions all the time even if we just
opened it)
+ Often, we'll tell an RS to close a region. It will do the job. In 0.20.4 we
made it so if RS has any work at all for the master, that we return immediately
rather than wait for the reporting period to elaspse. So, on these fast
machines, it can be back near immediately if it just opened another some other
region, say. It can get assigned the region it just closed. Seems to happen
frequently enough.
For example, look at the below extract featuring a single regions life. Its
opened and closed 5 times in about 1/2 a second:
{code}
2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessRegionClose of
GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,489 INFO
org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned:
GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager:
Assigning region GenericMetaStore,139757491,1274779304880 to
a025.example.com,60020,1274744064673
2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
a025.example.com,60020,1274744064673; 1 of 1
2010-05-25 11:01:05,510 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
GenericMetaStore,139757491,1274779304880 open on 10.209.32.189:60020
2010-05-25 11:01:05,511 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
startcode=1274744064673, server=10.209.32.189:60020
2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
a025.example.com,60020,1274744064673; 1 of 2
2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessRegionClose of
GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,552 INFO
org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned:
GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager:
Assigning region GenericMetaStore,139757491,1274779304880 to
a028.example.com,60020,1274747560769
2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
a028.example.com,60020,1274747560769; 1 of 1
2010-05-25 11:01:05,578 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
GenericMetaStore,139757491,1274779304880 open on 10.209.32.185:60020
2010-05-25 11:01:05,579 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
startcode=1274747560769, server=10.209.32.185:60020
2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
a028.example.com,60020,1274747560769; 1 of 2
2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessRegionClose of
GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,606 INFO
org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned:
GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager:
Assigning region GenericMetaStore,139757491,1274779304880 to
sjc1c104.example.com,60020,1274747062601
2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
sjc1c104.example.com,60020,1274747062601; 1 of 1
2010-05-25 11:01:05,640 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
GenericMetaStore,139757491,1274779304880 open on 10.209.42.181:60020
2010-05-25 11:01:05,641 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
startcode=1274747062601, server=10.209.42.181:60020
2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
sjc1c104.example.com,60020,1274747062601; 1 of 4
2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessRegionClose of
GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,730 INFO
org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned:
GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager:
Assigning region GenericMetaStore,139757491,1274779304880 to
sjc1c091.example.com,60020,1274747056415
2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
sjc1c091.example.com,60020,1274747056415; 1 of 1
2010-05-25 11:01:05,752 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
GenericMetaStore,139757491,1274779304880 open on 10.209.42.238:60020
2010-05-25 11:01:05,752 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
startcode=1274747056415, server=10.209.42.238:60020
2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
sjc1c091.example.com,60020,1274747056415; 1 of 2
2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessRegionClose of
GenericMetaStore,139757491,1274779304880, false, reassign: true
2010-05-25 11:01:05,780 INFO
org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned:
GenericMetaStore,139757491,1274779304880
2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager:
Assigning region GenericMetaStore,139757491,1274779304880 to
sjc1a003.example.com,60020,1274747057557
2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
sjc1a003.example.com,60020,1274747057557; 1 of 1
2010-05-25 11:01:05,828 INFO
org.apache.hadoop.hbase.master.RegionServerOperation:
GenericMetaStore,139757491,1274779304880 open on 10.209.32.148:60020
2010-05-25 11:01:05,829 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
startcode=1274747057557, server=10.209.32.148:60020
{code}
The culprit seems to be the code that wants to bring up underloaded
regionservers up to average. That and something about the lightly loaded
servers math that is off.
I'm marking this as a blocker. I'm not sure why its not more common. There
were some issues on this cluster regards disks filling but though such an event
may have provoked the issue, we should have evened out eventually.
Making this a blocker on 0.20.5. Need to fix it for this user at least.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.