[
https://issues.apache.org/jira/browse/HBASE-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack resolved HBASE-2617.
--------------------------
Hadoop Flags: [Reviewed]
Resolution: Fixed
I applied patch. Thanks for review Jon (I changed message to say 'overloaded'
again).
> Load balancer falls into pathological state if one server under average -
> slop; endless churn
> ---------------------------------------------------------------------------------------------
>
> Key: HBASE-2617
> URL: https://issues.apache.org/jira/browse/HBASE-2617
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Blocker
> Fix For: 0.20.5
>
> Attachments: 2617-v2.txt, 2617.txt
>
>
> I'm looking at a 0.20.4 cluster of 80 fast machines. It runs fine for a
> while and then falls over into crazy balancing churn (My view on logs is
> sporadic but have a log before me where this is happening). Things I see
> that seem to be of 0.20.4 particularly:
> + We don't reach an equilibrium or at least it takes so long, its as though
> it wasn't every going to happen
> + Master log filled w/ open, close of one or two regions usually the same
> ones over and over (The Regions that are candidates to close are provided by
> the RS. They are ordered by hash of their name. We return the top ten from
> this Set every time. So, we always close the same regions all the time even
> if we just opened it)
> + Often, we'll tell an RS to close a region. It will do the job. In 0.20.4
> we made it so if RS has any work at all for the master, that we return
> immediately rather than wait for the reporting period to elaspse. So, on
> these fast machines, it can be back near immediately if it just opened
> another some other region, say. It can get assigned the region it just
> closed. Seems to happen frequently enough.
> For example, look at the below extract featuring a single regions life. Its
> opened and closed 5 times in about 1/2 a second:
> {code}
> 2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,489 INFO
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region GenericMetaStore,139757491,1274779304880 to
> a025.example.com,60020,1274744064673
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
> a025.example.com,60020,1274744064673; 1 of 1
> 2010-05-25 11:01:05,510 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.189:60020
> 2010-05-25 11:01:05,511 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
> startcode=1274744064673, server=10.209.32.189:60020
> 2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager:
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
> a025.example.com,60020,1274744064673; 1 of 2
> 2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,552 INFO
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region GenericMetaStore,139757491,1274779304880 to
> a028.example.com,60020,1274747560769
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
> a028.example.com,60020,1274747560769; 1 of 1
> 2010-05-25 11:01:05,578 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.185:60020
> 2010-05-25 11:01:05,579 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
> startcode=1274747560769, server=10.209.32.185:60020
> 2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager:
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
> a028.example.com,60020,1274747560769; 1 of 2
> 2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,606 INFO
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region GenericMetaStore,139757491,1274779304880 to
> sjc1c104.example.com,60020,1274747062601
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
> sjc1c104.example.com,60020,1274747062601; 1 of 1
> 2010-05-25 11:01:05,640 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> GenericMetaStore,139757491,1274779304880 open on 10.209.42.181:60020
> 2010-05-25 11:01:05,641 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
> startcode=1274747062601, server=10.209.42.181:60020
> 2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager:
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
> sjc1c104.example.com,60020,1274747062601; 1 of 4
> 2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,730 INFO
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region GenericMetaStore,139757491,1274779304880 to
> sjc1c091.example.com,60020,1274747056415
> 2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
> sjc1c091.example.com,60020,1274747056415; 1 of 1
> 2010-05-25 11:01:05,752 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> GenericMetaStore,139757491,1274779304880 open on 10.209.42.238:60020
> 2010-05-25 11:01:05,752 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
> startcode=1274747056415, server=10.209.42.238:60020
> 2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager:
> Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from
> sjc1c091.example.com,60020,1274747056415; 1 of 2
> 2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Processing todo: ProcessRegionClose of
> GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,780 INFO
> org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as
> unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region GenericMetaStore,139757491,1274779304880 to
> sjc1a003.example.com,60020,1274747057557
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from
> sjc1a003.example.com,60020,1274747057557; 1 of 1
> 2010-05-25 11:01:05,828 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> GenericMetaStore,139757491,1274779304880 open on 10.209.32.148:60020
> 2010-05-25 11:01:05,829 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row
> GenericMetaStore,139757491,1274779304880 in region .META.,,1 with
> startcode=1274747057557, server=10.209.32.148:60020
> {code}
> The culprit seems to be the code that wants to bring up underloaded
> regionservers up to average. That and something about the lightly loaded
> servers math that is off.
> I'm marking this as a blocker. I'm not sure why its not more common. There
> were some issues on this cluster regards disks filling but though such an
> event may have provoked the issue, we should have evened out eventually.
> Making this a blocker on 0.20.5. Need to fix it for this user at least.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.