[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.

[email protected] (Commented) (JIRA) Wed, 25 Jan 2012 09:27:11 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193139#comment-13193139
 ]


[email protected] commented on HBASE-5128:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3435/
-----------------------------------------------------------

(Updated 2012-01-25 17:24:41.277326)


Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and Jean-Daniel 
Cryans.


Changes
-------

This version includes updates after testing against real clusters that were 
online but idle, with real induced corruptions.  This hbck was tested 
successfully against region servers from the apache/0.90 branch plus this 
patch, and against region servers on cdh3u2 (a 0.90.4-based hbase without the 
new offline method).

I'm going to post usage description and images I've created to explain this 
better on the JIRA.

High level changes in this rev.
- hbck now wraps calls to the offline method and will use unassign if the 
target region server does not support offline.
- Restructured hdfs integrity repairs into more phases.  When compound 
problems were present we'd get into a loop where orphan repair would cause new 
overlaps on a subsequent integrity repair iteration.  This new approach should 
be deterministic.  The new phases are 1) find hdfs holes and patch them (post 
condition: no more holes), 2) adopt orphan hdfs regions (post condition: no 
orphan data in hdfs), 3) reload and fix overlaps (precondition: no holes but 
overlaps possible; post condition: no overlaps).  Previously integrity repairs 
would iterate, doing all three until they converged (but this didn't always 
happen in practice!).
- Added more command line options that allow this hbck to attempt only certain 
repairs (which is necessary to get overlap repairs to work more 
deterministically, and is needed to get hbases that do not support offline to 
converge).
- Added a few more test cases for new corruptions.
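
The phase ordering described in the list above can be sketched roughly as 
follows.  This is only an illustration, not code from the patch: the class 
PhasedIntegrityRepair, the Phase type and the toy problem counters are all 
hypothetical, and Java 8 lambdas are used for brevity.  The point is that each 
phase runs exactly once and must establish its post condition, instead of all 
three repairs looping until they happen to converge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not code from the patch: run repair phases once, in order. */
class PhasedIntegrityRepair {

  /** One phase: a repair action plus the post-condition it must establish. */
  static final class Phase {
    final String name;
    final Runnable repair;
    final BooleanSupplier postCondition;

    Phase(String name, Runnable repair, BooleanSupplier postCondition) {
      this.name = name;
      this.repair = repair;
      this.postCondition = postCondition;
    }
  }

  /**
   * Runs every phase exactly once and returns the names of the phases that ran.
   * Throws if a phase does not establish its post-condition, rather than looping.
   */
  static List<String> run(List<Phase> phases) {
    List<String> completed = new ArrayList<>();
    for (Phase phase : phases) {
      phase.repair.run();
      if (!phase.postCondition.getAsBoolean()) {
        throw new IllegalStateException("post-condition failed after: " + phase.name);
      }
      completed.add(phase.name);
    }
    return completed;
  }

  public static void main(String[] args) {
    // Toy problem counts standing in for real HDFS state: {holes, orphans, overlaps}.
    int[] state = {2, 1, 0};
    List<Phase> phases = new ArrayList<>();
    phases.add(new Phase("patch holes", () -> state[0] = 0, () -> state[0] == 0));
    // Adopting an orphan may introduce a new overlap, so overlaps are fixed last.
    phases.add(new Phase("adopt orphans",
        () -> { state[2] += state[1]; state[1] = 0; },
        () -> state[1] == 0));
    phases.add(new Phase("fix overlaps", () -> state[2] = 0, () -> state[2] == 0));
    System.out.println(run(phases));
  }
}
```

Because overlap repair runs last, any overlaps created by adopting orphans are 
handled in the same pass and no further iteration is needed.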

One big caveat with this rev is that it was only tested with hbase online but 
idle (no writes happening).  It was also suggested that I need to worry about 
compactions when I close regions during overlap merging (JD -- I didn't see 
anything in OnlineMerge -- why wasn't this a concern there?).  If so, I'd like 
advice on how to add guards to protect the user (is a glaring warning message 
or requiring confirmation sufficient?).  I'm going to do some initial testing 
on online and active cases, but ideally I would like that work to come in 
follow-on JIRAs.


Summary
-------

I'm posting a preliminary version that I'm currently testing on real clusters. 
The tests are flaky on the 0.90 branch (so there is something async that I 
didn't synchronize properly), and there are a few more TODOs I want to knock 
out before this is ready for full review and can be considered for committing. 
It has some problems that I need advice on.

Problem 1:

In the unit tests, I have a few cases where I fabricate new regions and try to 
force the overlapping regions to be closed. For some of these, I cannot delete 
a table after it is repaired without causing subsequent tests to fail. I think 
this is due to a few things:

1) The disable table handler uses in-memory assignment manager state, while 
delete uses the assignment information in META.
2) Currently I'm using the sneaky closeRegion that purposely doesn't go through 
the master and in turn doesn't modify in-memory state, so disable uses 
out-of-date in-memory region assignments. If I use the unassign method instead, 
it sends RIT transitions to the master, which then attempts to assign the 
region again, causing timing problems and transient states.

What is a good way to clear the HMaster's assignment manager's assignment data 
for particular regions, or to force it to re-read from META (without modifying 
the 0.90 HBases that hbck is meant to repair)?

Problem 2:

Sometimes tests fail reporting HOLE_IN_REGION_CHAIN and 
SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused 
with each other and basically something is still happening asynchronously. I 
think this is because the new region is being assigned and is still 
transitioning. Sound about right? To make the unit test deterministic, should 
hbck wait for these to settle, or should just the unit test wait?
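
Whichever side ends up waiting, the wait itself would probably be a bounded 
poll.  A minimal sketch, assuming the caller can supply some "settled" check 
(for example "no regions in transition"); the class WaitForSettle and its 
waitFor method are hypothetical and are not part of hbck or 
HBaseTestingUtility.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; the "settled" condition is supplied by the caller. */
class WaitForSettle {

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition held, false on timeout or interruption.
   * The condition is always checked at least once.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt for the caller and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in for an assignment that settles after a few polls.
    int[] polls = {0};
    boolean settled = waitFor(1000, 10, () -> ++polls[0] >= 3);
    System.out.println("settled=" + settled + " after " + polls[0] + " polls");
  }
}
```

Returning false on timeout lets a unit test fail with a clear message rather 
than proceeding against regions that are still transitioning.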


This addresses bug HBASE-5128.
    https://issues.apache.org/jira/browse/HBASE-5128


Diffs (updated)
-----

  src/main/java/org/apache/hadoop/hbase/ipc/HMasterInterface.java c56b3a6 
  src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 9520b95 
  src/main/java/org/apache/hadoop/hbase/master/HMaster.java f7ad064 
  src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d 
  src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b 
  src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2 
  src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION 
  src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 7138d63 
  src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57 
  src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsckComparator.java 2c4a79e 
  src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8 
  src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151 

Diff: https://reviews.apache.org/r/3435/diff


Testing
-------

All unit tests pass sometimes.  Some fail sometimes (generally the cases that 
fabricate new regions).  

Not ready for commit.


Thanks,

jmhsieh


                
> [uber hbck] Enable hbck to automatically repair table integrity problems as 
> well as region consistency problems while online.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5128
>                 URL: https://issues.apache.org/jira/browse/HBASE-5128
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbck
>    Affects Versions: 0.90.5, 0.92.0
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>
> The current (0.90.5, 0.92.0rc2) versions of hbck detect most region 
> consistency and table integrity invariant violations.  However, with '-fix' 
> it can only automatically repair region consistency cases having to do with 
> deployment problems.  This updated version should be able to handle all cases 
> (including a new orphan regiondir case).  When complete it will likely 
> deprecate the OfflineMetaRepair tool and subsume several open META-hole 
> related issues.
> Here's the approach (from the comment at the top of the new version of the 
> file).
> {code}
> /**
>  * HBaseFsck (hbck) is a tool for checking and repairing region consistency
>  * and table integrity.
>  *
>  * Region consistency checks verify that META, region deployment on region
>  * servers and the state of data in HDFS (.regioninfo files) all are in
>  * accordance.
>  *
>  * Table integrity checks verify that all possible row keys can resolve to
>  * exactly one region of a table.  This means there are no individual
>  * degenerate or backwards regions; no holes between regions; and that there
>  * are no overlapping regions.
>  *
>  * The general repair strategy works in these steps.
>  * 1) Repair Table Integrity on HDFS. (merge or fabricate regions)
>  * 2) Repair Region Consistency with META and assignments
>  *
>  * For table integrity repairs, the tables' region directories are scanned
>  * for .regioninfo files.  Each table's integrity is then verified.  If there
>  * are any orphan regions (regions with no .regioninfo files), or holes, new
>  * regions are fabricated.  Backwards regions are sidelined as well as empty
>  * degenerate (endkey==startkey) regions.  If there are any overlapping
>  * regions, a new region is created and all data is merged into the new
>  * region.
>  *
>  * Table integrity repairs deal solely with HDFS and can be done offline --
>  * the hbase region servers or master do not need to be running.  This phase
>  * can be used to completely reconstruct the META table in an offline fashion.
>  *
>  * Region consistency requires three conditions -- 1) valid .regioninfo file
>  * present in an hdfs region dir,  2) valid row with .regioninfo data in META,
>  * and 3) a region is deployed only at the regionserver that it was assigned
>  * to.
>  *
>  * Region consistency requires hbck to contact the HBase master and region
>  * servers, so the connect() must first be called successfully.  Much of the
>  * region consistency information is transient and less risky to repair.
>  */
> {code}
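
The table integrity invariant in the comment above (every row key resolves to 
exactly one region) can be illustrated with a small check over sorted key 
ranges.  This is a sketch under stated assumptions, not the HBaseFsck 
implementation: the class RegionChainCheck, its Region type and the problem 
strings are hypothetical, keys are modelled as Strings compared 
lexicographically rather than as byte arrays, and an empty string stands for 
the start or end of the table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a table integrity check over [start, end) key ranges. */
class RegionChainCheck {

  static final class Region {
    final String start; // "" means start of table
    final String end;   // "" means end of table

    Region(String start, String end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Returns a description of each integrity problem; an empty list means the chain is clean. */
  static List<String> check(List<Region> regions) {
    List<String> problems = new ArrayList<>();
    List<Region> valid = new ArrayList<>();
    for (Region r : regions) {
      if (!r.end.isEmpty() && r.start.equals(r.end)) {
        problems.add("DEGENERATE [" + r.start + "," + r.end + ")");
      } else if (!r.end.isEmpty() && r.start.compareTo(r.end) > 0) {
        problems.add("BACKWARDS [" + r.start + "," + r.end + ")");
      } else {
        valid.add(r);
      }
    }
    valid.sort(Comparator.comparing((Region r) -> r.start));

    String covered = "";         // keyspace is covered from the table start up to this key
    boolean reachedEnd = false;  // true once some region extends to the end of the table
    for (Region r : valid) {
      if (reachedEnd || r.start.compareTo(covered) < 0) {
        problems.add("OVERLAP at " + r.start);
      } else if (r.start.compareTo(covered) > 0) {
        problems.add("HOLE [" + covered + "," + r.start + ")");
      }
      if (r.end.isEmpty()) {
        reachedEnd = true;
      } else if (r.end.compareTo(covered) > 0) {
        covered = r.end;
      }
    }
    if (!reachedEnd) {
      problems.add("HOLE [" + covered + ",)");
    }
    return problems;
  }

  public static void main(String[] args) {
    List<Region> regions = new ArrayList<>();
    regions.add(new Region("", "b"));
    regions.add(new Region("b", "d"));
    regions.add(new Region("c", "e")); // overlaps [b,d)
    regions.add(new Region("f", ""));  // leaves a hole [e,f)
    System.out.println(check(regions));
  }
}
```

A repair tool would act on each reported problem (fabricate a region for a 
hole, merge for an overlap, sideline degenerate or backwards regions); the 
sketch only detects them.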

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
