[ https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193120#comment-13193120 ]
jirapos...@reviews.apache.org commented on HBASE-5128:
------------------------------------------------------

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 586
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68919#file68919line586>
bq. >
bq. > I liked this better before :)

I probably broke this out to be easier to step debug. I can restore.

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java, line 154
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68922#file68922line154>
bq. >
bq. > No wait in case of exception. Is that by design?

nice catch.

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java, line 1083
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68921#file68921line1083>
bq. >
bq. > I think you said in the intro, that you need to check the availability of this rpc.

done in next version.

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java, line 1072
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68921#file68921line1072>
bq. >
bq. > <0.90.6?

updated to 0.90.6, with the assumption that this feature will not make it there (but hopefully into a 0.90.7).

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java, line 2275
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68921#file68921line2275>
bq. >
bq. > I know this is not new, but this ErrorReporter is used for status messages as well as error reporting. Should maybe have a different name.
bq. >
bq. > Also should messages go to STDOUT (out) and error go to STDERR (err)?

TODO -- I'll follow up on this after the next round.

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/master/HMaster.java, line 1053
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68920#file68920line1053>
bq. >
bq. > Should we add a double check here that the region is in fact offline (by checking .META.) or is that too expensive/not-needed?
bq. >
bq. > I'm thinking, once this method exists folks will eventually call it for other reasons.

Currently, we need this method to explicitly remove information from the Master's memory. In the cases where this is used, I've "directly" removed data from meta (Delete into .META.) and closed the regions on region servers directly (HRegionInterface#closeRegion). I haven't worked it out completely yet, but it probably makes sense to fix closeRegion to add a param that will remove this in-memory master state as well. I was under the gun to get something working, and now that I've accomplished this I'm definitely open to refactoring this to make it saner and to clean it up more.

bq. On 2012-01-14 05:43:38, Lars Hofhansl wrote:
bq. > src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java, line 90
bq. > <https://reviews.apache.org/r/3435/diff/2/?file=68921#file68921line90>
bq. >
bq. > Nice documentation. This tool is awesome.

thanks!

- jmhsieh

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3435/#review4384
-----------------------------------------------------------

On 2012-01-13 22:49:33, jmhsieh wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/3435/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2012-01-13 22:49:33)
bq.
bq.
bq. Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and Jean-Daniel Cryans.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. I'm posting a preliminary version that I'm currently testing on real clusters.
bq. The tests are flaky on the 0.90 branch (so there is something async that I didn't synchronize properly), and there are a few more TODOs I want to knock out before this is ready for full review and consideration for commit. It has some problems that I need advice on.
bq.
bq. Problem 1:
bq.
bq. In the unit tests, I have a few cases where I fabricate new regions and try to force the overlapping regions to be closed. For some of these, I cannot delete a table after it is repaired without causing subsequent tests to fail. I think this is due to a few things:
bq.
bq. 1) The disable table handler uses in-memory assignment manager state, while delete uses the assignment information in META.
bq. 2) Currently I'm using the sneaky closeRegion that purposely doesn't go through the master and in turn doesn't modify in-memory state, so disable uses out-of-date in-memory region assignments. If I use the unassign method instead, it sends RIT transitions to the master, which then ends up attempting to assign the region again, causing timing/transient states.
bq.
bq. What is a good way to clear the HMaster's assignment manager's assignment data for particular regions, or to force it to re-read from META? (without modifying the 0.90 HBase instances it is meant to repair).
bq.
bq. Problem 2:
bq.
bq. Sometimes tests fail reporting HOLE_IN_REGION_CHAIN and SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused with each other and basically something is still happening asynchronously. I think this is because the new region is being assigned and is still transitioning. Sound about right? To make the unit test deterministic, should hbck wait for these to settle, or should just the unit test wait?
bq.
bq.
bq. This addresses bug HBASE-5128.
bq. https://issues.apache.org/jira/browse/HBASE-5128
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. src/main/java/org/apache/hadoop/hbase/ipc/HMasterInterface.java c56b3a6
bq. src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 330a7cc
bq. src/main/java/org/apache/hadoop/hbase/master/HMaster.java 3c7b68d
bq. src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d
bq. src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b
bq. src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2
bq. src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION
bq. src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57
bq. src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8
bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildBase.java 3e8729d
bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151
bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildOverlap.java 4a09ce2
bq.
bq. Diff: https://reviews.apache.org/r/3435/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. All unit tests pass sometimes. Some fail sometimes (generally the cases that fabricate new regions).
bq.
bq. Not ready for commit.
bq.
bq.
bq. Thanks,
bq.
bq. jmhsieh
bq.
bq.

> [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-5128
> URL: https://issues.apache.org/jira/browse/HBASE-5128
> Project: HBase
> Issue Type: New Feature
> Components: hbck
> Affects Versions: 0.90.5, 0.92.0
> Reporter: Jonathan Hsieh
> Assignee: Jonathan Hsieh
>
> The current (0.90.5, 0.92.0rc2) versions of hbck detect most region consistency and table integrity invariant violations. However, with '-fix' they can only automatically repair region consistency cases having to do with deployment problems. This updated version should be able to handle all cases (including a new orphan regiondir case).
> When complete, this will likely deprecate the OfflineMetaRepair tool and subsume several open META-hole related issues.
> Here's the approach (from the comment at the top of the new version of the file).
> {code}
> /**
>  * HBaseFsck (hbck) is a tool for checking and repairing region consistency and
>  * table integrity.
>  *
>  * Region consistency checks verify that META, region deployment on
>  * region servers and the state of data in HDFS (.regioninfo files) all are in
>  * accordance.
>  *
>  * Table integrity checks verify that all possible row keys can resolve to
>  * exactly one region of a table. This means there are no individual degenerate
>  * or backwards regions; no holes between regions; and that there are no
>  * overlapping regions.
>  *
>  * The general repair strategy works in these steps.
>  * 1) Repair Table Integrity on HDFS. (merge or fabricate regions)
>  * 2) Repair Region Consistency with META and assignments
>  *
>  * For table integrity repairs, the tables' region directories are scanned
>  * for .regioninfo files. Each table's integrity is then verified. If there
>  * are any orphan regions (regions with no .regioninfo files), or holes, new
>  * regions are fabricated. Backwards regions are sidelined as well as empty
>  * degenerate (endkey==startkey) regions. If there are any overlapping regions,
>  * a new region is created and all data is merged into the new region.
>  *
>  * Table integrity repairs deal solely with HDFS and can be done offline -- the
>  * hbase region servers or master do not need to be running. This phase can be
>  * used to completely reconstruct the META table in an offline fashion.
>  *
>  * Region consistency requires three conditions -- 1) valid .regioninfo file
>  * present in an hdfs region dir, 2) valid row with .regioninfo data in META,
>  * and 3) a region is deployed only at the regionserver that it was assigned to.
>  *
>  * Region consistency requires hbck to contact the HBase master and region
>  * servers, so the connect() must first be called successfully. Much of the
>  * region consistency information is transient and less risky to repair.
>  */
> {code}
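
For readers following the table integrity description in the class comment, here is a minimal sketch of what "every row key resolves to exactly one region" could look like as a check. It is not the code from the patch: the TableIntegrityCheck class, its Region type, and the problem labels are all hypothetical, keys are plain Strings rather than HBase byte arrays, and an empty key is assumed to mean an open start or end of the table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; none of these names come from HBase or the patch. */
public class TableIntegrityCheck {

    /** Hypothetical stand-in for a region's key range; "" is an open start or end. */
    public static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        public Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }

        /** endkey == startkey, excluding the single region that spans the whole table. */
        boolean isDegenerate() {
            return !endKey.isEmpty() && startKey.equals(endKey);
        }

        /** The start key sorts after a non-open end key. */
        boolean isBackwards() {
            return !endKey.isEmpty() && startKey.compareTo(endKey) > 0;
        }
    }

    /** Returns one label per violation; an empty list means the region chain is intact. */
    public static List<String> check(List<Region> regions) {
        List<String> problems = new ArrayList<>();
        List<Region> valid = new ArrayList<>();
        for (Region r : regions) {
            if (r.isDegenerate()) {
                problems.add("DEGENERATE " + r.name);
            } else if (r.isBackwards()) {
                problems.add("BACKWARDS " + r.name);
            } else {
                valid.add(r);
            }
        }
        // Walk the remaining regions in start-key order and track how far the
        // key space has been covered so far.
        valid.sort(Comparator.comparing((Region r) -> r.startKey));
        String covered = "";          // keys below this are already covered
        boolean coveredToEnd = false; // true once a region with an open end is seen
        for (Region r : valid) {
            if (coveredToEnd || r.startKey.compareTo(covered) < 0) {
                problems.add("OVERLAP " + r.name);
            } else if (r.startKey.compareTo(covered) > 0) {
                problems.add("HOLE before " + r.name);
            }
            if (r.endKey.isEmpty()) {
                coveredToEnd = true;
            } else if (!coveredToEnd && r.endKey.compareTo(covered) > 0) {
                covered = r.endKey;
            }
        }
        if (!coveredToEnd) {
            problems.add("HOLE at end of table");
        }
        return problems;
    }
}
```

Calling TableIntegrityCheck.check(...) on the regions found on HDFS would correspond roughly to the verification part of step 1; the repairs themselves (fabricating, sidelining, or merging regions) are outside this sketch. String comparison is used only to keep the example short, whereas real row keys compare as bytes.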