[ https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352271#comment-15352271 ]
Ted Yu commented on HBASE-16052:
--------------------------------

From https://builds.apache.org/job/PreCommit-HBASE-Build/2383/console :
{code}
| -1 | unit | 92m 21s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 17s | Patch does not generate ASF License warnings. |
| | | 135m 14s | |

|| Subsystem || Report/Notes ||
============================================================================
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12813817/HBASE-16052-v3-branch-1.patch |
| JIRA Issue | HBASE-16052 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 424b789 |
| Default Java | 1.7.0_80 |
| Multi-JDK versions | /home/jenkins/tools/java/jdk1.8.0:1.8.0 /home/jenkins/jenkins-slave/tools/hudson.model.JDK/JDK_1.7_latest_:1.7.0_80 |
| findbugs | v3.0.0 |
| unit | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/2383/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/2383/console |
| Powered by | Apache Yetus 0.2.1 http://yetus.apache.org |
{code}
https://builds.apache.org/job/PreCommit-HBASE-Build/2383/testReport/TEST-org.apache.hadoop.hbase.replication.TestReplicationKillSlaveRS/xml/_init_/

Not related to the patch.

> Improve HBaseFsck Scalability
> -----------------------------
>
>                 Key: HBASE-16052
>                 URL: https://issues.apache.org/jira/browse/HBASE-16052
>             Project: HBase
>          Issue Type: Improvement
>          Components: hbck
>            Reporter: Ben Lau
>         Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch,
>                      HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow,
> especially for large tables or clusters with many regions.
> This patch tries to fix the biggest bottlenecks and also includes a couple
> of bug fixes for race conditions caused by gathering and holding state about
> a live cluster that is no longer true by the time that state is used in Fsck
> processing. These race conditions cause Fsck to crash and become unusable on
> large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses,
> discarding everything but the Paths, passing the Paths to a PathFilter, and
> then having the filter look up the (previously discarded) FileStatuses of
> the paths again. This is actually worse than double I/O because the first
> lookup obtains a batch of FileStatuses while all the other lookups are
> individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen
> directly on FileStatuses
> -- This performance bug affects not only Fsck but also, to some extent,
> things like snapshots, hfile archival, etc. I didn't have time to look too
> deeply into the other affected code paths and didn't want to increase the
> scope of this ticket, so I focus mostly on Fsck and make only a few
> improvements to other code paths. The changes in this patch should, though,
> make it fairly easy to fix other code paths in later jiras if we feel other
> features are strongly impacted by this problem.
> - offlineReferenceFileRepair() is the most expensive part of Fsck (often 50%
> of Fsck runtime) and its running time scales with the number of store files,
> yet the function is completely serial
> -- Make offlineReferenceFileRepair() multithreaded
> - loadHdfsRegionDirs() uses table-level concurrency, which is a big
> bottleneck if you have one large cluster with one very large table that has
> nearly all the regions
> -- Change loadHdfsRegionDirs() to use region-level parallelism instead of
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large
> clusters with a few very large tables. On our version of 0.98 with the
> original patch, we had a moderately sized production cluster with 2 (user)
> tables and ~160k regions where HBaseFsck went from taking 18 minutes to
> 5 minutes.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
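
For readers who want to see why filtering on FileStatuses saves round trips, here is a minimal, self-contained sketch of the first problem in the description. It is not code from the patch and does not use the Hadoop API: Status and CountingFs are hypothetical stand-ins for FileStatus and a remote FileSystem, the example paths are made up, and rpcCount only models "one call = one round trip".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class FileStatusFilterSketch {

    /** Hypothetical stand-in for a FileStatus: a path plus its metadata. */
    static final class Status {
        final String path;
        final boolean isDirectory;

        Status(String path, boolean isDirectory) {
            this.path = path;
            this.isDirectory = isDirectory;
        }
    }

    /** Hypothetical stand-in for a remote file system that counts round trips. */
    static final class CountingFs {
        private final Map<String, Status> entries = new LinkedHashMap<>();
        int rpcCount = 0;

        void add(String path, boolean isDirectory) {
            entries.put(path, new Status(path, isDirectory));
        }

        /** One batched call returns the status of every entry. */
        List<Status> listStatus() {
            rpcCount++;
            return new ArrayList<>(entries.values());
        }

        /** One call per path, as a path-based filter would issue. */
        Status getFileStatus(String path) {
            rpcCount++;
            return entries.get(path);
        }
    }

    /** Old pattern: keep only the paths, then re-fetch each status in the filter. */
    static List<String> listDirsViaPathFilter(CountingFs fs) {
        List<String> paths = new ArrayList<>();
        for (Status s : fs.listStatus()) {
            paths.add(s.path); // metadata is discarded here
        }
        Predicate<String> pathFilter = p -> fs.getFileStatus(p).isDirectory;
        List<String> dirs = new ArrayList<>();
        for (String p : paths) {
            if (pathFilter.test(p)) { // one extra lookup per path
                dirs.add(p);
            }
        }
        return dirs;
    }

    /** New pattern: filter directly on the statuses from the batched call. */
    static List<String> listDirsViaStatusFilter(CountingFs fs) {
        Predicate<Status> statusFilter = s -> s.isDirectory;
        List<String> dirs = new ArrayList<>();
        for (Status s : fs.listStatus()) {
            if (statusFilter.test(s)) { // no further lookups needed
                dirs.add(s.path);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.add("/example/table/region-a", true);   // made-up paths
        fs.add("/example/table/region-b", true);
        fs.add("/example/table/.tableinfo", false);

        List<String> viaPaths = listDirsViaPathFilter(fs);
        int pathFilterCalls = fs.rpcCount;
        fs.rpcCount = 0;
        List<String> viaStatuses = listDirsViaStatusFilter(fs);
        int statusFilterCalls = fs.rpcCount;

        System.out.println("same result: " + viaPaths.equals(viaStatuses));
        System.out.println("calls: " + pathFilterCalls + " vs " + statusFilterCalls);
    }
}
```

With three entries, the path-based variant makes one batched call plus three individual lookups, while the status-based variant makes only the batched call. In this toy model the extra cost grows linearly with the number of entries, which matches the description's point that the repeated lookups are individual, sequential RPCs.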