[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

Ted Yu (JIRA) Mon, 27 Jun 2016 20:15:08 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352271#comment-15352271
 ]


Ted Yu commented on HBASE-16052:
--------------------------------

>From https://builds.apache.org/job/PreCommit-HBASE-Build/2383/console :
{code}
|  -1  |           unit  |  92m 21s   | hbase-server in the patch failed. 
|  +1  |     asflicense  |  0m 17s    | Patch does not generate ASF License 
|      |                 |            | warnings.
|      |                 |  135m 14s  | 


|| Subsystem || Report/Notes ||
============================================================================
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12813817/HBASE-16052-v3-branch-1.patch
 |
| JIRA Issue | HBASE-16052 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 424b789 |
| Default Java | 1.7.0_80 |
| Multi-JDK versions |  /home/jenkins/tools/java/jdk1.8.0:1.8.0 
/home/jenkins/jenkins-slave/tools/hudson.model.JDK/JDK_1.7_latest_:1.7.0_80 |
| findbugs | v3.0.0 |
| unit | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/2383/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/2383/console |
| Powered by | Apache Yetus 0.2.1   http://yetus.apache.org |
{code}
https://builds.apache.org/job/PreCommit-HBASE-Build/2383/testReport/TEST-org.apache.hadoop.hbase.replication.TestReplicationKillSlaveRS/xml/_init_/

Not related to the patch.

> Improve HBaseFsck Scalability
> -----------------------------
>
>                 Key: HBASE-16052
>                 URL: https://issues.apache.org/jira/browse/HBASE-16052
>             Project: HBase
>          Issue Type: Improvement
>          Components: hbck
>            Reporter: Ben Lau
>         Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

Reply via email to