[ https://issues.apache.org/jira/browse/HBASE-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015461#comment-17015461 ]
Hudson commented on HBASE-23679:
--------------------------------

Results for branch branch-2 [build #2416 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2416/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2416//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2416//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2416//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}

> FileSystem instance leaks due to bulk loads with Kerberos enabled
> -----------------------------------------------------------------
>
>                 Key: HBASE-23679
>                 URL: https://issues.apache.org/jira/browse/HBASE-23679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 2.1.9, 2.2.4
>
> Spent the better part of a week chasing an issue on HBase 2.x where the number of DistributedFileSystem instances on the heap of a RegionServer would grow unbounded. Looking at multiple heap dumps, it was obvious that we had an immense number of DFS instances cached (in FileSystem$Cache) for the same user, each with a unique set of Tokens in its DFS's UGI member (one HBase delegation token and two HDFS delegation tokens -- we only do this for bulk loads).
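The caching pathology described above can be sketched in plain Java. This is an illustrative simplification, not Hadoop's actual code: the class and method names below are hypothetical stand-ins. The real FileSystem$Cache keys entries by (scheme, authority, UGI), and UGI equality is effectively subject-identity based (see the HADOOP-6670 discussion), so a UGI freshly minted for each bulk load -- carrying its own delegation tokens -- never matches an existing cache entry, and the cache grows without bound:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of the FileSystem$Cache leak. Ugi stands in
// for UserGroupInformation: it deliberately does NOT override equals() or
// hashCode(), mirroring UGI's identity-based equality.
class CacheLeakDemo {
    static final class Ugi {
        final String user;
        Ugi(String user) { this.user = user; }
        // No equals()/hashCode(): two Ugi objects for the same logical user
        // are never equal, just like two UGIs with fresh tokens.
    }

    // Stand-in for FileSystem$Cache: one entry per distinct key.
    static final Map<Ugi, Object> CACHE = new HashMap<>();

    static Object getFileSystem(Ugi ugi) {
        // Mirrors FileSystem.get(conf): return the cached instance for this
        // key, creating and caching one if absent.
        return CACHE.computeIfAbsent(ugi, k -> new Object());
    }

    static int simulateBulkLoads(int n) {
        for (int i = 0; i < n; i++) {
            // Each bulk load constructs a fresh Ugi for the same user, so
            // every call inserts a brand-new cache entry.
            getFileSystem(new Ugi("hbase"));
        }
        return CACHE.size();
    }
}
```

Running `simulateBulkLoads(100)` leaves 100 live entries in the cache for what is logically a single user -- the same unbounded growth seen in the heap dumps.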
> For the user's clusters, they eventually experienced a 10x performance degradation as RegionServers spent all of their time in JVM GC (they were unlucky enough that the RegionServers did not crash outright, which would have, albeit temporarily, fixed the issue).
> The problem appears to be two-fold, with the changes from HBASE-15291 being largely the cause. That issue tried to close FileSystem instances which were being leaked; however, it did so by instrumenting the method {{SecureBulkLoadManager.cleanupBulkLoad(..)}}. Two big issues with this approach:
> # It relies on clients to call this method (a client hanging up will leak resources in RegionServers).
> # This method is only called on the RegionServer hosting the first Region of the table which was bulk-loaded into. All other RegionServers are left to leak resources.
> HBASE-21342 later fixed an issue where FS objects were being closed prematurely, via reference counting (which appears to work fine), but it does not address the two issues above. Point #2 makes debugging this issue harder than normal because it doesn't manifest on a single-node instance :)
> Through all of this, I (re)learned the dirty history of UGI and how its caching doesn't work so great (HADOOP-6670). I see trying to continue to leverage the FileSystem$CACHE as a potentially dangerous thing (we've been back here multiple times already). My opinion at this point is that we should cleanly create a new FileSystem instance during the call to {{SecureBulkLoadManager#secureBulkLoadHFiles(..)}} and close it in a finally block in that same method. This both simplifies the lifecycle of a FileSystem instance in the bulk-load codepath and helps us avoid future problems with UGI and FS caching. The one downside is that we pay the penalty of creating a new FileSystem instance on each call, but I'm of the opinion that we cross that bridge when we get there.
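The proposed lifecycle can be sketched as follows. Again this is a minimal, self-contained illustration rather than the actual HBase patch: `Fs` is a hypothetical stand-in for an uncached FileSystem instance (in Hadoop, `FileSystem.newInstance(conf)` returns a non-shared instance that the caller owns and may safely close), and the method name mirrors the description above:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the proposed fix: create an uncached FileSystem per call and
// close it in a finally block, so its lifetime is bounded by the method.
class BulkLoadLifecycleSketch {
    // Counts live Fs instances so the bounded lifetime is observable.
    static final AtomicInteger OPEN = new AtomicInteger();

    // Hypothetical stand-in for an uncached FileSystem instance.
    static final class Fs {
        Fs() { OPEN.incrementAndGet(); }
        void close() { OPEN.decrementAndGet(); }
    }

    static void secureBulkLoadHFiles() {
        Fs fs = new Fs();   // created per call, never placed in a shared cache
        try {
            // ... stage and bulk-load the HFiles using fs ...
        } finally {
            fs.close();     // always released, even if the load fails
        }
    }
}
```

After any number of calls, no instances remain open, regardless of whether clients hang up or which RegionServer served the request -- the two leak conditions identified above no longer apply.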
> Thanks to [~jdcryans] and [~busbey] for their help along the way.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)