[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767210#comment-13767210
 ] 

Xi Fang commented on MAPREDUCE-5508:
------------------------------------

This bug was found in Microsoft's large-scale test with about 200,000 job 
submissions, during which memory usage grew steadily.

There was a long discussion between Hortonworks (thanks [~cnauroth] and 
[~vinodkv]) and Microsoft on this issue. Here is a summary of the discussion.

1. The heap dumps show DistributedFileSystem instances that are referenced only 
from the cache's HashMap entries. Since nothing else holds a reference, nothing 
else can ever attempt to close them, and therefore they will never be removed 
from the cache.
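
As an illustration only (hypothetical names, plain Java, no Hadoop dependency), the retention pattern in point 1 can be sketched as a static cache that holds strong references to everything it hands out:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLeakDemo {
    // Hypothetical stand-in for FileSystem.Cache: a static map holding
    // strong references to every instance it hands out.
    static final Map<String, Object> CACHE = new HashMap<>();

    // Models FileSystem.get(): returns the cached instance, creating one on a miss.
    static Object get(String key) {
        // The Object value stands in for a DistributedFileSystem.
        return CACHE.computeIfAbsent(key, k -> new Object());
    }

    public static void main(String[] args) {
        // Each job submission fetches a FileSystem under a fresh key
        // but never closes it.
        for (int i = 0; i < 200_000; i++) {
            Object fs = get("key-for-job-" + i);
            // The local reference goes out of scope here, but the cache's
            // strong reference keeps the instance alive. Nothing else can
            // reach it to close it, so it is never evicted: a leak.
        }
        System.out.println(CACHE.size()); // 200000 entries retained
    }
}
```

In the real cache, eviction happens only from FileSystem#close(); with no external reference left, close() can never be called, which matches what the heap dumps showed.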

2. The special check for "tempDirFS" (see the code in the description) in the 
patch for MAPREDUCE-5351 is intended as an optimization so that CleanupQueue 
does not need to immediately reopen a FileSystem that was just closed. However, 
we observed different identity hash code values for the Subject used in the 
cache key. The code assumes that CleanupQueue will find the same Subject that 
was used inside JobInProgress. Unfortunately, this is not guaranteed, because 
we may have crossed into a different access control context at this point, via 
UserGroupInformation#doAs. Even though it is conceptually the same user, the 
Subject is a function of the current AccessControlContext:
{code}
  public synchronized
  static UserGroupInformation getCurrentUser() throws IOException {
    AccessControlContext context = AccessController.getContext();
    Subject subject = Subject.getSubject(context);
{code}
Even if the contexts are logically equivalent between JobInProgress and 
CleanupQueue, there is no guarantee that Java will return the same Subject 
instance, and the same instance is required for a successful lookup in the 
FileSystem cache (because the cache key uses the Subject's identity hash code).
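
A minimal sketch of the lookup failure (a hypothetical Key class modeled loosely on the FileSystem cache key, which compares the UGI's Subject by identity): two Subject instances for the same logical user produce different cache keys, so the second context never finds the FileSystem cached by the first:

```java
import java.util.HashMap;
import java.util.Map;

public class FsCacheKeyDemo {
    // Hypothetical model of the FileSystem cache key: equality is based on
    // Subject *identity* (== and System.identityHashCode), not logical
    // equality of the user.
    static final class Key {
        final Object subject; // stands in for the UGI's Subject

        Key(Object subject) { this.subject = subject; }

        @Override public int hashCode() { return System.identityHashCode(subject); }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).subject == this.subject;
        }
    }

    public static void main(String[] args) {
        Map<Key, String> cache = new HashMap<>();

        // Two Subject instances for the same logical user, e.g. one seen
        // inside JobInProgress and another materialized later in
        // CleanupQueue under a different AccessControlContext.
        Object subjectInJobInProgress = new Object();
        Object subjectInCleanupQueue = new Object();

        cache.put(new Key(subjectInJobInProgress), "DistributedFileSystem instance");

        // The lookup from the second context misses even though the user is
        // "the same": the identity differs, so the cached FileSystem is
        // never found -- and therefore never closed.
        System.out.println(cache.containsKey(new Key(subjectInCleanupQueue)));  // false
        System.out.println(cache.containsKey(new Key(subjectInJobInProgress))); // true
    }
}
```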

The fix is to abandon this optimization and close the FileSystem within the 
same AccessControlContext that opened it.
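
Under the same simplified model (a sketch, not the actual patch), the fix amounts to opening and closing while the same Subject instance is in scope, so the close-time lookup hits the cached entry and releases it:

```java
import java.util.HashMap;
import java.util.Map;

public class SameContextCloseDemo {
    // Hypothetical stand-in for the FileSystem cache, keyed by the Subject
    // instance of the opening context.
    static final Map<Object, String> CACHE = new HashMap<>();

    // Models opening a FileSystem within a given context.
    static String open(Object subject) {
        return CACHE.computeIfAbsent(subject, s -> "fs@" + System.identityHashCode(s));
    }

    // Models FileSystem#close(): evicts the cached entry for this context.
    static void close(Object subject) {
        CACHE.remove(subject);
    }

    public static void main(String[] args) {
        Object subject = new Object(); // the Subject of one doAs context

        // Open and close under the same Subject instance: the close-time
        // lookup uses the identical key, so the entry is found and released.
        open(subject);
        close(subject);
        System.out.println(CACHE.isEmpty()); // true: nothing is retained
    }
}
```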

                
> Memory leak caused by unreleased FileSystem objects in 
> JobInProgress#cleanupJob
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5508
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5508
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1-win
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>            Priority: Critical
>
> MAPREDUCE-5351 fixed a memory leak problem but introduced another FileSystem 
> object that is not properly released.
> {code} JobInProgress#cleanupJob()
>   void cleanupJob() {
> ...
>           tempDirFs = jobTempDirPath.getFileSystem(conf);
>           CleanupQueue.getInstance().addToQueue(
>               new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
> ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
