[ https://issues.apache.org/jira/browse/MAPREDUCE-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767401#comment-13767401 ]

Xi Fang commented on MAPREDUCE-5508:
------------------------------------

[~sandyr] Thanks for your comments.

bq. Have you tested this fix?

Yes. We have tested this fix on our test cluster (about 130,000 job 
submissions). After the workflow was done, we waited a couple of minutes 
(while jobs were retiring), forced a GC, and then dumped the memory. We 
manually checked FileSystem#Cache. There was no memory leak.

bq. For your analysis 

1. I agree that "it doesn't appear that tempDirFs and fs are ever even ending 
up equal because tempDirFs is created with the wrong UGI."
2. I think tempDir is fine because 1) JobInProgress#cleanupJob doesn't 
introduce a file system instance for tempDir, and 2) the fs in 
CleanupQueue#deletePath is reused (i.e. only one instance ever exists in 
FileSystem#Cache). My initial thought was that this part had a memory leak, 
but a test shows there is no problem here.
3. The problem is actually 
{code}
tempDirFs = jobTempDirPath.getFileSystem(conf);
{code}
The problem is that this call "MAY" (I will explain below) put a new entry 
in FileSystem#Cache. Note that it eventually goes into 
UserGroupInformation#getCurrentUser to get a UGI with the current 
AccessControlContext. CleanupQueue#deletePath won't close this entry, because a 
different UGI (the "userUGI" created in JobInProgress) is used there. Here is 
the tricky part, which [~cnauroth], [~vinodkv], and I discussed at length: 
although we may only have one current user, the following code "MAY" return 
different Subject instances.
{code}
 static UserGroupInformation getCurrentUser() throws IOException {
    AccessControlContext context = AccessController.getContext();
    Subject subject = Subject.getSubject(context);   // <-- may differ per context
    ...
  }
{code}
Because a FileSystem#Cache entry uses the identityHashCode of a Subject to 
construct its key, a file system object created by 
"jobTempDirPath.getFileSystem(conf)" may not be found the next time this code 
executes, even though we have the same principal (i.e. the current user). This 
eventually leads to an unbounded number of file system instances in 
FileSystem#Cache, and nothing ever removes them from the cache.
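The two behaviors above (reuse in point 2, the leak in point 3) can be sketched with a toy identity-keyed cache. This is a minimal illustration, not Hadoop code: the class and principal names are hypothetical, and an IdentityHashMap stands in for FileSystem#Cache, whose key compares the UGI's Subject by identity.

```java
import javax.security.auth.Subject;
import java.security.Principal;
import java.util.IdentityHashMap;
import java.util.Map;

public class CacheKeyDemo {
    // Hypothetical stand-in for a user principal, for illustration only.
    static final class User implements Principal {
        private final String name;
        User(String name) { this.name = name; }
        public String getName() { return name; }
        @Override public boolean equals(Object o) {
            return o instanceof User && ((User) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    // Toy stand-in for FileSystem#Cache: entries are keyed by Subject
    // *identity*, mirroring how the real cache key behaves when two UGIs
    // wrap different Subject instances.
    static final Map<Subject, Object> CACHE = new IdentityHashMap<>();

    static Object getFs(Subject subject) {
        // Create and cache a new "file system" on a cache miss.
        return CACHE.computeIfAbsent(subject, s -> new Object());
    }

    public static void main(String[] args) {
        Subject userUgi = new Subject();
        userUgi.getPrincipals().add(new User("jobtracker"));

        // Point 2: CleanupQueue#deletePath always passes the same userUGI,
        // so every lookup reuses the single cached instance.
        System.out.println(getFs(userUgi) == getFs(userUgi));  // true

        // Point 3: getCurrentUser may hand back a *different* Subject for
        // the same principal; the identity-keyed cache then misses, and a
        // fresh entry is added that nothing ever removes.
        Subject sameUserNewSubject = new Subject();
        sameUserNewSubject.getPrincipals().add(new User("jobtracker"));
        System.out.println(userUgi.getPrincipals()
                .equals(sameUserNewSubject.getPrincipals()));            // true
        System.out.println(getFs(userUgi) == getFs(sameUserNewSubject)); // false
        System.out.println(CACHE.size());                                // 2
    }
}
```

Each call that reaches getCurrentUser in a new AccessControlContext plays the role of sameUserNewSubject here, which is why the cache grows without bound.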
 
Please let me know if you have any questions. 
                
> JobTracker memory leak caused by unreleased FileSystem objects in 
> JobInProgress#cleanupJob
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5508
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5508
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1-win, 1.2.1
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>            Priority: Critical
>         Attachments: MAPREDUCE-5508.patch
>
>
> MAPREDUCE-5351 fixed a memory leak problem but introduced another filesystem 
> object (see "tempDirFs") that is not properly released.
> {code} JobInProgress#cleanupJob()
>   void cleanupJob() {
> ...
>           tempDirFs = jobTempDirPath.getFileSystem(conf);
>           CleanupQueue.getInstance().addToQueue(
>               new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
> ...
>  if (tempDirFs != fs) {
>       try {
>         fs.close();
>       } catch (IOException ie) {
> ...
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira