[ 
https://issues.apache.org/jira/browse/HADOOP-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Yamijala resolved HADOOP-4937.
--------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.20.0
     Hadoop Flags: [Reviewed]

I committed this to trunk and the Hadoop 0.20 branch, as it will make it easier 
to fix HADOOP-4938, an important bug that we have been facing for a long while now.

Thanks, Peeyush!

> [HOD] Include ringmaster RPC port information in the notes attribute
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4937
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4937
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>            Assignee: Peeyush Bishnoi
>             Fix For: 0.20.0
>
>         Attachments: hadoop-4937-1.txt, hadoop-4937-2.txt, hadoop-4937.txt
>
>
> In large cluster deployments, node failures sometimes cause HOD clusters to be 
> allocated but not deallocated, even after the cluster's idleness limit (the 
> time for which no jobs are run) is exceeded. One of the main reasons for this 
> is that the ringmaster process, which is responsible for tracking and cleaning 
> up an idle cluster (of which it is a part), itself goes down. To handle such 
> scenarios, it makes sense to centrally track the ringmaster nodes of 
> suspicious clusters. But since the port the ringmaster is bound to is not 
> centrally available, such monitoring is currently impossible.
> This issue is an enhancement request to include ringmaster RPC port 
> information along with the JT and NN info as part of the resource manager's 
> notes attribute so that it can be used by any monitoring processes built 
> around it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
