[ 
https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669554#action_12669554
 ] 

Hemanth Yamijala commented on HADOOP-4938:
------------------------------------------

Peeyush, as we discussed, please make the following changes:

- Pass options as command line parameters. I think this will be easier to 
manage for now. Look at how logcondense.py works.
- The state file and the log file locations should be configurable. The 
defaults can be /tmp and /var/log, respectively.
- The code checks whether the sum of runningJobs and submittedJobs is less 
than the number stored in the state file. Since submittedJobs already 
includes runningJobs, you don't need to sum them.
- The SMTP recipient address should be configurable. Also, does the library 
you are using support multiple addresses and a remote SMTP host?
- Submit this as a patch. I think the file should be under 
$HOD_HOME/support.
- Include the ASF header in the file.
- Can you also submit documentation for this in Forrest?
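
To illustrate the first four points, here is a minimal sketch, not the attached externalIdleTracker.py. It uses the standard argparse and smtplib modules, which may differ from what the script or logcondense.py actually uses. All option names, function names and example addresses are hypothetical, and the comparison only mirrors the check described above without assuming what the stored number represents.

```python
import argparse
import smtplib
from email.message import EmailMessage


def parse_args(argv=None):
    """Parse tracker options. Every option name here is hypothetical."""
    parser = argparse.ArgumentParser(description="Idle cluster tracker (sketch)")
    parser.add_argument("--state-dir", default="/tmp",
                        help="directory holding the state file")
    parser.add_argument("--log-dir", default="/var/log",
                        help="directory holding the log file")
    parser.add_argument("--smtp-host", default="localhost",
                        help="SMTP host, which may be a remote machine")
    parser.add_argument("--recipient", dest="recipients", action="append",
                        help="recipient address; repeat for several")
    args = parser.parse_args(argv)
    args.recipients = args.recipients or []
    return args


def below_stored_count(submitted_jobs, stored_count):
    """Compare against the number kept in the state file.

    submittedJobs already includes runningJobs, so it is compared alone
    instead of runningJobs + submittedJobs.
    """
    return submitted_jobs < stored_count


def build_alert(sender, recipients, subject, body):
    """Build one message addressed to every configured recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_alert(msg, smtp_host):
    """Deliver through the configured SMTP host."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

With these assumed names, an invocation could look like 
`--state-dir /tmp --recipient a@example.com --recipient b@example.com --smtp-host mail.example.com`.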

> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4938
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4938
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>            Assignee: Peeyush Bishnoi
>         Attachments: externalIdleTracker.py
>
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty 
> nodes on which the ringmaster process comes up may go down after the cluster 
> is successfully allocated. Such clusters fail to deallocate automatically 
> even if the idleness limit of the cluster is exceeded. This is because the 
> idleness is tracked by the ringmaster process which itself has gone down.
> As a large number of nodes can get held up due to this, such clusters should 
> be detected and deallocated in some manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
