[
https://issues.apache.org/jira/browse/OODT-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imesha Sudasingha updated OODT-692:
-----------------------------------
Fix Version/s: (was: 2.0)
(was: 1.9)
> Use lsof to stop Workflow/Resource Manager task/job PIDs
> ---------------------------------------------------------
>
> Key: OODT-692
> URL: https://issues.apache.org/jira/browse/OODT-692
> Project: OODT
> Issue Type: Bug
> Components: pge wrapper framework, resource manager, workflow manager
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Major
> Labels: killjob, manager, oodt, pid, resource, unix, workflow
>
> We can exploit a combination of lsof, JobDir, and WorkflowInstanceId to
> find and kill the actual process and fully stop a job kicked off by the
> resource manager and workflow manager. I've been testing this procedure by
> hand on the ASO process and it works in practice, so we should
> automate it. For example:
> {noformat}
> [snowdeploy@trango-private bin]$ lsof -p 37558
> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
> idl 37558 snowdeploy cwd DIR 253,2 4096 488284165 /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
> ..
> {noformat}
> This reveals to us that process ID 37558 (one of the IDL jobs running in ASO
> for the ORTHO process) corresponds to _JobDir_:
> {noformat}
> /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
> {noformat}
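The PID-to-JobDir mapping shown above can be read mechanically from the `cwd` row of the lsof output. Below is a minimal sketch in Python, assuming the default column layout from the example (NAME is the ninth column and the first eight columns are all populated). The function name `cwd_from_lsof` is hypothetical and not part of OODT.

```python
def cwd_from_lsof(output):
    """Return the NAME field of the first row whose FD column is 'cwd', else None.

    Assumes the default layout: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME.
    """
    for line in output.splitlines():
        # Split into at most 9 fields so a NAME containing spaces stays intact.
        fields = line.split(None, 8)
        if len(fields) == 9 and fields[3] == "cwd":
            return fields[8].strip()
    return None


# Sample text copied from the lsof output quoted in this issue.
sample = (
    "COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME\n"
    "idl 37558 snowdeploy cwd DIR 253,2 4096 488284165 "
    "/data/jobs/CASI/ISSP/20140511f1_184151_1399903013836\n"
)
print(cwd_from_lsof(sample))
```

Parsing the human-readable table is fragile because some rows can have empty columns. An automated version would probably be better served by lsof's field output mode (`-F`), which is designed for consumption by other programs.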
> We can also find out from WorkflowInstanceMetadata that the Workflow
> Instance Id corresponding to the _JobDir_ containing _184151_ is
> _726af17c-c131-4682-845e-4ef6b4a7eeee_.
> So, from a Workflow Instance Id, we need:
> # the JobDir resolved by CAS-PGE. If it's not a CAS-PGE job, we need the
> WorkflowTask to specify a JobDir, or else this functionality will simply
> print out a message saying "Kill without JobDir not supported".
> # a map of processes to interrogate with lsof, e.g., PCS_JobKillProcessName
> # the use of lsof to interrogate the PID table, find the process whose
> working directory is the JobDir, and then kill it. If PCS_JobKillProcessName
> is not specified, then interrogate all processes to determine the one to kill.
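The three steps above could be automated roughly as follows. This is a sketch in Python under stated assumptions, not the OODT implementation: the process table is passed in as `(pid, command, cwd)` tuples (for example, gathered from lsof), a process belongs to the job when its working directory equals the JobDir exactly, and the names `select_pids` and `kill_job` are hypothetical. The second table row (PID 40001, `/tmp`) is made up for illustration.

```python
import os
import signal


def select_pids(job_dir, processes, process_names=None):
    """Return PIDs whose cwd equals job_dir.

    processes: iterable of (pid, command, cwd) tuples.
    process_names: optional set of command names to restrict the search to
    (standing in for PCS_JobKillProcessName); if empty or None, every
    process in the table is considered.
    """
    if not job_dir:
        raise ValueError("Kill without JobDir not supported")
    target = os.path.normpath(job_dir)
    return [
        pid
        for pid, command, cwd in processes
        if (not process_names or command in process_names)
        and os.path.normpath(cwd) == target
    ]


def kill_job(job_dir, processes, process_names=None, kill=os.kill):
    """Send SIGTERM to every matching PID and return the list of PIDs signalled."""
    pids = select_pids(job_dir, processes, process_names)
    for pid in pids:
        kill(pid, signal.SIGTERM)
    return pids


# Dry run with a fake kill function so that no real process is signalled.
job_dir = "/data/jobs/CASI/ISSP/20140511f1_184151_1399903013836"
table = [
    (37558, "idl", job_dir),
    (40001, "java", "/tmp"),  # illustrative entry, not from the issue
]
signalled = []
kill_job(job_dir, table, {"idl"}, kill=lambda pid, sig: signalled.append(pid))
print(signalled)
```

Injecting the `kill` callable keeps the selection logic testable without touching live processes. Whether the match should be an exact directory comparison or a prefix match on the JobDir is left open by the issue.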
--
This message was sent by Atlassian Jira
(v8.3.4#803005)