On Aug 9, 2012, at 11:54 AM, Andrew C. Connolly wrote:

> Today I sent a large batch to condor (~50 jobs), then tried to kill them
> all with 'condor_rm andy'.
> This removed them from the condor_q list of running processes, but left
> them all actually running.
> I then had to kill them all manually using htop.  
> 
> The only thing I did that was not normal was to start the jobs from within
> an ipython session, using "!" to escape to the shell (I don't think that
> should matter, but just in case). I then tried calling condor_rm from
> another bash shell...
> 
> The program I was running was a pyMVPA surface-based searchlight, which
> I believe uses a serial process for feature selection and only parallelizes
> after it starts computing the actual searchlight measures. I believe the
> programs were still in that serial feature-selection phase when I tried
> to kill them.
> 
> nb. this was run on our "hydra" cluster


I'm not familiar with ipython, so please excuse any questions that have obvious 
answers.

I see that ipython has the ability to submit jobs into a batch system. Is that
how you were submitting your jobs to Condor?

Were all of the jobs running on your local machine, or on multiple machines
in a cluster? Or, more technically, do you know which Condor universe the
jobs were running under?
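
(If the jobs have already left the queue, something like
'condor_history -format "%d\n" JobUniverse <cluster>' should still show it;
JobUniverse is a numeric attribute, and 5 means the vanilla universe.)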

I suspect that when you removed the jobs, Condor killed the processes that it 
spawned, but didn't know about some child processes they had in turn spawned. 
Condor uses several techniques to track the child processes of jobs, none of 
which are perfect.

First, Condor periodically looks for all processes that have the original job 
process as their parent. It then looks for processes whose parent is one of 
those processes, etc. It also looks for processes in the same process group as 
the original job process. This will miss a child process that's trying to 
escape the session that spawned it. Your mention of using "!" in ipython sounds 
suspiciously like that.
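
For illustration only (this is not Condor's code), here is a small Python
sketch of that escape case, assuming a Linux machine: a child that calls
setsid() before exec lands in a fresh session and process group, so once
its parent exits, neither the parent-chain walk nor the process-group scan
ties it back to the job.

    # escape_demo.py -- illustrative sketch, not Condor code.
    # A child spawned with setsid() leaves the session and process group
    # of its parent, which is exactly the case a PPID/PGID scan misses.
    import os
    import subprocess

    # Normal child: shares the parent's session and process group, so
    # both the parent-chain walk and the process-group scan find it.
    child = subprocess.Popen(["sleep", "60"])

    # Escaped child: setsid() runs in the child between fork and exec,
    # giving it its own session and process group.
    escaped = subprocess.Popen(["sleep", "60"], preexec_fn=os.setsid)

    print("my pgid:           ", os.getpgid(0))
    print("normal child pgid: ", os.getpgid(child.pid))
    print("escaped child pgid:", os.getpgid(escaped.pid))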

Second, Condor sets an environment variable in the job process. It can then 
identify child processes by the presence of that variable. This will miss child 
processes that sanitize their environment.
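
As a rough sketch of that idea (the marker name below is made up; the
variable Condor actually sets is different), a tracker can walk /proc and
look for its marker in each process's environment:

    # track_by_env.py -- sketch of environment-based tracking.
    # MARKER is a hypothetical stand-in for Condor's real variable name.
    import os

    MARKER = b"CONDOR_JOB_MARKER="

    def tagged_pids():
        """Return pids whose environment contains the marker variable."""
        pids = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                # /proc/<pid>/environ holds NUL-separated VAR=value pairs.
                with open("/proc/%s/environ" % entry, "rb") as f:
                    env = f.read().split(b"\0")
            except (IOError, OSError):
                continue  # process exited, or we lack permission
            if any(var.startswith(MARKER) for var in env):
                pids.append(int(entry))
        return pids

    print(tagged_pids())

A job that wipes its environment (env -i, or a clean login shell) drops off
this list, which is the failure mode described above.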

Third, Condor can add a supplemental group to a job process, then search for 
processes with this group. This option is disabled by default, as it requires 
the admin to set aside a range of group ids for Condor's use. See section 
3.12.11 of the Condor manual for details:
http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#sec:GroupTracking
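
For reference, enabling that looks roughly like this in the condor_config
on the execute machines (knob names as in that manual section; the GID
range is just an example your admin would pick):

    # Track job processes via a dedicated supplemental GID per slot.
    USE_GID_PROCESS_TRACKING = True
    # A range of otherwise-unused group ids reserved for Condor.
    MIN_TRACKING_GID = 750
    MAX_TRACKING_GID = 779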

Lastly, Condor can spawn jobs under accounts dedicated for use by Condor. Then, 
all processes owned by that account are part of the job. This option is also 
disabled by default, as it requires the admin to create a set of user accounts 
for Condor's use. See section 3.6.13.2 of the Condor manual for details:
http://research.cs.wisc.edu/condor/manual/v7.8/3_6Security.html#sec:RunAsNobody
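
Again for reference, a minimal configuration per that manual section looks
something like this (the account names are examples, and the accounts must
already exist on the execute machines):

    # Run jobs under dedicated per-slot accounts, not the submitting user.
    STARTER_ALLOW_RUNAS_OWNER = False
    SLOT1_USER = cndrusr1
    SLOT2_USER = cndrusr2
    # Mark these accounts as dedicated, so Condor may safely kill every
    # process owned by them when the job exits.
    DEDICATED_EXECUTE_ACCOUNT_REGEXP = cndrusr[0-9]+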

Thanks and regards,
Jaime Frey
UW-Madison Condor Team

