On Aug 9, 2012, at 11:54 AM, Andrew C. Connolly wrote:

> Today I sent a large batch to condor (~50 jobs), then tried to kill them
> all with 'condor_rm andy'. This removed them from the condor_q list of
> running processes, but left them all actually running. I then had to kill
> them all manually using htop.
>
> The only thing I did that was not normal was to start the jobs from within
> an ipython session using "!" to escape to the shell (I don't think that
> should matter, but just in case). I then tried calling condor_rm from
> another bash shell...
>
> The program I was running was a PyMVPA surface-based searchlight, which
> I believe uses a serial process for feature selection and only parallelizes
> after starting to compute the actual searchlight measures. And I believe
> the programs were in the feature-selection phase (serial processes) when I
> tried to kill them.
>
> N.B. this was run on our "hydra" cluster.
I'm not familiar with ipython, so please excuse any questions that have obvious answers. I see that ipython has the ability to submit jobs into a batch system. Is this feature how you were submitting jobs into Condor? Were all of the jobs running on your local machine or on multiple machines in a cluster? Or, more technically, do you know which Condor universe the jobs were running under?

I suspect that when you removed the jobs, Condor killed the processes that it spawned, but didn't know about some child processes they had in turn spawned. Condor uses several techniques to track the child processes of jobs, none of which are perfect.

First, Condor periodically looks for all processes that have the original job process as their parent. It then looks for processes whose parent is one of those processes, and so on. It also looks for processes in the same process group as the original job process. This will miss a child process that escapes the session that spawned it. Your mention of using "!" in ipython sounds suspiciously like that.

Second, Condor sets an environment variable in the job process. It can then identify child processes by the presence of that variable. This will miss child processes that sanitize their environment.

Third, Condor can add a supplemental group to a job process, then search for processes with that group. This option is disabled by default, as it requires the admin to set aside a range of group ids for Condor's use. See section 3.12.11 of the Condor manual for details:
http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#sec:GroupTracking

Lastly, Condor can spawn jobs under accounts dedicated for use by Condor. Then, all processes owned by that account are part of the job. This option is also disabled by default, as it requires the admin to create a set of user accounts for Condor's use.
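To make the first technique concrete, here is a minimal Python sketch of walking /proc on Linux to collect a job's "family": the original pid, its descendants via parent links, and anything sharing its process group. This is an illustration only, not HTCondor's actual implementation; the helper names `proc_stat` and `job_family` are mine.

```python
# Minimal sketch (NOT HTCondor's actual code) of tracking a job's child
# processes on Linux: walk /proc, follow parent (ppid) links repeatedly,
# and also pick up anything in the root process's process group (pgrp).
import os

def proc_stat(pid):
    """Return (ppid, pgrp) for a pid from /proc/<pid>/stat, or None if gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
    except OSError:
        return None
    # The comm field may contain spaces or parens, so split after the
    # last ')': fields[0] is state, fields[1] is ppid, fields[2] is pgrp.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[1]), int(fields[2])

def job_family(root_pid):
    """Pids reachable from root_pid via parent links, or in its pgrp."""
    stats = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            s = proc_stat(int(entry))
            if s is not None:
                stats[int(entry)] = s
    family = {root_pid}
    root_pgrp = stats.get(root_pid, (0, -1))[1]
    changed = True
    while changed:  # iterate until no new descendants are discovered
        changed = False
        for pid, (ppid, pgrp) in stats.items():
            if pid not in family and (ppid in family or pgrp == root_pgrp):
                family.add(pid)
                changed = True
    return family
```

Note that a child which calls setsid() (or double-forks so init adopts it) leaves both the parent chain and the process group, which is exactly the escape described above and why this technique alone is not sufficient.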
See section 3.6.13.2 of the Condor manual for details:
http://research.cs.wisc.edu/condor/manual/v7.8/3_6Security.html#sec:RunAsNobody

Thanks and regards,
Jaime Frey
UW-Madison Condor Team