Probably it is Maker that does not handle signals properly? Maybe you can try using a script to run the job, rather than running the binary directly, to see if that works. You can also add some signal-handling commands in your script to check...
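For example, a minimal wrapper sketch along these lines might help (the script name run_maker.sh and the maker options are placeholders, and the signal theory is only a guess): it traps the TERM/INT that gridengine delivers and forwards it to mpiexec, so the remote maker ranks get a chance to shut down instead of lingering on the compute nodes.

#!/bin/bash
# run_maker.sh -- hypothetical wrapper; adjust paths and options for your site.
# Forward SIGTERM/SIGINT from gridengine to mpiexec so the maker processes
# on the compute nodes are terminated when the job is killed or deleted.

MPIEXEC=/opt/mpich-install/bin/mpiexec

cleanup() {
    echo "caught signal, terminating mpiexec (pid $MPI_PID)" >&2
    kill -TERM "$MPI_PID" 2>/dev/null
    wait "$MPI_PID"
    exit 1
}

trap cleanup TERM INT

# Run mpiexec in the background so the trap can fire while we wait.
"$MPIEXEC" maker "$@" &
MPI_PID=$!

wait "$MPI_PID"
exit $?

You would then submit the script instead of the binary, e.g. qsub -cwd -N <NAME> -V -pe mpi <CPUs> ./run_maker.sh <maker options> (dropping -b y, since the job is now a script). If the processes still survive, the problem is probably elsewhere in the PE setup rather than in signal handling.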
Best,
Feng

On Tue, Nov 13, 2018 at 7:07 PM <ad...@genome.arizona.edu> wrote:
>
> We have a cluster with gridengine 6.5u2 and are noticing strange behavior
> when running MPI jobs. Our application will finish, yet the processes
> continue to run and use up the CPU. We did configure a parallel
> environment for MPI as follows:
>
> pe_name            mpi
> slots              500
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> Then we run our application "Maker" like this:
>
> qsub -cwd -N <NAME> -b y -V -pe mpi <CPUs> \
>     /opt/mpich-install/bin/mpiexec maker <maker options>
>
> It seems to run fine and qstat will show it running. Once it has
> completed, qstat is empty again and we have the desired output.
> However, the "maker" processes continue to run on the compute nodes
> until I log in to each node and "kill -9" the processes. We did not have
> this problem when running mpiexec directly with Maker, or when running
> Maker in stand-alone mode (without MPI), so I guess it is a problem with
> our qsub command or parallel environment? Any ideas?
>
> Thanks,
> --
> Chandler / Systems Administrator
> Arizona Genomics Institute
> www.genome.arizona.edu