Well, not intentionally, anyway! We generally have "ForwardX11 yes" set when SSHing into the login node, so that X will be available for interactive jobs, but I'm not trying to do anything graphical here. To make sure I'm being clear, the workflow we have is:

1) SSH into login node (with X11 usually enabled but not intentionally used in this case)
2) qsub -pe openmpi <numslots> some-mpi-job.sh
3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task" (a stripped-down sketch of the script is pasted at the bottom of this message, below the quoted text)

Does it seem weird to find mpirun referencing qsh.c here? Now that you mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun "spawns remote processes via 'qrsh' so that SGE can control and monitor them"... so why *is* it talking about qsh in my case? Could this be the root of the problem?

Also, this PE is set up like this:

pe_name            openmpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

Thanks Reuti!

Jesse

On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:

>Hi,
>
>Am 18.06.2014 um 20:45 schrieb Connell, Jesse:
>
>> We've been having a seemingly-random problem with MPI jobs on our install
>> of Open Grid Scheduler 2011.11. For some varying length of time from when
>> the execd processes start up, MPI jobs running across multiple hosts will
>> run fine. Then, at some point, they will start failing at the mpirun step,
>> and will keep failing until execd is restarted on the affected hosts. They
>> then work again, before eventually failing, and so on. If I increase the
>> SGE debug level before calling mpirun in my job script, I see things like
>> this:
>>
>> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of
>
>qsh? Are you using an X11 session?
>
>-- Reuti
>
>
>> job 6805430 failed: failed sending task to execd@<hostname>: got send error
>>
>> ...but nothing more interesting that I can see. (I also get the same sort
>> of "send error" message from mpirun itself if I use its --mca
>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>> else.) Jobs that run on multiple cores on a single host are fine, but ones
>> that try to start up workers on additional hosts fail. Since restarting
>> execd makes it work again, I assumed the problem was on that end, and
>> tried dumping verbose log output for execd (using dl 10) to a file. But,
>> despite many thousands of lines, I can't spot anything that looks
>> different when the jobs start failing from when they are working, as far
>> as execd is concerned. Ordinary grid jobs (no parallel environment)
>> continue to run fine no matter what.
>>
>> So for now, I'm stumped! Any other ideas of what to look for, or thoughts
>> of what the unpredictable off-and-on behavior could possibly mean? Thanks
>> in advance,
>>
>> Jesse
>>
>> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.
>>
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
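
Here's the stripped-down sketch of the job script mentioned in step 3 above. The #$ directives are only illustrative boilerplate (the -pe request itself goes on the qsub command line), and "some-mpi-task" stands in for the real binary:

#!/bin/bash
# Illustrative SGE directives only: run the job under bash, from the
# submission directory.
#$ -S /bin/bash
#$ -cwd
# $NSLOTS is set by SGE from the "-pe openmpi <numslots>" request on the
# qsub line; Open MPI's SGE support is expected to pick up the granted
# host list on its own and start the remote ranks via qrsh.
mpirun -np $NSLOTS some-mpi-task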
