On 19.06.2014 at 14:34, Connell, Jesse wrote:

> Well, not intentionally, anyway! We generally have "ForwardX11 yes" set
> when SSHing into the login node, so that X will be available for
> interactive jobs, but I'm not trying to do anything graphical here. To
> make sure I'm being clear, the workflow we have is:
> 1) SSH into login node (with X11 usually enabled but not intentionally
>    used in this case)
> 2) qsub -pe openmpi <numslots> some-mpi-job.sh
> 3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>
> Does it seem weird to find mpirun referencing qsh.c here?

Sorry, I overlooked the symlink of `qrsh` to `qsh`. Hence it's the right application.

-- Reuti

> Now that you mention it, http://www.open-mpi.org/faq/?category=sge says
> that mpirun "spawns remote processes via 'qrsh' so that SGE can control
> and monitor them"... so why *is* it talking about qsh in my case? Could
> this be the root of the problem?
>
> Also, this PE is set up like this:
>
> pe_name            openmpi
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> Thanks, Reuti!
>
> Jesse
>
>
> On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>
>> Hi,
>>
>> On 18.06.2014 at 20:45, Connell, Jesse wrote:
>>
>>> We've been having a seemingly random problem with MPI jobs on our
>>> install of Open Grid Scheduler 2011.11. For some varying length of
>>> time after the execd processes start up, MPI jobs running across
>>> multiple hosts run fine. Then, at some point, they start failing at
>>> the mpirun step, and keep failing until execd is restarted on the
>>> affected hosts. They then work again, before eventually failing, and
>>> so on. If I increase the SGE debug level before calling mpirun in my
>>> job script, I see things like this:
>>>
>>> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of
>>
>> qsh? Are you using an X11 session?
>>
>> -- Reuti
>>
>>> job 6805430 failed: failed sending task to execd@<hostname>: got send
>>> error
>>>
>>> ...but nothing more interesting that I can see. (I also get the same
>>> sort of "send error" message from mpirun itself if I use its --mca
>>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>>> else.) Jobs that run on multiple cores on a single host are fine, but
>>> ones that try to start up workers on additional hosts fail. Since
>>> restarting execd makes it work again, I assumed the problem was on
>>> that end, and tried dumping verbose log output for execd (using dl 10)
>>> to a file. But despite many thousands of lines, I can't spot anything
>>> that looks different, as far as execd is concerned, between when the
>>> jobs are failing and when they are working. Ordinary grid jobs (no
>>> parallel environment) continue to run fine no matter what.
>>>
>>> So for now, I'm stumped! Any other ideas of what to look for, or
>>> thoughts on what the unpredictable off-and-on behavior could mean?
>>> Thanks in advance,
>>>
>>> Jesse
>>>
>>> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
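
[Editor's note: for readers reproducing this setup, a minimal sketch of the
submission script described in steps 2 and 3 of the workflow above. The PE
name, script name, and task name come from the thread; the slot count,
shell, and other qsub options are assumptions.]

  #!/bin/sh
  # Sketch of some-mpi-job.sh as described in the thread.
  # "openmpi" is the PE shown above; "some-mpi-task" is a placeholder.
  #$ -S /bin/sh
  #$ -cwd
  #$ -pe openmpi 16

  # With control_slaves TRUE in the PE, an SGE-aware Open MPI build starts
  # its remote daemons via qrsh, so no hostfile is passed here; $NSLOTS is
  # set by SGE to the number of slots granted to the job.
  mpirun -np $NSLOTS ./some-mpi-task

Submission then matches step 2 above: qsub -pe openmpi <numslots>
some-mpi-job.sh (a -pe request on the command line takes precedence over
the one embedded in the script).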

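[Editor's note: two quick checks related to the points raised in the thread;
paths and environment depend on the local installation, so treat these as a
sketch rather than exact commands.]

  # Confirm the qrsh -> qsh symlink Reuti mentions, which explains why the
  # debug output references qsh.c even though mpirun calls qrsh. Assumes
  # qrsh is on the PATH of the login node.
  ls -l "$(which qrsh)"

  # Confirm the Open MPI build has gridengine support, so mpirun really
  # does spawn remote processes via qrsh as the FAQ linked above states.
  ompi_info | grep gridengine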