Huh, now here's something interesting, too... I ran strace on mpirun to try to find out more, and I saw that just before "got send error" it tries to connect to port 537 on the worker system. If I nmap scan across the queue I'm testing with, the two machines I've been throwing MPI jobs at to debug this (host04 and host05 below) show port 537 as closed, while most of the rest show it as open. The three other closed ones near the end are systems that previously had this problem but where I didn't restart execd, since jobs were still running. (So, presumably, MPI jobs would still fail on those too, but would work for a while on the rest if I tried.)
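(In case it helps anyone reproduce this, the strace and port check were roughly along these lines; the trace file path is just an arbitrary spot I picked, "some-mpi-task" stands in for the real binary from my job script, and the grep pattern assumes strace's usual sockaddr decoding:)

# in the job script, wrap the real mpirun line so network syscalls get logged
strace -f -e trace=network -o /tmp/mpirun.trace mpirun -np $NSLOTS some-mpi-task

# then look for the connect() attempt that shows up just before "got send error"
grep 'sin_port=htons(537)' /tmp/mpirun.trace

# and on a suspect worker, check whether anything is listening on that port
netstat -tlnp | grep ':537 '

Anyway, here's the scan across the queue's hosts: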
# Nmap 5.51 scan initiated Thu Jun 19 09:14:39 2014 as: nmap -R -p 537 -oG - 10.241.71.11-26
Host: 10.241.71.11 (host01.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.12 (host02.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.13 (host03.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.14 (host04.fqdn)  Ports: 537/closed/tcp/////
Host: 10.241.71.15 (host05.fqdn)  Ports: 537/closed/tcp/////
Host: 10.241.71.16 (host06.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.17 (host07.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.18 (host08.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.19 (host09.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.20 (host10.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.21 (host11.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.22 (host12.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.23 (host13.fqdn)  Ports: 537/closed/tcp/////
Host: 10.241.71.24 (host14.fqdn)  Ports: 537/closed/tcp/////
Host: 10.241.71.25 (host15.fqdn)  Ports: 537/open/tcp/////
Host: 10.241.71.26 (host16.fqdn)  Ports: 537/closed/tcp/////
# Nmap done at Thu Jun 19 09:14:39 2014 -- 16 IP addresses (16 hosts up) scanned in 0.04 seconds

...so it looks like when things go "bad," something changes about how execd is (or is not) listening on port 537. Does that possibly isolate my problem from anything related to MPI or my PE? But why do ordinary qsub'd jobs still run fine, even on those "broken" systems? This all seems even more bizarre to me now :)

Jesse

On 6/19/14, 8:34 AM, "Connell, Jesse" <[email protected]> wrote:

>Well, not intentionally, anyway! We generally have "ForwardX11 yes" set
>when SSHing into the login node, so that X will be available for
>interactive jobs, but I'm not trying to do anything graphical here. To
>make sure I'm being clear, the workflow we have is:
>
>1) SSH into login node (with X11 usually enabled but not intentionally
>   used in this case)
>2) qsub -pe openmpi <numslots> some-mpi-job.sh
>3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>
>Does it seem weird to find mpirun referencing qsh.c here? Now that you
>mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun
>"spawns remote processes via 'qrsh' so that SGE can control and monitor
>them"... so why *is* it talking about qsh in my case? Could this be the
>root of the problem?
>
>Also, this PE is set up like this:
>
>pe_name            openmpi
>slots              999
>user_lists         NONE
>xuser_lists        NONE
>start_proc_args    /bin/true
>stop_proc_args     /bin/true
>allocation_rule    $round_robin
>control_slaves     TRUE
>job_is_first_task  FALSE
>urgency_slots      min
>accounting_summary FALSE
>
>Thanks Reuti!
>
>Jesse
>
>
>On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>
>>Hi,
>>
>>On 18.06.2014 at 20:45, Connell, Jesse wrote:
>>
>>> We've been having a seemingly-random problem with MPI jobs on our
>>> install of Open Grid Scheduler 2011.11. For some varying length of
>>> time from when the execd processes start up, MPI jobs running across
>>> multiple hosts will run fine. Then, at some point, they will start
>>> failing at the mpirun step, and will keep failing until execd is
>>> restarted on the affected hosts. They then work again, before
>>> eventually failing, and so on. If I increase the SGE debug level
>>> before calling mpirun in my job script, I see things like this:
>>>
>>> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of
>>
>>qsh? Are you using an X11 session?
>>
>>-- Reuti
>>
>>
>>> job 6805430 failed: failed sending task to execd@<hostname>: got send error
>>>
>>> ...but nothing more interesting that I can see. (I also get the same
>>> sort of "send error" message from mpirun itself if I use its --mca
>>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>>> else.) Jobs that run on multiple cores on a single host are fine, but
>>> ones that try to start up workers on additional hosts fail. Since
>>> restarting execd makes it work again, I assumed the problem was on
>>> that end, and tried dumping verbose log output for execd (using dl 10)
>>> to a file. But, despite many thousands of lines, I can't spot anything
>>> that looks different when the jobs start failing from when they are
>>> working, as far as execd is concerned. Ordinary grid jobs (no parallel
>>> environment) continue to run fine no matter what.
>>>
>>> So for now, I'm stumped! Any other ideas of what to look for, or
>>> thoughts of what the unpredictable off-and-on behavior could possibly
>>> mean? Thanks in advance,
>>>
>>> Jesse
>>>
>>> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
