Huh, now here's something interesting, too... I ran strace on mpirun to
try to find out more, and I saw that just before "got send error" it tries
to connect to port 537 on the worker system.  If I run an nmap scan across
the queue I'm testing with, the two machines I've been throwing MPI jobs
at to debug this (host04 and host05 below) show port 537 as closed, while
most of the rest show it as open.  The three other closed ones near the
end are systems that previously had this problem but where I didn't
restart execd, since jobs were still running on them.  (So, presumably,
MPI jobs would still fail on those too, but would work for a while on the
rest if I tried.)
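
(For reference, the strace invocation was roughly along these lines --
treat it as a sketch; "/tmp/mpirun.trace" and "some-mpi-task" are just
placeholders for the real paths here:

  strace -f -e trace=network -o /tmp/mpirun.trace \
      mpirun -np $NSLOTS ./some-mpi-task
  grep 'connect(' /tmp/mpirun.trace

The connect() to port 537 on the remote host is the last network call I
see before the "got send error" message.)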

# Nmap 5.51 scan initiated Thu Jun 19 09:14:39 2014 as: nmap -R -p 537 -oG
- 10.241.71.11-26
Host: 10.241.71.11 (host01.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.12 (host02.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.13 (host03.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.14 (host04.fqdn)        Ports: 537/closed/tcp/////
Host: 10.241.71.15 (host05.fqdn)        Ports: 537/closed/tcp/////
Host: 10.241.71.16 (host06.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.17 (host07.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.18 (host08.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.19 (host09.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.20 (host10.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.21 (host11.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.22 (host12.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.23 (host13.fqdn)        Ports: 537/closed/tcp/////
Host: 10.241.71.24 (host14.fqdn)        Ports: 537/closed/tcp/////
Host: 10.241.71.25 (host15.fqdn)        Ports: 537/open/tcp/////
Host: 10.241.71.26 (host16.fqdn)        Ports: 537/closed/tcp/////
# Nmap done at Thu Jun 19 09:14:39 2014 -- 16 IP addresses (16 hosts up)
scanned in 0.04 seconds



... so it looks like when things go "bad," something changes about whether
execd is still listening on port 537.  Does that maybe rule out MPI or my
PE configuration as the cause?  But then why do ordinary qsub'd jobs still
run fine, even on those "broken" systems?  This all seems even more
bizarre to me now :)
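
(For what it's worth, the next comparison I can make between a "good" and
a "bad" host is just standard tooling, nothing SGE-specific -- a rough
sketch:

  # which process, if any, is bound to 537?
  lsof -nP -iTCP:537 -sTCP:LISTEN
  # or, equivalently:
  netstat -tlnp | grep ':537 '

...to confirm whether it's execd itself that normally holds that port.)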

Jesse


On 6/19/14, 8:34 AM, "Connell, Jesse" <[email protected]> wrote:

>Well, not intentionally, anyway!  We generally have "ForwardX11 yes" set
>when SSHing into the login node, so that X will be available for
>interactive jobs, but I'm not trying to do anything graphical here.  To
>make sure I'm being clear, the workflow we have is:
>
>1) SSH into login node (with X11 usually enabled but not intentionally
>used in this case)
>2) qsub -pe openmpi <numslots> some-mpi-job.sh
>3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
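>
>(The job script itself is minimal -- roughly the following, with
>"some-mpi-task" standing in for our actual binary:
>
>  #!/bin/bash
>  #$ -S /bin/bash
>  #$ -cwd
>  mpirun -np $NSLOTS ./some-mpi-task
>
>...so nothing graphical, and no qrsh or qsh calls of my own.)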
>
>Does it seem weird to find mpirun referencing qsh.c here?  Now that you
>mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun
>"spawns remote processes via 'qrsh' so that SGE can control and monitor
>them"... so why *is* it talking about qsh in my case?  Could this be the
>root of the problem?
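>
>(One thing I can check, in case it's relevant -- whether this Open MPI
>build even has its gridengine components compiled in:
>
>  ompi_info | grep gridengine
>
>If that prints nothing, mpirun presumably can't use the qrsh-based
>startup that FAQ describes.  Just a guess at something worth verifying.)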
>
>Also, this PE is set up like this:
>
>pe_name            openmpi
>slots              999
>user_lists         NONE
>xuser_lists        NONE
>start_proc_args    /bin/true
>stop_proc_args     /bin/true
>allocation_rule    $round_robin
>control_slaves     TRUE
>job_is_first_task  FALSE
>urgency_slots      min
>accounting_summary FALSE
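>
>For reference, that listing is what "qconf -sp openmpi" shows, and
>"qconf -mp openmpi" is the command to edit it if any of those settings
>need changing:
>
>  qconf -sp openmpi   # show the PE definition (the listing above)
>  qconf -mp openmpi   # edit the PE definition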
>
>Thanks Reuti!
>
>Jesse
>
>
>
>On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>
>>Hi,
>>
>>On 18.06.2014 at 20:45, Connell, Jesse wrote:
>>
>>> We've been having a seemingly-random problem with MPI jobs on our
>>> install of Open Grid Scheduler 2011.11.  For some varying length of
>>> time from when the execd processes start up, MPI jobs running across
>>> multiple hosts will run fine.  Then, at some point, they will start
>>> failing at the mpirun step, and will keep failing until execd is
>>> restarted on the affected hosts.  They then work again, before
>>> eventually failing, and so on.  If I increase the SGE debug level
>>> before calling mpirun in my job script, I see things like this:
>>> 
>>>   842  11556         main     ../clients/qsh/qsh.c 1840 executing task of
>>
>>qsh? Are you using an X11 session?
>>
>>-- Reuti
>>
>>
>>> job 6805430 failed: failed sending task to execd@<hostname>: got send error
>>> 
>>> ...but nothing more interesting that I can see.  (I also get the same
>>> sort of "send error" message from mpirun itself if I use its --mca
>>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>>> else.)  Jobs that run on multiple cores on a single host are fine, but
>>> ones that try to start up workers on additional hosts fail.  Since
>>> restarting execd makes it work again, I assumed the problem was on that
>>> end, and tried dumping verbose log output for execd (using dl 10) to a
>>> file.  But, despite many thousands of lines, I can't spot anything that
>>> looks different when the jobs start failing from when they are
>>> working, as far as execd is concerned.  Ordinary grid jobs (no
>>> parallel environment) continue to run fine no matter what.
>>> 
>>> So for now, I'm stumped!  Any other ideas of what to look for, or
>>> thoughts of what the unpredictable off-and-on behavior could possibly
>>> mean?  Thanks in advance,
>>> 
>>> Jesse
>>> 
>>> P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.
>>> 
>>> 
>>
>
>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
