Sorry for spamming the list a bit with this one problem, but I think I'm
getting close.  First off: qping and parallel jobs are rejected no matter
which host I try them from, head node or otherwise.  Oh well.  But there's
more news:

A while back we'd noticed that execd would sometimes show up as running
full-speed on one CPU core, but it seemed to behave fine otherwise, so we
didn't make investigating it a priority.  Now I'm seeing that this behavior
is an exact match for the problem I'm currently investigating.  I did a
little more digging into the running execd on a "broken" and a "working"
system to compare them side by side...

host02 is currently exhibiting the broken behavior: "top" shows execd
running at 100% CPU, qping gets no response (even when run on the same
system), and parallel jobs don't work.
and parallel jobs don't work.

host03 is currently working: "top" shows execd idling as expected, qping
gets a response, parallel jobs work.

Running "ps -AL x | grep sge | grep -v grep" on host02:

32335 32335 ?        Sl     0:06 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32336 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32337 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32338 ?        Rl   201:15 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32339 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd

...and on host03:

10738 10738 ?        Sl     0:01 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10739 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10740 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10741 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10742 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd

Running strace on the parent process on both systems, I see each thread
cycling through this sort of thing:

[pid 10742] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 10742] futex(0x7f870e506e00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10742] futex(0x7f870e506e64, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 35445, {1403280697, 400749000}, ffffffff <unfinished ...>
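
As far as I can tell, that futex pattern by itself is just a thread idling
in a timed condition-variable wait and waking about once a second, which is
what I'd expect from a healthy execd.  For reference, here's a minimal
standalone sketch (mine, not OGS source; file name and timings made up)
of that kind of loop, which straces in much the same way:

/* Minimal sketch, not OGS source: a thread doing a periodic
 * pthread_cond_timedwait() like this shows up in strace as the
 * FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME wait ending in
 * ETIMEDOUT above -- i.e. waking once a second is normal, not a spin.
 * Build: cc -pthread cond_idle.c */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

int main(void)
{
    int i;

    pthread_mutex_lock(&lock);
    for (i = 0; i < 3; i++) {
        struct timespec ts;
        ts.tv_sec  = time(NULL) + 1;   /* wake up about a second from now */
        ts.tv_nsec = 0;
        if (pthread_cond_timedwait(&cond, &lock, &ts) == ETIMEDOUT)
            printf("timed out -- do periodic housekeeping, wait again\n");
    }
    pthread_mutex_unlock(&lock);
    return 0;
}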

But!  On host02, in addition to those things, I also see this rushing by,
repeating much more quickly:

[pid 32338] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}], 2, 1000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
[pid 32338] accept(3, 0x7ff5a2a04990, [16]) = -1 EINVAL (Invalid argument)

...where 32338 is the thread we noticed running full-blast in top's
display.  The arguments to accept() are exactly the same every time.
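
The EINVAL is the suspicious part: accept(2) documents EINVAL as "socket is
not listening for connections" (or a bad addrlen), and since whatever
triggers it never clears, the poll()/accept() loop never blocks -- poll()
keeps reporting fd 3 as ready, accept() keeps failing, and the thread pegs
a core.  That would also fit the earlier nmap result, since a socket that
has stopped listening looks exactly like a closed port 537.  Here's a
minimal sketch (mine, not OGS source; the never-listened socket is just a
stand-in for whatever state execd's fd 3 ends up in) that spins the same way:

/* Minimal sketch, not OGS source: accept() on a socket that is not
 * (or no longer) listening fails with EINVAL, and because poll() still
 * reports the fd as ready (POLLHUP here), a poll()/accept() loop like
 * the one above never sleeps -- it just spins at 100% CPU.
 * Build: cc accept_spin.c */
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    /* A TCP socket with no listen() call stands in for whatever state
     * the real execd listening socket ends up in. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct pollfd pfd;
    int i;

    for (i = 0; i < 3; i++) {              /* the real loop never exits */
        pfd.fd = fd;
        pfd.events = POLLIN | POLLPRI;
        poll(&pfd, 1, 1000);               /* returns at once, so no sleep */
        if (accept(fd, (struct sockaddr *)&addr, &len) == -1)
            printf("accept: %s\n", strerror(errno));   /* EINVAL */
    }
    return 0;
}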

At this point should I file a bug report with OGS?  I can't see how
anything in our setup could reasonably cause this, and having it get stuck
on that same function call forever sure sounds like a bug to me.

Jesse



On 6/20/14, 11:11 AM, "Connell, Jesse" <[email protected]> wrote:

>On 6/20/14, 10:32 AM, "Reuti" <[email protected]> wrote:
>
>>Am 19.06.2014 um 15:23 schrieb Connell, Jesse:
>>
>>> Huh, now here's something interesting, too... I ran strace on mpirun to
>>> try to find out more, and I saw that just before "got send error" it
>>>tries
>>> to connect to port 537 on the worker system.  If I nmap scan across the
>>> queue I'm testing with, the two machines I've been throwing MPI jobs at
>>>to
>>> debug this (host04 and host05 below) show port 537 as closed, while
>>>most
>>> of the rest show it as open.  The three other ones near the end are
>>> systems  that previously had this problem but that I didn't restart
>>>execd
>>> on since jobs were currently running.  (So, presumably MPI jobs would
>>> still fail on those too, but would work for a while on the rest if I
>>> tried.)
>>> 
>>> # Nmap 5.51 scan initiated Thu Jun 19 09:14:39 2014 as: nmap -R -p 537
>>>-oG
>>> - 10.241.71.11-26
>>> Host: 10.241.71.11 (host01.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.12 (host02.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.13 (host03.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.14 (host04.fqdn)    Ports: 537/closed/tcp/////
>>> Host: 10.241.71.15 (host05.fqdn)    Ports: 537/closed/tcp/////
>>> Host: 10.241.71.16 (host06.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.17 (host07.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.18 (host08.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.19 (host09.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.20 (host10.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.21 (host11.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.22 (host12.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.23 (host13.fqdn)    Ports: 537/closed/tcp/////
>>> Host: 10.241.71.24 (host14.fqdn)    Ports: 537/closed/tcp/////
>>> Host: 10.241.71.25 (host15.fqdn)    Ports: 537/open/tcp/////
>>> Host: 10.241.71.26 (host16.fqdn)    Ports: 537/closed/tcp/////
>>> # Nmap done at Thu Jun 19 09:14:39 2014 -- 16 IP addresses (16 hosts
>>>up)
>>> scanned in 0.04 seconds
>>> 
>>> 
>>> 
>>> ... so it looks like when things go "bad," something changes about how
>>> execd is (or is not) listening on port 537.
>>
>>And port 537 is the intended one for your SGE installation? Any firewall
>>on this machine?
>
>Yep, we're using port 537 (I just double-checked and /etc/services shows
>"sge_execd       537/tcp"), and no, no firewall.  I had previously also
>tried "qping bungee04 537 execd 1" and it worked fine when these jobs
>worked, and failed when they didn't.  So it does look like something
>totally separate from any PE/MPI issues after all.
>
>>>   Does it possibly isolate my
>>> problem from anything related to MPI or my PE?  But why do ordinary
>>>qsub'd
>>> jobs still run fine, even on those "broken" systems?
>>
>>In principle you could set up the firewall to allow traffic on port 537
>>only from the headnode of the cluster, hence `qsub`ed jobs would still end
>>up there fine. But that would also block traffic between the nodes, which
>>is exactly what is used when `qrsh -inherit ...` is issued.
>
>Well, that isn't supposed to be my setup, but that description actually
>fits this behavior pretty closely!  If the traffic goes straight from the
>master to the workers and that path is blocked, the parallel job fails,
>but if traffic from the head node is still allowed, those kinds of jobs
>keep working... now that I think of it, I can't remember where I submitted
>those other test jobs from or ran qping from, but all of our execution
>hosts are also submit hosts... so, when I catch it failing again, I'll try
>sending jobs from both places and see how it behaves.
>
>This still seems very mysterious (any thoughts on what in the OS, if not
>inside execd itself, might block execd's traffic after a while but allow
>it again once execd restarts?  I'm not using SELinux, which is my usual
>scapegoat!)  But at least I've got some more things to try out.
>Thanks again!
>
>Jesse
>
>P.S.  I just saw your note about qrsh -> qsh; good to know!
>
>>
>>-- Reuti
>>
>>
>>>  This all seems even
>>> more bizarre to me now :)
>>> 
>>> Jesse
>>> 
>>> 
>>> On 6/19/14, 8:34 AM, "Connell, Jesse" <[email protected]> wrote:
>>> 
>>>> Well, not intentionally, anyway!  We generally have "ForwardX11 yes"
>>>>set
>>>> when SSHing into the login node, so that X will be available for
>>>> interactive jobs, but I'm not trying to do anything graphical here.
>>>>To
>>>> make sure I'm being clear, the workflow we have is:
>>>> 
>>>> 1) SSH into login node (with X11 usually enabled but not intentionally
>>>> used in this case)
>>>> 2) qsub -pe openmpi <numslots> some-mpi-job.sh
>>>> 3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>>>> 
>>>> Does it seem weird to find mpirun referencing qsh.c here?  Now that
>>>>you
>>>> mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun
>>>> "spawns remote processes via 'qrsh' so that SGE can control and
>>>>monitor
>>>> them"... so why *is* it talking about qsh in my case?  Could this be
>>>>the
>>>> root of the problem?
>>>> 
>>>> Also, this PE is set up like this:
>>>> 
>>>> pe_name            openmpi
>>>> slots              999
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $round_robin
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary FALSE
>>>> 
>>>> Thanks Reuti!
>>>> 
>>>> Jesse
>>>> 
>>>> 
>>>> 
>>>> On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Am 18.06.2014 um 20:45 schrieb Connell, Jesse:
>>>>> 
>>>>>> We've been having a seemingly-random problem with MPI jobs on our
>>>>>> install
>>>>>> of Open Grid Scheduler 2011.11.  For some varying length of time
>>>>>>from
>>>>>> when
>>>>>> the execd processes start up, MPI jobs running across multiple hosts
>>>>>> will
>>>>>> run fine.  Then, at some point, they will start failing at the
>>>>>>mpirun
>>>>>> step, and will keep failing until execd is restarted on the affected
>>>>>> hosts.  They then work again, before eventually failing, and so on.
>>>>>>If
>>>>>> I
>>>>>> increase the SGE debug level before calling mpirun in my job script,
>>>>>>I
>>>>>> see
>>>>>> things like this:
>>>>>> 
>>>>>>  842  11556         main     ../clients/qsh/qsh.c 1840 executing
>>>>>>task
>>>>>> of
>>>>> 
>>>>> qsh? Are you using an X11 session?
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> job 6805430 failed: failed sending task to execd@<hostname>: got
>>>>>>send
>>>>>> error
>>>>>> 
>>>>>> ...but nothing more interesting that I can see.  (I also get the
>>>>>>same
>>>>>> sort
>>>>>> of "send error" message from mpirun itself if I use its --mca
>>>>>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>>>>>> else.)  Jobs that run on multiple cores on a single host are fine,
>>>>>>but
>>>>>> ones that try to start up workers on additional hosts fail.  Since
>>>>>> restarting execd makes it work again, I assumed the problem was on
>>>>>>that
>>>>>> end, and tried dumping verbose log output for execd (using dl 10) to
>>>>>>a
>>>>>> file.  But, despite many thousands of lines, I can't spot anything
>>>>>>that
>>>>>> looks different when the jobs start failing from when they are
>>>>>>working,
>>>>>> as
>>>>>> far as execd is concerned.  Ordinary grid jobs (no parallel
>>>>>> environment)
>>>>>> continue to run fine no matter what.
>>>>>> 
>>>>>> So for now, I'm stumped!  Any other ideas of what to look for, or
>>>>>> thoughts
>>>>>> of what the unpredictable off-and-on behavior could possibly mean?
>>>>>> Thanks
>>>>>> in advance,
>>>>>> 
>>>>>> Jesse
>>>>>> 
>>>>>> P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>
>
>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
