Sorry for spamming the list a bit with this one problem, but I think I'm
getting close. First off: qping and parallel jobs get rejected no matter
which host they're sent from, head node or otherwise. Oh well. But more
news:
A while back we'd noticed that execd would sometimes show up as running
full-speed on one CPU core, but it seemed to behave fine otherwise, so we
didn't make investigating it a priority. Now I'm seeing that this behavior
exactly matches the problem I'm currently investigating. I did a little
more digging into the running execd on a "broken" and a "working" system
to compare them side by side...
host02 is currently exhibiting the broken behavior: "top" shows execd
running at 100% CPU, qping gets no response (even when run on the same
system), and parallel jobs don't work.
host03 is currently working: "top" shows execd idling as expected, qping
gets a response, parallel jobs work.
Running "ps -AL x | grep sge | grep -v grep" on host02:
32335 32335 ?        Sl     0:06 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32336 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32337 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32338 ?        Rl   201:15 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
32335 32339 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
...and on host03:
10738 10738 ?        Sl     0:01 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10739 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10740 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10741 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
10738 10742 ?        Sl     0:00 /mnt/nokrb/gridengine/bin/linux-x64/sge_execd
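In case it's useful to anyone reproducing this, a quick way to pin down
which thread is the busy one (using host02's parent PID from the output
above; the second column of "ps -AL" is the thread/LWP ID):
top -H -p 32335
ps -L -o pid,lwp,pcpu,stat,time,cmd -p 32335
The first gives a live per-thread CPU view, the second a one-shot listing.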
Running strace on the parent process on both systems, I see each thread
cycling through this sort of thing:
[pid 10742] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 10742] futex(0x7f870e506e00, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10742] futex(0x7f870e506e64, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 35445, {1403280697, 400749000}, ffffffff <unfinished ...>
But! On host02, in addition to those things, I also see this rushing by,
repeating much more quickly:
[pid 32338] poll([{fd=3, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}], 2, 1000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
[pid 32338] accept(3, 0x7ff5a2a04990, [16]) = -1 EINVAL (Invalid argument)
...where 32338 was the thread we noticed running full-blast in top's
display. The arguments to accept() are exactly the same every time.
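To put a rough number on how fast that thread is spinning, and to see what
fd 3 actually is, something along these lines should do it on the broken
host (PIDs taken from the host02 output above):
ls -l /proc/32335/fd/3
netstat -tlnp | grep :537
strace -c -e trace=poll,accept -p 32338
The /proc entry should show the socket behind fd 3, netstat should show
whether execd still appears as listening on 537 at all (which would line up
with the nmap results quoted below), and the strace summary (after a few
seconds and a Ctrl-C) gives call and error counts for just poll() and
accept().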
At this point should I file a bug report with OGS? I can't see how
anything in our setup could reasonably cause this, and having it get stuck
on that same function call forever sure sounds like a bug to me.
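If a bug report is the way to go, I'm assuming a backtrace of the spinning
thread would help; assuming gdb is available on the node, something like
this against the host02 process should capture where it's stuck (more
useful if gridengine debug symbols are installed):
gdb -p 32335 -batch -ex "thread apply all bt"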
Jesse
On 6/20/14, 11:11 AM, "Connell, Jesse" <[email protected]> wrote:
>On 6/20/14, 10:32 AM, "Reuti" <[email protected]> wrote:
>
>>Am 19.06.2014 um 15:23 schrieb Connell, Jesse:
>>
>>> Huh, now here's something interesting, too... I ran strace on mpirun
>>> to try to find out more, and I saw that just before "got send error"
>>> it tries to connect to port 537 on the worker system. If I nmap scan
>>> across the queue I'm testing with, the two machines I've been throwing
>>> MPI jobs at to debug this (host04 and host05 below) show port 537 as
>>> closed, while most of the rest show it as open. The three other ones
>>> near the end are systems that previously had this problem but that I
>>> didn't restart execd on since jobs were currently running. (So,
>>> presumably MPI jobs would still fail on those too, but would work for
>>> a while on the rest if I tried.)
>>>
>>> # Nmap 5.51 scan initiated Thu Jun 19 09:14:39 2014 as: nmap -R -p 537 -oG - 10.241.71.11-26
>>> Host: 10.241.71.11 (host01.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.12 (host02.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.13 (host03.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.14 (host04.fqdn) Ports: 537/closed/tcp/////
>>> Host: 10.241.71.15 (host05.fqdn) Ports: 537/closed/tcp/////
>>> Host: 10.241.71.16 (host06.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.17 (host07.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.18 (host08.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.19 (host09.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.20 (host10.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.21 (host11.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.22 (host12.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.23 (host13.fqdn) Ports: 537/closed/tcp/////
>>> Host: 10.241.71.24 (host14.fqdn) Ports: 537/closed/tcp/////
>>> Host: 10.241.71.25 (host15.fqdn) Ports: 537/open/tcp/////
>>> Host: 10.241.71.26 (host16.fqdn) Ports: 537/closed/tcp/////
>>> # Nmap done at Thu Jun 19 09:14:39 2014 -- 16 IP addresses (16 hosts up) scanned in 0.04 seconds
>>>
>>>
>>>
>>> ... so it looks like when things go "bad," something changes about how
>>> execd is (or is not) listening on port 537.
>>
>>And port 537 is the intended one for your SGE installation? Any firewall
>>on this machine?
>
>Yep, we're using port 537 (I just double-checked and /etc/services shows
>"sge_execd 537/tcp"), and no, no firewall. I had previously also
>tried "qping bungee04 537 execd 1" and it worked fine when these jobs
>worked, and failed when they didn't. So it does look like something
>totally separate from any PE/MPI issues after all.
>
>>> Does it possibly isolate my problem from anything related to MPI or
>>> my PE? But why do ordinary qsub'd jobs still run fine, even on those
>>> "broken" systems?
>>
>>In principle you could set up the firewall to allow traffic on port 537
>>only from the headnode of the cluster, hence `qsub`ed jobs will end up
>>there fine. But it could still block traffic between the nodes, which is
>>what gets used when `qrsh -inherit ...` is issued.
>
>Well, that isn't supposed to be my setup, but that description actually
>fits this behavior pretty closely! If the traffic goes straight from the
>job's master node to the workers and that path is blocked, the parallel
>job fails, but if traffic from the head node is still allowed, those kinds
>of jobs keep working... now that I think of it, I can't remember where I
>submitted those other test jobs from or ran qping from, but all of our
>execution hosts are also submit hosts... so, when I catch it failing
>again, I'll try sending jobs from both places and see how it behaves.
>
>This still seems very mysterious (any thoughts on what in the OS, if not
>execd itself, might block traffic to execd after a while, but allow it
>again once execd restarts? I'm not using SELinux, which is my usual
>scapegoat!). But at least I've got some more things to try out.
>Thanks again!
>
>Jesse
>
>P.S. I just saw your note about qrsh -> qsh; good to know!
>
>>
>>-- Reuti
>>
>>
>>> This all seems even
>>> more bizarre to me now :)
>>>
>>> Jesse
>>>
>>>
>>> On 6/19/14, 8:34 AM, "Connell, Jesse" <[email protected]> wrote:
>>>
>>>> Well, not intentionally, anyway! We generally have "ForwardX11 yes"
>>>> set when SSHing into the login node, so that X will be available for
>>>> interactive jobs, but I'm not trying to do anything graphical here.
>>>> To make sure I'm being clear, the workflow we have is:
>>>>
>>>> 1) SSH into login node (with X11 usually enabled but not intentionally
>>>> used in this case)
>>>> 2) qsub -pe openmpi <numslots> some-mpi-job.sh
>>>> 3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>>>>
>>>> Does it seem weird to find mpirun referencing qsh.c here? Now that
>>>> you mention it, http://www.open-mpi.org/faq/?category=sge says that
>>>> mpirun "spawns remote processes via 'qrsh' so that SGE can control
>>>> and monitor them"... so why *is* it talking about qsh in my case?
>>>> Could this be the root of the problem?
>>>>
>>>> Also, this PE is set up like this:
>>>>
>>>> pe_name openmpi
>>>> slots 999
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args /bin/true
>>>> stop_proc_args /bin/true
>>>> allocation_rule $round_robin
>>>> control_slaves TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> accounting_summary FALSE
>>>>
>>>> Thanks Reuti!
>>>>
>>>> Jesse
>>>>
>>>>
>>>>
>>>> On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Am 18.06.2014 um 20:45 schrieb Connell, Jesse:
>>>>>
>>>>>> We've been having a seemingly-random problem with MPI jobs on our
>>>>>> install of Open Grid Scheduler 2011.11. For some varying length of
>>>>>> time from when the execd processes start up, MPI jobs running
>>>>>> across multiple hosts will run fine. Then, at some point, they
>>>>>> will start failing at the mpirun step, and will keep failing until
>>>>>> execd is restarted on the affected hosts. They then work again,
>>>>>> before eventually failing, and so on. If I increase the SGE debug
>>>>>> level before calling mpirun in my job script, I see things like
>>>>>> this:
>>>>>>
>>>>>> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of
>>>>>
>>>>> qsh? Are you using an X11 session?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> job 6805430 failed: failed sending task to execd@<hostname>: got
>>>>>> send error
>>>>>>
>>>>>> ...but nothing more interesting that I can see. (I also get the
>>>>>> same sort of "send error" message from mpirun itself if I use its
>>>>>> --mca ras_gridengine_debug --mca ras_gridengine_verbose flags, but
>>>>>> nothing else.) Jobs that run on multiple cores on a single host
>>>>>> are fine, but ones that try to start up workers on additional
>>>>>> hosts fail. Since restarting execd makes it work again, I assumed
>>>>>> the problem was on that end, and tried dumping verbose log output
>>>>>> for execd (using dl 10) to a file. But, despite many thousands of
>>>>>> lines, I can't spot anything that looks different when the jobs
>>>>>> start failing from when they are working, as far as execd is
>>>>>> concerned. Ordinary grid jobs (no parallel environment) continue
>>>>>> to run fine no matter what.
>>>>>>
>>>>>> So for now, I'm stumped! Any other ideas of what to look for, or
>>>>>> thoughts of what the unpredictable off-and-on behavior could
>>>>>> possibly mean? Thanks in advance,
>>>>>>
>>>>>> Jesse
>>>>>>
>>>>>> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users