On 6/20/14, 10:32 AM, "Reuti" <[email protected]> wrote:

>Am 19.06.2014 um 15:23 schrieb Connell, Jesse:
>
>> Huh, now here's something interesting, too... I ran strace on mpirun to
>> try to find out more, and I saw that just before "got send error" it
>> tries to connect to port 537 on the worker system.  If I nmap scan
>> across the queue I'm testing with, the two machines I've been throwing
>> MPI jobs at to debug this (host04 and host05 below) show port 537 as
>> closed, while most of the rest show it as open.  The three other ones
>> near the end are systems that previously had this problem but that I
>> didn't restart execd on since jobs were currently running.  (So,
>> presumably MPI jobs would still fail on those too, but would work for a
>> while on the rest if I tried.)
>> 
>> # Nmap 5.51 scan initiated Thu Jun 19 09:14:39 2014 as: nmap -R -p 537 -oG - 10.241.71.11-26
>> Host: 10.241.71.11 (host01.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.12 (host02.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.13 (host03.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.14 (host04.fqdn)     Ports: 537/closed/tcp/////
>> Host: 10.241.71.15 (host05.fqdn)     Ports: 537/closed/tcp/////
>> Host: 10.241.71.16 (host06.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.17 (host07.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.18 (host08.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.19 (host09.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.20 (host10.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.21 (host11.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.22 (host12.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.23 (host13.fqdn)     Ports: 537/closed/tcp/////
>> Host: 10.241.71.24 (host14.fqdn)     Ports: 537/closed/tcp/////
>> Host: 10.241.71.25 (host15.fqdn)     Ports: 537/open/tcp/////
>> Host: 10.241.71.26 (host16.fqdn)     Ports: 537/closed/tcp/////
>> # Nmap done at Thu Jun 19 09:14:39 2014 -- 16 IP addresses (16 hosts up) scanned in 0.04 seconds
>> 
>> 
>> 
>> ... so it looks like when things go "bad," something changes about how
>> execd is (or is not) listening on port 537.
>
>And port 537 is the intended one for your SGE installation? Any firewall
>on this machine?

Yep, we're using port 537 (I just double-checked and /etc/services shows
"sge_execd       537/tcp"), and no, no firewall.  I had previously also
tried "qping bungee04 537 execd 1" and it worked fine when these jobs
worked, and failed when they didn't.  So it does look like something
totally separate from any PE/MPI issues after all.
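
For reference, the checks I've been leaning on look roughly like this
(host04 being one of the currently-affected nodes from the scan above):

  # confirm which port execd is supposed to use
  grep sge_execd /etc/services       # shows: sge_execd  537/tcp
  # ask execd directly whether it's responding
  qping host04 537 execd 1
  # and the port scan quoted above
  nmap -R -p 537 -oG - 10.241.71.11-26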

>> Does it possibly isolate my problem from anything related to MPI or my
>> PE?  But why do ordinary qsub'd jobs still run fine, even on those
>> "broken" systems?
>
>In principle you could set up the firewall to allow traffic on port 537
>only from the headnode of the cluster, hence `qsub`ed jobs would still end
>up there fine. But such a setup would block traffic between the nodes,
>which is needed when `qrsh -inherit ...` is issued.

Well, that isn't supposed to be my setup, but that description actually
fits this behavior pretty closely!  If the traffic goes straight from the
job's master node to the workers and that path is blocked, the parallel
job fails; but if traffic from the head node is still allowed, those
kinds of jobs keep working.  Now that I think of it, I can't remember
where I submitted those other test jobs from, or where I ran qping from,
but all of our execution hosts are also submit hosts.  So, when I catch
it failing again, I'll try sending jobs (and qping) from both places and
see how it behaves.
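
Concretely, I'm picturing something like this once a node (say host04
again) is in the broken state; the hostnames here are just placeholders
for whichever nodes are affected at the time:

  # from the head node:
  qping host04 537 execd 1
  qrsh -l hostname=host04 hostname

  # then the same from another exec host that is also a submit host
  # (host01, say):
  qping host04 537 execd 1
  qrsh -l hostname=host04 hostname

  # plus an MPI test job submitted from each place:
  qsub -pe openmpi 4 some-mpi-job.sh

If something really is only letting the head node through, the qping and
qrsh runs from the head node should succeed while the ones from host01
fail.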

This still seems very mysterious.  (Any thoughts on what in the OS, if
not execd itself, might start blocking execd's traffic after a while, yet
allow it again once execd is restarted?  I'm not using SELinux, which is
my usual scapegoat!)  But at least I've got some more things to try out.
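
The sorts of things I'm planning to look at on an affected node next time
it happens (plain CentOS 6 tools, nothing SGE-specific, so treat this as
a guess list rather than anything definitive):

  # is execd even still listening on port 537?
  netstat -tlnp | grep ':537'
  # any firewall rules I don't know about?
  iptables -L -n -v
  # anything suspicious in the kernel or system logs around the time it
  # broke?
  dmesg | tail -50
  tail -100 /var/log/messages
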
Thanks again!

Jesse

P.S.  I just saw your note about qrsh -> qsh; good to know!

>
>-- Reuti
>
>
>>  This all seems even
>> more bizarre to me now :)
>> 
>> Jesse
>> 
>> 
>> On 6/19/14, 8:34 AM, "Connell, Jesse" <[email protected]> wrote:
>> 
>>> Well, not intentionally, anyway!  We generally have "ForwardX11 yes"
>>> set when SSHing into the login node, so that X will be available for
>>> interactive jobs, but I'm not trying to do anything graphical here.
>>> To make sure I'm being clear, the workflow we have is:
>>> 
>>> 1) SSH into login node (with X11 usually enabled but not intentionally
>>> used in this case)
>>> 2) qsub -pe openmpi <numslots> some-mpi-job.sh
>>> 3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>>> 
>>> Does it seem weird to find mpirun referencing qsh.c here?  Now that you
>>> mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun
>>> "spawns remote processes via 'qrsh' so that SGE can control and monitor
>>> them"... so why *is* it talking about qsh in my case?  Could this be
>>> the root of the problem?
>>> 
>>> Also, this PE is set up like this:
>>> 
>>> pe_name            openmpi
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $round_robin
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> 
>>> Thanks Reuti!
>>> 
>>> Jesse
>>> 
>>> 
>>> 
>>> On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Am 18.06.2014 um 20:45 schrieb Connell, Jesse:
>>>> 
>>>>> We've been having a seemingly-random problem with MPI jobs on our
>>>>> install of Open Grid Scheduler 2011.11.  For some varying length of
>>>>> time from when the execd processes start up, MPI jobs running across
>>>>> multiple hosts will run fine.  Then, at some point, they will start
>>>>> failing at the mpirun step, and will keep failing until execd is
>>>>> restarted on the affected hosts.  They then work again, before
>>>>> eventually failing, and so on.  If I increase the SGE debug level
>>>>> before calling mpirun in my job script, I see things like this:
>>>>> 
>>>>>  842  11556         main     ../clients/qsh/qsh.c 1840 executing task
>>>>> of
>>>> 
>>>> qsh? Are you using an X11 session?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> job 6805430 failed: failed sending task to execd@<hostname>: got send
>>>>> error
>>>>> 
>>>>> ...but nothing more interesting that I can see.  (I also get the
>>>>> same sort of "send error" message from mpirun itself if I use its
>>>>> --mca ras_gridengine_debug --mca ras_gridengine_verbose flags, but
>>>>> nothing else.)  Jobs that run on multiple cores on a single host are
>>>>> fine, but ones that try to start up workers on additional hosts
>>>>> fail.  Since restarting execd makes it work again, I assumed the
>>>>> problem was on that end, and tried dumping verbose log output for
>>>>> execd (using dl 10) to a file.  But, despite many thousands of
>>>>> lines, I can't spot anything that looks different when the jobs
>>>>> start failing from when they are working, as far as execd is
>>>>> concerned.  Ordinary grid jobs (no parallel environment) continue to
>>>>> run fine no matter what.
>>>>> 
>>>>> So for now, I'm stumped!  Any other ideas of what to look for, or
>>>>> thoughts of what the unpredictable off-and-on behavior could
>>>>> possibly mean?  Thanks in advance,
>>>>> 
>>>>> Jesse
>>>>> 
>>>>> P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
