On 19.06.2014 at 14:34, Connell, Jesse wrote:

> Well, not intentionally, anyway! We generally have "ForwardX11 yes" set
> when SSHing into the login node, so that X will be available for
> interactive jobs, but I'm not trying to do anything graphical here. To
> make sure I'm being clear, the workflow we have is:
> 1) SSH into login node (with X11 usually enabled but not intentionally
>    used in this case)
> 2) qsub -pe openmpi <numslots> some-mpi-job.sh
> 3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task"
>
> Does it seem weird to find mpirun referencing qsh.c here?

Sorry, I overlooked the symlink of `qrsh` to `qsh`. Hence it's the right application.

-- Reuti

> Now that you mention it, http://www.open-mpi.org/faq/?category=sge says
> that mpirun "spawns remote processes via 'qrsh' so that SGE can control
> and monitor them"... so why *is* it talking about qsh in my case? Could
> this be the root of the problem?
>
> Also, this PE is set up like this:
>
> pe_name            openmpi
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
> Thanks, Reuti!
>
> Jesse
>
>
> On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:
>
>> Hi,
>>
>> On 18.06.2014 at 20:45, Connell, Jesse wrote:
>>
>>> We've been having a seemingly random problem with MPI jobs on our
>>> install of Open Grid Scheduler 2011.11. For some varying length of
>>> time after the execd processes start up, MPI jobs running across
>>> multiple hosts run fine. Then, at some point, they start failing at
>>> the mpirun step, and keep failing until execd is restarted on the
>>> affected hosts. They then work again, before eventually failing, and
>>> so on. If I increase the SGE debug level before calling mpirun in my
>>> job script, I see things like this:
>>>
>>> 842 11556 main ../clients/qsh/qsh.c 1840 executing task of
>>
>> qsh? Are you using an X11 session?
>>
>> -- Reuti
>>
>>> job 6805430 failed: failed sending task to execd@<hostname>: got send
>>> error
>>>
>>> ...but nothing more interesting that I can see. (I also get the same
>>> sort of "send error" message from mpirun itself if I use its --mca
>>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>>> else.) Jobs that run on multiple cores on a single host are fine, but
>>> ones that try to start up workers on additional hosts fail. Since
>>> restarting execd makes it work again, I assumed the problem was on
>>> that end, and tried dumping verbose log output for execd (using dl 10)
>>> to a file. But despite many thousands of lines, I can't spot anything
>>> that looks different, as far as execd is concerned, between when the
>>> jobs are failing and when they are working. Ordinary grid jobs (no
>>> parallel environment) continue to run fine no matter what.
>>>
>>> So for now, I'm stumped! Any other ideas of what to look for, or
>>> thoughts on what the unpredictable off-and-on behavior could mean?
>>> Thanks in advance,
>>>
>>> Jesse
>>>
>>> P.S. This is on CentOS 6, with its openmpi 1.5.4 package.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
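
[Editor's note: for readers reproducing this setup, a minimal sketch of the
submission script described in steps 2 and 3 of the workflow above. The PE
name, script name, and task name come from the thread; the slot count,
shell, and other qsub options are assumptions.]

  #!/bin/sh
  # Sketch of some-mpi-job.sh as described in the thread.
  # "openmpi" is the PE shown above; "some-mpi-task" is a placeholder.
  #$ -S /bin/sh
  #$ -cwd
  #$ -pe openmpi 16

  # With control_slaves TRUE in the PE, an SGE-aware Open MPI build starts
  # its remote daemons via qrsh, so no hostfile is passed here; $NSLOTS is
  # set by SGE to the number of slots granted to the job.
  mpirun -np $NSLOTS ./some-mpi-task

Submission then matches step 2 above: qsub -pe openmpi <numslots>
some-mpi-job.sh (a -pe request on the command line takes precedence over
the one embedded in the script).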

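[Editor's note: two quick checks related to the points raised in the thread;
paths and environment depend on the local installation, so treat these as a
sketch rather than exact commands.]

  # Confirm the qrsh -> qsh symlink Reuti mentions, which explains why the
  # debug output references qsh.c even though mpirun calls qrsh. Assumes
  # qrsh is on the PATH of the login node.
  ls -l "$(which qrsh)"

  # Confirm the Open MPI build has gridengine support, so mpirun really
  # does spawn remote processes via qrsh as the FAQ linked above states.
  ompi_info | grep gridengine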