Well, not intentionally, anyway!  We generally have "ForwardX11 yes" set
when SSHing into the login node so that X will be available for
interactive jobs, but I'm not trying to do anything graphical here.  To
be clear, the workflow is:

1) SSH into the login node (with X11 forwarding enabled as usual, but
not intentionally used here)
2) qsub -pe openmpi <numslots> some-mpi-job.sh
3) some-mpi-job.sh then calls "mpirun -np $NSLOTS some-mpi-task" (a
stripped-down sketch is below)
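
(For reference, a stripped-down some-mpi-job.sh looks something like
this; the names and qsub options are illustrative rather than our exact
script:)

  #!/bin/sh
  #$ -S /bin/sh   # run the job script under /bin/sh
  #$ -cwd         # start in the directory the job was submitted from
  #$ -j y         # merge stderr into stdout
  # NSLOTS is set by SGE from the -pe request on the qsub command line
  mpirun -np $NSLOTS some-mpi-task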

Does it seem weird to find mpirun referencing qsh.c here?  Now that you
mention it, http://www.open-mpi.org/faq/?category=sge says that mpirun
"spawns remote processes via 'qrsh' so that SGE can control and monitor
them"... so why *is* it talking about qsh in my case?  Could this be the
root of the problem?
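
(Two things I can check on our side in the meantime; the verbosity
level here is a guess at something useful, not a known-good
incantation:)

  # confirm our Open MPI build was compiled with gridengine support
  ompi_info | grep gridengine

  # in the job script, ask Open MPI's process launch framework to say
  # how it is starting remote daemons (qrsh vs. ssh)
  mpirun --mca plm_base_verbose 10 -np $NSLOTS some-mpi-task

(Also, from a quick look at the GridEngine source, clients/qsh/qsh.c
seems to implement qrsh and qlogin as well as qsh, so the file name in
that trace may not by itself mean X11 is involved.)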

Also, here's how the PE is set up:

pe_name            openmpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
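
(That listing is pasted from "qconf -sp openmpi".  As far as I
understand it, "control_slaves TRUE" is what allows mpirun's qrsh
-inherit calls to start tasks under execd's control.  If we need to
change anything, I assume it's the usual:)

  # show the current PE definition (the listing above)
  qconf -sp openmpi

  # edit the PE definition in $EDITOR
  qconf -mp openmpi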

Thanks Reuti!

Jesse



On 6/18/14, 4:20 PM, "Reuti" <[email protected]> wrote:

>Hi,
>
>Am 18.06.2014 um 20:45 schrieb Connell, Jesse:
>
>> We've been having a seemingly-random problem with MPI jobs on our
>> install of Open Grid Scheduler 2011.11.  For some varying length of
>> time from when the execd processes start up, MPI jobs running across
>> multiple hosts will run fine.  Then, at some point, they will start
>> failing at the mpirun step, and will keep failing until execd is
>> restarted on the affected hosts.  They then work again, before
>> eventually failing, and so on.  If I increase the SGE debug level
>> before calling mpirun in my job script, I see things like this:
>> 
>>   842  11556         main     ../clients/qsh/qsh.c 1840 executing task of
>
>qsh? Are you using an X11 session?
>
>-- Reuti
>
>
>> job 6805430 failed: failed sending task to execd@<hostname>: got send error
>> 
>> ...but nothing more interesting that I can see.  (I also get the same
>> sort of "send error" message from mpirun itself if I use its --mca
>> ras_gridengine_debug --mca ras_gridengine_verbose flags, but nothing
>> else.)  Jobs that run on multiple cores on a single host are fine, but
>> ones that try to start up workers on additional hosts fail.  Since
>> restarting execd makes it work again, I assumed the problem was on
>> that end, and tried dumping verbose log output for execd (using dl 10)
>> to a file.  But, despite many thousands of lines, I can't spot
>> anything that looks different when the jobs start failing from when
>> they are working, as far as execd is concerned.  Ordinary grid jobs
>> (no parallel environment) continue to run fine no matter what.
>> 
>> So for now, I'm stumped!  Any other ideas of what to look for, or
>> thoughts of what the unpredictable off-and-on behavior could possibly
>> mean?  Thanks in advance,
>> 
>> Jesse
>> 
>> P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
