1- We had, indeed,

$ qconf -sconf
administrator_mail           none

hence no messages. I've now enabled it (mail goes to user 'sge', and
~sge/.forward sends it on to 'real' users).
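For anyone wanting to check the same setting, here is a minimal sketch that pulls the administrator_mail value out of the `qconf -sconf` output. The helper name admin_mail_from_conf and the sample line are only illustrative; on a real cluster the input would come from qconf itself.

```shell
#!/bin/sh
# Hypothetical helper: reads `qconf -sconf` output on stdin and prints
# the value of the administrator_mail entry.
admin_mail_from_conf() {
    awk '$1 == "administrator_mail" { print $2 }'
}

# On a cluster (needs qconf in PATH) one would run:
#   qconf -sconf | admin_mail_from_conf
# Here a sample line stands in for the real output.
sample='administrator_mail           none'
if [ "$(printf '%s\n' "$sample" | admin_mail_from_conf)" = "none" ]; then
    echo "administrator_mail is 'none': no SGE error mails will be sent"
fi
```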

2- I was able to isolate the problem: it seems to be caused by a race
condition for job arrays that use a PE (combo of -t and -pe). In 1% of
cases, esp. on a loaded system (~95% usage for ~3,000 CPUs/slots), the job
script is not copied to the compute node pool disk (that has plenty of
room) in time: the task gets started before the script is copied.

The workaround is to use '-b y' so that SGE does not copy the script to the
local pool, but then users must specify the full path of the job script,
can't use embedded directives (#$), and can't modify the script while the
job array is active.

(see
http://comments.gmane.org/gmane.comp.clustering.gridengine.users/19763)
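As a rough sketch of what such a submission could look like: the helper below (build_qsub_cmd, a made-up name) only prints the qsub command line instead of running it, and refuses a script path that is not absolute. The PE name 'mpi', the slot count and the script path are placeholders, not our actual setup. Since embedded #$ lines are not parsed with '-b y', every option has to be given on the command line.

```shell
#!/bin/sh
# Hypothetical helper: print a qsub command for an array job that uses a
# PE and is submitted in binary mode (-b y), so the script is not copied
# to the exec host's spool area.
# Usage: build_qsub_cmd SCRIPT TASK_RANGE PE_NAME SLOTS
build_qsub_cmd() {
    bq_script=$1
    bq_tasks=$2
    bq_pe=$3
    bq_slots=$4
    case $bq_script in
        /*) ;;  # absolute path: fine
        *)
            echo "error: with -b y give the full path of the script: $bq_script" >&2
            return 1
            ;;
    esac
    # All options go on the command line; #$ directives inside the
    # script are ignored in binary mode.
    printf 'qsub -b y -t %s -pe %s %s %s\n' \
        "$bq_tasks" "$bq_pe" "$bq_slots" "$bq_script"
}

# Dry run with placeholder values (nothing is submitted):
build_qsub_cmd /home/user/jobs/task.sh 1-1000 mpi 8
```

The script must also stay unchanged and readable on the compute nodes for as long as tasks of the array are still pending, since each task runs it in place.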

We run OGS/Grid Engine 2011.11 as distributed by Rocks 6.1.1 (Sand Boa) -
is there an easy fix (patch) for that race condition (w/out installing a
diff ver of SGE)?


Thanks, cheers,
    Sylvain
--


On Sat, Apr 16, 2016 at 7:06 AM, Reuti <[email protected]> wrote:

> Hi,
>
> On 24.03.2016 at 18:32, Korzennik, Sylvain wrote:
>
> > One of our users' job arrays produces a QERROR for ~1% of its tasks. The
> > reporting file just shows a corresponding "job never ran -> schedule it
> > again", but that does not help track down the cause.
>
> The configured admin email address will get a copy where an error may show
> up like "Can't open input/output file" or "No such file or directory" which
> might point to an NFS bottleneck.
>
> Can you please check the emails SGE sent to:
>
> $ qconf -sconf
> ...
> administrator_mail           reuti
>
> -- Reuti
>
>
> > Any clue where we should look? Eventually his job (1000 tasks) hoses a
> fraction of the slots in that queue.
> >
> >   Cheers,
> >     Sylvain
> > --
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>