1- We did indeed have

  $ qconf -sconf
  administrator_mail           none

hence no messages. I've enabled it (-> user 'sge', with ~sge/.forward set to send to 'real' users).

2- I was able to isolate the problem: it seems to be caused by a race condition for job arrays that use a PE (the combination of -t and -pe). In about 1% of cases, especially on a loaded system (~95% usage of ~3,000 CPUs/slots), the job script is not copied in time to the compute node's pool disk (which has plenty of room): the task gets started before the script is copied.

The workaround is to use '-b y' so that SGE does not copy the script to the local pool, but then users must specify the full path of the job script, can't use embedded directives (#$), and can't modify the script while the job array is active.
(see http://comments.gmane.org/gmane.comp.clustering.gridengine.users/19763)

We run OGS/Grid Engine 2011.11 as distributed by Rocks 6.1.1 (Sand Boa). Is there an easy fix (patch) for that race condition, without installing a different version of SGE?

Thanks,
  Cheers,
    Sylvain
--

On Sat, Apr 16, 2016 at 7:06 AM, Reuti <[email protected]> wrote:

> Hi,
>
> On 24.03.2016 at 18:32, Korzennik, Sylvain wrote:
>
> > One of our users' job arrays causes a QERROR for ~1% of his tasks. The
> > reporting file just shows a corresponding "job never ran -> schedule it
> > again" but that does not help track down the cause.
>
> The configured admin email address will get a copy where an error may show
> up like "Can't open input/output file" or "No such file or directory" which
> might point to an NFS bottleneck.
>
> Can you please check the emails of SGE sent to:
>
> $ qconf -sconf
> ...
> administrator_mail           reuti
>
> -- Reuti
>
> > Any clue where we should look? Eventually his job (1000 tasks) hoses a
> > fraction of the slots in that queue.
> >
> > Cheers,
> >   Sylvain
> > --
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
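
For reference, the '-b y' workaround mentioned above could look roughly like the sketch below. It only builds and prints the qsub command line (a dry run) and does not submit anything. The script path, the PE name "mpi", the slot count and the task range are hypothetical placeholders, not values from our cluster. Since embedded directives are not read in this mode, every option has to be given on the command line.

```shell
#!/bin/sh
# Sketch only: build a qsub command line for a PE job array submitted with -b y.
# The path, PE name, slot count and task range used below are hypothetical.

build_qsub_cmd() {
    script=$1   # full path to the job script (with -b y it is not copied to the node)
    tasks=$2    # task range for -t, e.g. 1-1000
    pe=$3       # parallel environment name for -pe
    slots=$4    # number of slots requested from the PE

    # Refuse relative paths: with -b y the script must be given by its full path.
    case $script in
        /*) ;;
        *)  echo "error: job script must be a full path: $script" >&2
            return 1 ;;
    esac

    # Embedded "#$" lines are not parsed with -b y, so -t and -pe go here.
    printf 'qsub -b y -t %s -pe %s %s %s\n' "$tasks" "$pe" "$slots" "$script"
}

# Dry run: print the command instead of submitting it.
build_qsub_cmd /home/user/jobs/run_task.sh 1-1000 mpi 4
# prints: qsub -b y -t 1-1000 -pe mpi 4 /home/user/jobs/run_task.sh
```

The printed line can be checked and then run by hand. The script itself must not be edited while the array is active, because each task executes the file in place.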
