Hello Reuti,

We will try that, but we have also found another issue.
We also see that our SoGE fails and fails over from master to shadow, and vice versa, at the same time that this switch in job ID occurs:

03/13/2016 08:04:21| main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
03/13/2016 08:04:21| main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
03/13/2016 08:04:21| main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
03/13/2016 08:04:21| main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
03/13/2016 08:04:21| main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)

From qacct:

jobnumber    351331
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:04 2016

jobnumber    351488
start_time   Sun Mar 13 08:04:34 2016
end_time     Sun Mar 13 08:05:05 2016

jobnumber    351511
start_time   Sun Mar 13 08:04:54 2016
end_time     Sun Mar 13 08:05:05 2016

jobnumber    351410
start_time   Sun Mar 13 08:04:29 2016
end_time     Sun Mar 13 08:05:07 2016

jobnumber    351355
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:07 2016

jobnumber    351502
start_time   Sun Mar 13 08:04:49 2016
end_time     Sun Mar 13 08:05:08 2016

jobnumber    9999253
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:08 2016

start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:53 2016

jobnumber    9999337
start_time   Sun Mar 13 08:05:43 2016
end_time     Sun Mar 13 08:05:53 2016

jobnumber    9999254
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:57 2016

The job ID switch coincides in time with the SoGE failure and the subsequent failover to the other node.
Basically, we now need to understand why SoGE fails...

We would appreciate any tips and advice on this.
Thank You.

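One way to check this timing correlation across the whole accounting file, rather than by eye, is to sort the qacct records by start_time and flag any large discontinuity in jobnumber. The following is only a minimal sketch: the function names and the threshold of 1000 IDs are illustrative assumptions, and it assumes the input is plain text in the key/value layout shown above. Records that carry no jobnumber line are skipped rather than guessed.

```python
from datetime import datetime

# Timestamp layout as it appears in the qacct output above,
# e.g. "Sun Mar 13 08:04:28 2016".
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# A few records copied from the output above, including the one
# that has no jobnumber line.
SAMPLE = """\
jobnumber    351502
start_time   Sun Mar 13 08:04:49 2016
end_time     Sun Mar 13 08:05:08 2016
jobnumber    351511
start_time   Sun Mar 13 08:04:54 2016
end_time     Sun Mar 13 08:05:05 2016
jobnumber    9999253
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:08 2016
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:53 2016
jobnumber    9999254
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:57 2016
"""


def parse_records(text):
    """Return a list of (jobnumber, start_time) tuples from key/value lines."""
    records = []
    current_id = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # separators, blank lines, single-token lines
        key, value = parts
        if key == "jobnumber":
            current_id = int(value)
        elif key == "start_time" and current_id is not None:
            started = datetime.strptime(value.strip(), TIME_FORMAT)
            records.append((current_id, started))
            current_id = None  # a later record must bring its own jobnumber
    return records


def find_jumps(records, threshold=1000):
    """Return (previous_id, next_id, next_start) where the ID gap exceeds threshold.

    threshold is an arbitrary illustrative value; the absolute difference is
    used so that a wrap from 9999999 back to 1 is reported as well.
    """
    ordered = sorted(records, key=lambda rec: (rec[1], rec[0]))
    jumps = []
    for (prev_id, _), (next_id, next_start) in zip(ordered, ordered[1:]):
        if abs(next_id - prev_id) > threshold:
            jumps.append((prev_id, next_id, next_start))
    return jumps


for prev_id, next_id, when in find_jumps(parse_records(SAMPLE)):
    print("%d -> %d at %s" % (prev_id, next_id, when))
```

The reported timestamps can then be compared against the "starting up SGE" lines in the qmaster messages file to see whether every jump follows a qmaster start.
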
-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, March 08, 2016 2:25 PM
To: Yuri Burmachenko <yur...@mellanox.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month

> On 08.03.2016 at 10:59, Yuri Burmachenko <yur...@mellanox.com> wrote:
>
> Hello Reuti,
>
> See below:
>
> Job ID     Job schedule time
> 97453      29-02-2016_03:18:55
> 97454      29-02-2016_03:18:57
> 9999563    29-02-2016_03:23:44
> 9999564    29-02-2016_03:23:44
> 9999565    29-02-2016_03:23:44
> ....
> 9999999    29-02-2016_03:27:34
> 1          29-02-2016_03:27:35
>
> Any idea what could be the root cause and/or where to look?

Interesting. One could try `incron` to spot any access to the file "jobseqnum".

-- Reuti

>
> Thanks.
>
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Sunday, March 06, 2016 7:27 PM
> To: Yuri Burmachenko <yur...@mellanox.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset
> very fast 9999999 ==> 1 - 6-7 times in a month
>
> Hi,
>
> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
>
>> Hello to distinguished forum members,
>>
>> Recently we have found that something is wrong with SGE Job IDs - they are
>> getting reset very fast: 6-7 times in a month.
>> We don't really have so many jobs executed in such a short period of time.
>>
>> We use JobId (via qacct) as a primary key for different home-made analytic
>> tools, and this very quick jobId switch impairs the reliability of the tools.
>>
>> This started after we had a full electricity shutdown during which we
>> halted all our systems including SGE master/shadow and its execution hosts.
>
> To elaborate on this: when it suddenly jumps to 99999999, what was the highest
> JOB_ID which was recorded before that skip in the accounting file?
>
> -- Reuti
>
>
>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to
>> "9999999" and then something (related or not) restarts SGE setting that
>> jobid.
>>
>> Any tips and advice on where to look for the root cause will be greatly
>> appreciated.
>> Thank You.
>>
>>
>>
>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>> Work: +972 74 7236386 | Cell +972 54 7542188 | Fax: +972 4 959 3245
>> Follow us on Twitter and Facebook
>>
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
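
Regarding the suggestion in the thread to watch the file "jobseqnum": where `incron` is not installed, a rough stand-in is to poll the file and log every change together with a timestamp, so that the log can later be lined up with the qmaster restarts. This is a sketch under stated assumptions, not an equivalent of `incron`: it polls instead of using inotify, so a change that is overwritten between two polls is missed, and it does not tell which process wrote the file. The function names and the one-second interval are illustrative, and it assumes the file holds a short text value.

```python
import os
import time
from datetime import datetime


def snapshot(path):
    """Return (mtime_ns, content) for the file, or None if it cannot be read."""
    try:
        mtime_ns = os.stat(path).st_mtime_ns
        with open(path, "r", errors="replace") as handle:
            content = handle.read().strip()
    except OSError:
        return None
    return (mtime_ns, content)


def watch(path, interval=1.0, max_polls=None, log=print):
    """Poll path and log each change of modification time or content.

    Runs until interrupted when max_polls is None; otherwise stops after
    max_polls polls. Returns the list of (timestamp, snapshot) events seen.
    """
    events = []
    last = object()  # sentinel: the first snapshot is always reported
    polls = 0
    while max_polls is None or polls < max_polls:
        current = snapshot(path)
        if current != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            events.append((stamp, current))
            log("%s %s -> %r" % (stamp, path, current))
            last = current
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
    return events
```

It could be started on the qmaster host with, for example, `watch(os.path.join(os.environ["SGE_ROOT"], "default/spool/qmaster/jobseqnum"))`, using the path mentioned earlier in the thread, and left running until the next job ID jump.
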