Hi,

> On 13.03.2016 at 08:53, Yuri Burmachenko <yur...@mellanox.com> wrote:
>
> Hello Reuti,
>
> We will try that, but we have also found another issue.
>
> We also see that our SoGE fails and fails over from master to shadow and
> vice versa at the same time that this switch in job ID occurs:

Is the spool directory shared between the qmaster and shadow daemons?

-- Reuti

> 03/13/2016 08:04:21| main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
> 03/13/2016 08:04:21| main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
> 03/13/2016 08:04:21| main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
> 03/13/2016 08:04:21| main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
> 03/13/2016 08:04:21| main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)
>
> From qacct:
> jobnumber    351331
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:04 2016
> jobnumber    351488
> start_time   Sun Mar 13 08:04:34 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351511
> start_time   Sun Mar 13 08:04:54 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351410
> start_time   Sun Mar 13 08:04:29 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351355
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351502
> start_time   Sun Mar 13 08:04:49 2016
> end_time     Sun Mar 13 08:05:08 2016
> jobnumber    9999253
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:08 2016
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999337
> start_time   Sun Mar 13 08:05:43 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999254
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:57 2016
>
> The job ID switch correlates in time with the SoGE failure and the
> subsequent failover to another node.
>
> Basically, we now need to understand why SoGE fails...
>
> We would appreciate any tips and advice on this.
> Thank you.
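
To pin down which accounting record is the first one after the switch, the qacct excerpt quoted above can be scanned mechanically. The following is only a minimal Python sketch: it assumes the `jobnumber` / `start_time` layout shown in the excerpt, and the names `parse_qacct`, `first_jump`, `SAMPLE` and the `threshold` value are hypothetical, not part of SGE. Records that lack a `jobnumber` line are skipped.

```python
from datetime import datetime

# Timestamp layout as it appears in the qacct excerpt,
# e.g. "Sun Mar 13 08:04:28 2016" (assumes the default C locale).
QACCT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# Four records copied from the excerpt in this thread.
SAMPLE = """\
jobnumber    351331
start_time   Sun Mar 13 08:04:28 2016
jobnumber    351488
start_time   Sun Mar 13 08:04:34 2016
jobnumber    351511
start_time   Sun Mar 13 08:04:54 2016
jobnumber    9999253
start_time   Sun Mar 13 08:04:56 2016
"""


def parse_qacct(text):
    """Return a list of (jobnumber, start_time) tuples from qacct-style text."""
    records = []
    job = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key == "jobnumber":
            job = int(value)
        elif key == "start_time" and job is not None:
            records.append((job, datetime.strptime(value, QACCT_TIME_FORMAT)))
            job = None  # a start_time without its own jobnumber is ignored
    return records


def first_jump(records, threshold=100000):
    """Return the first record (ordered by start time) whose job number
    exceeds the highest one seen so far by more than `threshold`, or None."""
    highest = None
    for job, start in sorted(records, key=lambda r: r[1]):
        if highest is not None and job - highest > threshold:
            return job, start
        highest = job if highest is None else max(highest, job)
    return None
```

Calling `first_jump(parse_qacct(SAMPLE))` yields the record whose start time can then be compared with the qmaster startup timestamp in the messages file. The threshold is arbitrary and would need tuning to the normal submission rate.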
>
>
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Tuesday, March 08, 2016 2:25 PM
> To: Yuri Burmachenko <yur...@mellanox.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month
>
>
>> On 08.03.2016 at 10:59, Yuri Burmachenko <yur...@mellanox.com> wrote:
>>
>> Hello Reuti,
>>
>> See below:
>>
>> Job ID     Job schedule time
>> 97453      29-02-2016_03:18:55
>> 97454      29-02-2016_03:18:57
>> 9999563    29-02-2016_03:23:44
>> 9999564    29-02-2016_03:23:44
>> 9999565    29-02-2016_03:23:44
>> ....
>> 9999999    29-02-2016_03:27:34
>> 1          29-02-2016_03:27:35
>>
>> Any idea what could be the root cause and/or where to look?
>
> Interesting. One could try `incron` to spot any access to the file
> "jobseqnum".
>
> -- Reuti
>
>
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Sunday, March 06, 2016 7:27 PM
>> To: Yuri Burmachenko <yur...@mellanox.com>
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month
>>
>> Hi,
>>
>> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
>>
>>> Hello to the distinguished forum members,
>>>
>>> Recently we have found that something is wrong with SGE job IDs: they are
>>> getting reset very fast, 6-7 times in a month.
>>> We don't really have that many jobs executed in such a short period of time.
>>>
>>> We use the job ID (via qacct) as a primary key for various home-made analytic
>>> tools, and this very quick job ID switch impairs the reliability of the
>>> tools.
>>>
>>> This started after we had a full electricity shutdown during which we
>>> halted all our systems, including the SGE master/shadow and its execution hosts.
>>
>> To elaborate on this: when it suddenly jumps to 99999999, what was the highest
>> JOB_ID recorded before that skip in the accounting file?
>>
>> -- Reuti
>>
>>
>>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to
>>> "9999999" and then something (related or not) restarts SGE, which then
>>> applies that job ID.
>>>
>>> Any tips and advice on where to look for the root cause will be greatly
>>> appreciated.
>>> Thank you.
>>>
>>>
>>>
>>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>>> Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
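
Where `incron` is not available, a crude polling loop can at least timestamp unexpected changes to the "jobseqnum" file mentioned above. This is a rough Python sketch under stated assumptions: it assumes the file holds a single plain-text integer, and `read_seqnum`, `classify_change`, `watch` and the `max_step` value are hypothetical helpers, not part of SGE. Polling can miss short-lived changes and cannot tell which process wrote the file, so it is a fallback rather than a replacement for an inotify-based watch.

```python
import time

MAX_JOB_ID = 9999999  # wrap-around point reported in this thread


def read_seqnum(path):
    """Return the integer stored in the file, or None if it cannot be read.
    Assumption: the file contains one plain-text integer."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def classify_change(previous, current, max_step=10000):
    """Label the transition between two polled values."""
    if previous is None or current is None:
        return "unknown"
    if current == previous:
        return "unchanged"
    if current > previous:
        # A large forward step is the symptom described in the thread.
        return "jump" if current - previous > max_step else "normal"
    # The value went down: expected only when wrapping near the maximum.
    return "wrap" if previous > MAX_JOB_ID - max_step else "reset"


def watch(path, interval=1.0, iterations=60):
    """Poll the file and print a timestamped line for notable transitions."""
    previous = read_seqnum(path)
    for _ in range(iterations):
        time.sleep(interval)
        current = read_seqnum(path)
        kind = classify_change(previous, current)
        if kind in ("jump", "reset", "unknown"):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, kind, previous, "->", current)
        previous = current
```

For example, `classify_change(97454, 9999563)` labels the transition from the job list earlier in this thread as a "jump", while `classify_change(9999999, 1)` is treated as an ordinary "wrap". To run the loop, call `watch()` with the path to the jobseqnum file of the cell in question.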