Hello Reuti,

We will try that, but we have also found another issue.

Our SoGE qmaster also fails and fails over from master to shadow and
vice versa at the same times that these job ID switches occur:

03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
03/13/2016 08:04:21|  main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)

From qacct:
jobnumber    351331              
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:04 2016
jobnumber    351488              
start_time   Sun Mar 13 08:04:34 2016
end_time     Sun Mar 13 08:05:05 2016
jobnumber    351511              
start_time   Sun Mar 13 08:04:54 2016
end_time     Sun Mar 13 08:05:05 2016
jobnumber    351410              
start_time   Sun Mar 13 08:04:29 2016
end_time     Sun Mar 13 08:05:07 2016
jobnumber    351355              
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:07 2016
jobnumber    351502              
start_time   Sun Mar 13 08:04:49 2016
end_time     Sun Mar 13 08:05:08 2016
jobnumber    9999253             
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:08 2016
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:53 2016
jobnumber    9999337             
start_time   Sun Mar 13 08:05:43 2016
end_time     Sun Mar 13 08:05:53 2016
jobnumber    9999254             
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:57 2016

The job ID switch correlates in time with the SoGE failure and the
subsequent failover to the other node.
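
One way to check this correlation systematically is to compare job ID discontinuities in the qacct output with qmaster startup lines in the messages file. The following is a minimal Python sketch, not a tested tool: the function names (`parse_qacct`, `parse_startups`, `find_jumps`, `correlate`), the jump threshold of 100000, and the 5-minute window are all assumptions chosen for illustration, and it only understands the two line formats quoted above.

```python
from datetime import datetime, timedelta

# Formats as they appear in the excerpts above. %a and %b are
# locale-dependent, so this assumes an English/C locale.
QACCT_FMT = "%a %b %d %H:%M:%S %Y"   # e.g. "Sun Mar 13 08:04:28 2016"
LOG_FMT = "%m/%d/%Y %H:%M:%S"        # e.g. "03/13/2016 08:04:21"


def parse_qacct(text):
    """Yield (jobnumber, start_time) pairs from qacct-style key/value lines.

    A start_time with no preceding jobnumber is skipped.
    """
    job = None
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        value = value.strip()
        if key == "jobnumber":
            job = int(value)
        elif key == "start_time" and job is not None:
            yield job, datetime.strptime(value, QACCT_FMT)
            job = None


def parse_startups(log_text):
    """Return timestamps of 'starting up' lines in the qmaster messages file."""
    stamps = []
    for line in log_text.splitlines():
        fields = line.split("|")
        if len(fields) >= 5 and fields[4].startswith("starting up"):
            stamps.append(datetime.strptime(fields[0].strip(), LOG_FMT))
    return stamps


def find_jumps(records, threshold=100000):
    """Return (previous_id, current_id, time) wherever consecutive job IDs,
    ordered by start time, differ by more than the (assumed) threshold."""
    ordered = sorted(records, key=lambda r: r[1])
    jumps = []
    for (prev_id, _), (cur_id, cur_time) in zip(ordered, ordered[1:]):
        if abs(cur_id - prev_id) > threshold:
            jumps.append((prev_id, cur_id, cur_time))
    return jumps


def correlate(jumps, startups, window=timedelta(minutes=5)):
    """Pair each jump with every qmaster startup within the time window."""
    return [(j, s) for j in jumps for s in startups if abs(j[2] - s) <= window]
```

Run on the two excerpts above, this should pair the jump from 351511 to 9999253 (08:04:56) with the qmaster startup at 08:04:21. Running it over a full month of accounting data and the complete messages file would show whether every jump has a restart next to it.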

What we need to understand now is why SoGE fails.

We would appreciate any tips and advice on this.
Thank you.


-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de] 
Sent: Tuesday, March 08, 2016 2:25 PM
To: Yuri Burmachenko <yur...@mellanox.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 
9999999 ==> 1 - 6-7 times in a month


> On 08.03.2016 at 10:59, Yuri Burmachenko <yur...@mellanox.com> wrote:
> 
> Hello Reuti,
> 
> See below:
> 
> Job ID                Job schedule time
> 97453         29-02-2016_03:18:55
> 97454         29-02-2016_03:18:57
> 9999563       29-02-2016_03:23:44
> 9999564       29-02-2016_03:23:44
> 9999565       29-02-2016_03:23:44
> ....
> 9999999       29-02-2016_03:27:34
> 1             29-02-2016_03:27:35
> 
> Any idea what could be the root cause and/or where to look?

Interesting. One could try `incron` to spot any access to the file "jobseqnum".
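
Where incron is not available, a crude stand-in is to poll the file and log every change. The Python sketch below uses polling rather than inotify, so it is not what incron does: it can miss changes that happen between two polls, and it cannot tell which process wrote the file. The function names and the one-second interval are assumptions made for illustration; the path in the usage comment is the one mentioned earlier in this thread.

```python
import os
import time


def read_state(path):
    """Return (mtime_ns, content) for the file, or None if it is missing."""
    try:
        st = os.stat(path)
        with open(path) as fh:
            return st.st_mtime_ns, fh.read().strip()
    except FileNotFoundError:
        return None


def detect_change(previous, current):
    """Describe the difference between two states, or return None if equal."""
    if previous == current:
        return None
    old = previous[1] if previous else "<missing>"
    new = current[1] if current else "<missing>"
    return f"{old} -> {new}"


def watch(path, interval=1.0, max_polls=None):
    """Poll the file and print a timestamped line whenever it changes.

    Runs forever unless max_polls is given.
    """
    previous = read_state(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = read_state(path)
        change = detect_change(previous, current)
        if change:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), path, change)
        previous = current
        polls += 1


# Example (runs until interrupted):
#   watch(os.path.join(os.environ["SGE_ROOT"],
#                      "default/spool/qmaster/jobseqnum"))
```

The timestamps it prints can then be compared with the qmaster messages file to see whether the value changes before or after a restart.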

-- Reuti


> 
> Thanks.
> 
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Sunday, March 06, 2016 7:27 PM
> To: Yuri Burmachenko <yur...@mellanox.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset 
> very fast 9999999 ==> 1 - 6-7 times in a month
> 
> Hi,
> 
> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
> 
>> Hello to the distinguished forum members,
>> 
>> We recently found that something is wrong with our SGE job IDs: they are 
>> getting reset very quickly, 6-7 times a month.
>> We do not run nearly enough jobs to use up the ID range in such a short time.
>> 
>> We use the job ID (via qacct) as a primary key in several home-made analytics 
>> tools, and these rapid job ID resets impair the reliability of the tools.
>> 
>> This started after a full electricity shutdown, during which we halted 
>> all our systems, including the SGE master/shadow and its execution hosts.
> 
> To elaborate on this: when it suddenly jumps to 9999999, what was the highest 
> JOB_ID recorded in the accounting file before that skip?
> 
> -- Reuti
> 
> 
>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to 
>> "9999999", and then something (related or not) restarts SGE, which picks 
>> up that job ID.
>> 
>> Any tips or advice on where to look for the root cause would be greatly 
>> appreciated.
>> Thank you.
>> 
>> 
>> 
>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>> Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245 
>> 
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 


