Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month

Reuti Mon, 14 Mar 2016 02:46:26 -0700

Hi,

> On 13.03.2016 at 08:53, Yuri Burmachenko <yur...@mellanox.com> wrote:
> 
> Hello Reuti,
> 
> We will try that, but we have also found another issue.
> 
> We also see that our SoGE fails and fails over from master to shadow and 
> vice versa at the same time as this switch in job ID occurs:


Is the spool directory shared between the qmaster and shadow daemons?

-- Reuti
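
One way to follow up on the `jobseqnum` file mentioned further down this thread, without installing anything, might be to poll it and log every change with a timestamp, so the changes can be lined up against the qmaster/shadow failover times. The sketch below is only an illustration using the Python standard library: the function names are hypothetical, it polls rather than using inotify the way `incron` does, it can miss changes that happen between two polls, and it cannot tell which process wrote the file.

```python
import os
import time
from datetime import datetime


def read_state(path):
    """Return (mtime, text) for the file, or None if it cannot be read."""
    try:
        st = os.stat(path)
        with open(path, "r") as fh:
            return (st.st_mtime, fh.read().strip())
    except OSError:
        return None


def describe_change(old, new):
    """Return a timestamped log line if the state changed, else None."""
    if old == new:
        return None
    old_val = old[1] if old else "<unreadable>"
    new_val = new[1] if new else "<unreadable>"
    stamp = datetime.now().isoformat()
    return "%s jobseqnum: %s -> %s" % (stamp, old_val, new_val)


def watch(path, interval=1.0):
    """Poll the file forever and print one line per observed change."""
    prev = read_state(path)
    while True:
        time.sleep(interval)
        curr = read_state(path)
        line = describe_change(prev, curr)
        if line:
            print(line, flush=True)
        prev = curr
```

Calling `watch()` with the full path of the `jobseqnum` file (under `$SGE_ROOT/default/spool/qmaster/` in the setup described in this thread) and redirecting the output to a log would give a record to compare with the qmaster messages file.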


> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
> 03/13/2016 08:04:21|  main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)
> 
> From qacct:
> jobnumber    351331              
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:04 2016
> jobnumber    351488              
> start_time   Sun Mar 13 08:04:34 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351511              
> start_time   Sun Mar 13 08:04:54 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351410              
> start_time   Sun Mar 13 08:04:29 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351355              
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351502              
> start_time   Sun Mar 13 08:04:49 2016
> end_time     Sun Mar 13 08:05:08 2016
> jobnumber    9999253             
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:08 2016
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999337             
> start_time   Sun Mar 13 08:05:43 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999254             
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:57 2016
> 
> The job ID switch correlates in time with the SoGE failure and the 
> subsequent failover to the other node.
> 
> Basically now we need to understand why the SoGE fails...
> 
> We would appreciate any tips and advice on this.
> Thank You.
> 
> 
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de] 
> Sent: Tuesday, March 08, 2016 2:25 PM
> To: Yuri Burmachenko <yur...@mellanox.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 
> 9999999 ==> 1 - 6-7 times in a month
> 
> 
>> On 08.03.2016 at 10:59, Yuri Burmachenko <yur...@mellanox.com> wrote:
>> 
>> Hello Reuti,
>> 
>> See below:
>> 
>> Job ID     Job schedule time
>> 97453      29-02-2016_03:18:55
>> 97454      29-02-2016_03:18:57
>> 9999563    29-02-2016_03:23:44
>> 9999564    29-02-2016_03:23:44
>> 9999565    29-02-2016_03:23:44
>> ....
>> 9999999    29-02-2016_03:27:34
>> 1          29-02-2016_03:27:35
>> 
>> Any idea what could be the root cause and/or where to look?
> 
> Interesting. One could try `incron` to spot any access to the file 
> "jobseqnum".
> 
> -- Reuti
> 
> 
>> 
>> Thanks.
>> 
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Sunday, March 06, 2016 7:27 PM
>> To: Yuri Burmachenko <yur...@mellanox.com>
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset 
>> very fast 9999999 ==> 1 - 6-7 times in a month
>> 
>> Hi,
>> 
>> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
>> 
>>> Hello to the distinguished forum members,
>>> 
>>> Recently we have found that something is wrong with SGE job IDs: they are 
>>> getting reset very fast, 6-7 times in a month.
>>> We don't really execute that many jobs in such a short period of time.
>>> 
>>> We use JobId (via qacct) as a primary key for different home-made analytic 
>>> tools, and this very quick jobId switch impairs the reliability of the 
>>> tools.
>>> 
>>> This started after we had a full electricity shutdown during which we 
>>> halted all our systems, including the SGE master/shadow and its execution hosts.
>> 
>> To elaborate on this: when it suddenly jumps to 9999999, what was the highest 
>> JOB_ID recorded in the accounting file before that skip?
>> 
>> -- Reuti
>> 
>> 
>>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to 
>>> "9999999" and then something (related or not) restarts SGE setting that 
>>> jobid.
>>> 
>>> Any tips or advice on where to look for the root cause will be greatly 
>>> appreciated.
>>> Thank You.
>>> 
>>> 
>>> 
>>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>>> Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245 
>>> Follow us on Twitter and Facebook
>>> 
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 
> 
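
The qacct excerpt quoted above can also be scanned for discontinuities automatically, which may help pin down the exact moment of each jump. The following is a rough sketch, not a tested tool: it assumes the input is plain qacct-style text with `jobnumber` and `start_time` lines as shown in this thread (without the `> ` quote prefix), that dates use the English format seen above, and that a step of more than 1000 between consecutive job numbers is suspicious. The function names and the threshold are made up for illustration.

```python
from datetime import datetime

# Date format as it appears in the qacct output quoted in this thread.
# Parsing %a/%b assumes an English (or C) locale.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_records(lines):
    """Collect (start_time, jobnumber) pairs from qacct-style lines."""
    records = []
    job = None
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key == "jobnumber":
            job = int(value)
        elif key == "start_time" and job is not None:
            try:
                started = datetime.strptime(value, TIME_FORMAT)
            except ValueError:
                # Skip records whose start time is not a parsable date.
                job = None
                continue
            records.append((started, job))
            job = None  # a start_time without its own jobnumber is ignored
    return records


def find_jumps(records, max_step=1000):
    """Return (t0, job0, t1, job1) wherever consecutive IDs differ too much."""
    ordered = sorted(records)  # order by start time, then job number
    jumps = []
    for (t0, j0), (t1, j1) in zip(ordered, ordered[1:]):
        if abs(j1 - j0) > max_step:
            jumps.append((t0, j0, t1, j1))
    return jumps
```

Note that this orders jobs by start time rather than submission time, so jobs that waited in the queue can appear out of sequence, and the legitimate wrap from 9999999 to 1 is reported as a jump as well.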

