On 26.11.2014 at 19:02, Reuti wrote:

> On 26.11.2014 at 11:30, Guillermo Marco Puche wrote:
> 
>> On 26/11/14 11:17, Reuti wrote:
>>> On 26.11.2014 at 08:23, Guillermo Marco Puche wrote:
>>> 
>>>> On 26/11/14 00:42, Reuti wrote:
>>>>> Hi,
>>>>> 
>>>>> On 25.11.2014 at 23:28, Guillermo Marco Puche wrote:
>>>>> 
>>>>>> I'm experiencing a very weird issue and have no idea how to deal with it.
>>>>>>  • I've submitted multiple jobs, e.g. job1, job2, job3.
>>>>>>  • The jobs are running on multiple compute nodes.
>>>>>>  • I've put the jobs on user hold and then rescheduled them.
>>>>>>  • The jobs are now in the hqR state in the SGE job pool (they're supposed 
>>>>>> to stay there and free their slots and resources on their respective 
>>>>>> compute nodes).
>>>>>>  • The compute nodes that previously ran these jobs continue to execute 
>>>>>> the job processes and consume resources (I can see them with htop on the 
>>>>>> compute node).
>>>>> But they are gone from `qstat` and not listed twice?
>>>> Nope, they're listed once in qstat.
>>>>> 
>>>>>> So what's the correct way to pause/restart a job and hold it in the SGE 
>>>>>> pool without it holding resources?
>>>>> Are these processes still bound to the execd and the shepherd of SGE or 
>>>>> did they jump out of the process tree compared to the time when they were 
>>>>> running initially?
>>>> Yes, the processes are still bound to the execd and the shepherd of SGE.
>>> Which version of SGE are you using? After issuing `qmod -rj <jobid>` they 
>>> should be gone, of course.
>> GE 6.2u5
> 
> Can you please set the loglevel in SGE's configuration:
> 
> $ qconf -sconf
> ...
> loglevel                     log_info
> 
> and have a look at the messages file of the node. There should be an entry 
> like:
> 
> $ less /var/spool/sge/mypc/messages
> 11/26/2014 19:00:24|  main|mypc|I|SIGNAL jid: 11772 jatask: 1 signal: KILL
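
To check a longer messages file for such entries, something like the following could be used. This is only a minimal sketch: the pipe-separated layout and the "SIGNAL jid: ... jatask: ... signal: ..." wording are taken from the single sample line above and may differ in other SGE versions, and the names `SignalEvent`, `parse_signal_line` and `signals_for_job` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class SignalEvent(NamedTuple):
    """One SIGNAL entry from an execd messages file (hypothetical helper type)."""
    timestamp: str
    host: str
    job_id: int
    task_id: int
    signal: str


# Assumption: message wording modelled on the one sample line quoted above.
_SIGNAL_RE = re.compile(r"SIGNAL jid: (\d+) jatask: (\d+) signal: (\w+)")


def parse_signal_line(line: str) -> Optional[SignalEvent]:
    """Return a SignalEvent for a SIGNAL log line, or None for any other line."""
    # Assumed layout: timestamp|thread|host|level|message
    fields = [f.strip() for f in line.rstrip("\n").split("|", 4)]
    if len(fields) != 5:
        return None
    timestamp, _thread, host, _level, message = fields
    match = _SIGNAL_RE.search(message)
    if match is None:
        return None
    return SignalEvent(timestamp, host, int(match.group(1)),
                       int(match.group(2)), match.group(3))


def signals_for_job(lines: Iterable[str], job_id: int) -> List[SignalEvent]:
    """Collect all SIGNAL entries that belong to the given job id."""
    events = (parse_signal_line(line) for line in lines)
    return [ev for ev in events if ev is not None and ev.job_id == job_id]


sample = "11/26/2014 19:00:24|  main|mypc|I|SIGNAL jid: 11772 jatask: 1 signal: KILL"
event = parse_signal_line(sample)
print(event)
```

To scan a real file, the lines of the open messages file of the node can be passed to `signals_for_job` together with the id of the rescheduled job. An empty result would suggest that the execd never logged a signal for that job.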

BTW: Does an entry exist in `qacct -j <jobid>` for this rescheduled job?

-- Reuti


> -- Reuti
> 
> 
>> Guillermo.
>>> 
>>> -- Reuti
>>> 
>>>>> Do you use any `trap` inside the job script?
>>>> No trap commands.
>>>>> -- Reuti
>>>> Regards,
>>>> Guillermo.
>>>> 
> 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> 


