Re: [gridengine users] job restart - cannot run on host until clean up of an previous run has finished

Reuti Tue, 21 Aug 2012 23:33:53 -0700

Hi,

On 22.08.2012 at 00:37, Henrichs, Juryk wrote:


> Hello Reuti,
> 
> checkpointing type is application_level. The migr_command script basically 
> writes one value into one file to tell the application to stop. All the rest 
> is taken care of by the application itself.

So it's not certain that the application has really left the machine when the 
"migr_command" finishes, right? I would suggest putting a sleep into the 
procedure and checking whether the job script is gone, and/or performing a 
safety kill:

kill -9 -- -$1

There are some undocumented variables, so $job_pid can be passed as $1 to the 
"migr_command":

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/checkpoint.html
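
A minimal sketch of such a "migr_command" script, under a few assumptions: 
the checkpointing environment passes $job_pid as $1, the job runs in its own 
process group led by that PID, and the application is told to stop through a 
control file. The names STOP_FILE, wait_for_pgroup_exit and migrate_job are 
made up for illustration and are not part of SGE. It spells the safety kill 
as "kill -s KILL -- -PGID", which is meant to behave like the "kill -9" form 
above while also being accepted by stricter /bin/sh implementations.

```shell
#!/bin/sh
# Hypothetical migr_command sketch, not an SGE-provided script.
# Assumed configuration:  migr_command  /path/to/migrate.sh $job_pid

# Return 0 once no process is left in process group $1,
# or 1 if the group still exists after about $2 seconds.
wait_for_pgroup_exit() {
    wf_pgid=$1
    wf_remaining=$2
    # Signal 0 only tests whether the process group still exists.
    while kill -s 0 -- "-$wf_pgid" 2>/dev/null; do
        [ "$wf_remaining" -le 0 ] && return 1
        sleep 1
        wf_remaining=$((wf_remaining - 1))
    done
    return 0
}

# Ask the application to stop, wait up to $2 seconds (default 60),
# then fall back to the safety kill of the whole process group.
migrate_job() {
    mj_pgid=$1
    mj_timeout=${2:-60}
    # Site-specific step: STOP_FILE is a made-up variable standing in for
    # the file the application watches for its stop request.
    if [ -n "${STOP_FILE:-}" ]; then
        echo 1 > "$STOP_FILE"
    fi
    if ! wait_for_pgroup_exit "$mj_pgid" "$mj_timeout"; then
        kill -s KILL -- "-$mj_pgid" 2>/dev/null
    fi
    return 0
}

# Run only when a PID was passed on the command line.
if [ "$#" -ge 1 ]; then
    migrate_job "$1" "${2:-60}"
fi
```

The timeout is a trade-off: too short and the application may be killed 
before it has written its checkpoint, too long and the rescheduled job waits 
for the cleanup on that host.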

-- Reuti


> Having said this: the qsub command is started in a script which starts some 
> other processes in parallel to keep track of the computation. Those may not 
> have finished by then. However, this should not be a problem, since the qsub 
> command has not yet returned (as long as the job is suspended and rescheduled 
> but not finished).
> 
> Juryk
> 
>> Reuti        Tuesday, 21 August 2012 23:47
>> Hi,
>> 
>> On 21.08.2012 at 22:44, Henrichs, Juryk wrote:
>> 
>> 
>>> we are running SGE 6.2u5. I am trying to restart jobs via checkpointing.
>>> On one of our clusters that works fine: a job is suspended via the
>>> suspend command, is stopped, rescheduled in the queue and restarted if
>>> resources are available.
>>> 
>>> With apparently the same SGE setup on a second cluster, my jobs
>>> are rescheduled but do not get started. qstat -sj shows
>>> "cannot run on host XXX until clean up of an previous run has finished"
>>> 
>>> If the job is deleted from the queue and restarted manually, it works perfectly.
>>> 
>>> Is there a way to get a more elaborate error message and to find out
>>> what exactly goes wrong with the cleanup?
>>> 
>> 
>> Depending on the checkpointing setup it might be necessary to remove all 
>> processes of a job in the "migr_command" defined script. Which checkpointing 
>> type do you use and how do you remove the processes therein?
>> 
>> -- Reuti
>> 
>> 
>> 
>>> Juryk
>>> 
>>> 
>>> This e-mail and any attachment thereto may contain confidential information 
>>> and/or information protected by intellectual property rights for the 
>>> exclusive attention of the intended addressees named above. Any access of 
>>> third parties to this e-mail is unauthorised. Any use of this e-mail by 
>>> unintended recipients such as total or partial copying, distribution, 
>>> disclosure etc. is prohibited and may be unlawful. When addressed to our 
>>> clients the content of this e-mail is subject to the General Terms and 
>>> Conditions of GL's Group of Companies applicable at the date of this e-mail.
>>> If you have received this e-mail in error, please notify the sender either 
>>> by telephone or by e-mail and delete the material from any computer.
>>> GL's Group of Companies does not warrant and/or guarantee that this message 
>>> at the moment of receipt is authentic, correct and its communication free 
>>> of errors, interruption etc.
>>> FutureShip GmbH, HRB 106781 AG HH, VAT Reg. No. DE263937825
>>> Geschäftsführer (CEO): Volker Höppner, Henning Kinkhorst, Stefan Deucker
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> 
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>>> 
>>> 
> 
> -- 
> Juryk Henrichs,
> 
> Senior Project Engineer
> Fluid Engineering
> FutureShip GmbH -- A GL company
> 
> Office Potsdam
> Behlertstr. 3a, Haus G
> D-14467 Potsdam
> 
> Tel.: +49 331 9799 179-16
> Fax.: +49 331 9799 179-9
> 
> http://www.futureship.net
> http://www.gl-group.com
