Hi,

On 22.08.2012 at 00:37, Henrichs, Juryk wrote:
> Hello Reuti,
>
> checkpointing type is application_level. The migr_command script basically
> writes one value into one file to tell the application to stop. All the rest
> is taken care of by the application itself.

So it's not certain that the application has really left the machine when the
"migr_command" finishes, right? I would suggest putting a sleep into the
procedure and checking whether the job script is gone, and/or performing a
safety kill:

    kill -9 -- -$1

There are some undocumented variables, so $job_pid can be passed as $1 to the
"migr_command":

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/checkpoint.html

-- Reuti

> Having said this: the qsub command is started in a script which starts some
> other processes in parallel to keep track of the computation. Those may not
> be finished by then. However, this should not be a problem, since the qsub
> command has not yet returned (as long as the job is suspended and rescheduled
> but not finished).
>
> Juryk
>
>> Reuti, Tuesday, 21 August 2012 23:47
>> Hi,
>>
>> On 21.08.2012 at 22:44, Henrichs, Juryk wrote:
>>
>>> we are running sge 6.2u5. I am trying to restart jobs via checkpointing.
>>> On one of our clusters that works fine: a job is suspended via the
>>> suspend command, is stopped, rescheduled in the queue and restarted if
>>> resources are available.
>>>
>>> With apparently the same setup of the sge on a second cluster my jobs
>>> are rescheduled but do not get started. qstat -sj shows
>>> "cannot run on host XXX until clean up of an previous run has finished"
>>>
>>> If the job is deleted from the queue and restarted manually, it works
>>> perfectly.
>>>
>>> Is there a way to get a more elaborate error message and to find out
>>> what exactly goes wrong with the cleanup?
>>
>> Depending on the checkpointing setup it might be necessary to remove all
>> processes of a job in the "migr_command" defined script.
>> Which checkpointing type do you use and how do you remove the processes
>> therein?
>>
>> -- Reuti
>>
>>> Juryk
>>>
>>> This e-mail and any attachment thereto may contain confidential information
>>> and/or information protected by intellectual property rights for the
>>> exclusive attention of the intended addressees named above. Any access of
>>> third parties to this e-mail is unauthorised. Any use of this e-mail by
>>> unintended recipients such as total or partial copying, distribution,
>>> disclosure etc. is prohibited and may be unlawful. When addressed to our
>>> clients the content of this e-mail is subject to the General Terms and
>>> Conditions of GL's Group of Companies applicable at the date of this e-mail.
>>> If you have received this e-mail in error, please notify the sender either
>>> by telephone or by e-mail and delete the material from any computer.
>>> GL's Group of Companies does not warrant and/or guarantee that this message
>>> at the moment of receipt is authentic, correct and its communication free
>>> of errors, interruption etc.
>>> FutureShip GmbH, HRB 106781 AG HH, VAT Reg. No. DE263937825
>>> Managing Directors (CEO): Volker Höppner, Henning Kinkhorst, Stefan Deucker
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
> --
> Juryk Henrichs,
>
> Senior Project Engineer
> Fluid Engineering
> FutureShip GmbH -- A GL company
>
> Office Potsdam
> Behlertstr. 3a, Haus G
> D-14467 Potsdam
>
> Tel.: +49 331 9799 179-16
> Fax.: +49 331 9799 179-9
>
> http://www.futureship.net
> http://www.gl-group.com
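
Reuti's suggestion in this thread (signal the application, wait, check whether the job is gone, then do a safety kill) could be sketched roughly as the "migr_command" script below. This is only an illustration under several assumptions: the script name migrate_job.sh, the stop-flag path and the timeout argument are hypothetical, and the job PID passed as $1 is assumed to lead its own process group, as the `kill -9 -- -$1` idiom implies. The sketch uses `kill -s KILL`, which sends the same signal as `kill -9`.

```shell
#!/bin/sh
# migrate_job.sh: hypothetical migr_command helper, a sketch and not part of SGE.
# $1: PID of the job (assumed to be handed over via $job_pid and assumed to
#     be the leader of the job's process group)
# $2: stop-flag file that the application polls (hypothetical path)
# $3: seconds to wait before the safety kill (hypothetical, default 60)

# Succeeds while at least one process in process group $1 still exists.
group_alive() {
    kill -s 0 -- "-$1" 2>/dev/null
}

migrate_job() {
    pgid=$1
    flag=$2
    timeout=${3:-60}

    # Application-level checkpointing: tell the application to stop.
    echo 1 > "$flag"

    # Give the application time to write its checkpoint and exit by itself.
    waited=0
    while group_alive "$pgid" && [ "$waited" -lt "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Safety kill of anything left over in the job's process group.
    if group_alive "$pgid"; then
        kill -s KILL -- "-$pgid" 2>/dev/null
    fi
    return 0
}

# Run only when called with arguments, e.g. from the checkpoint environment.
if [ "$#" -ge 2 ]; then
    migrate_job "$@"
fi
```

In the checkpoint environment this might be wired up with a line such as `migr_command /path/to/migrate_job.sh $job_pid /path/to/stop.flag 120`, where both paths are placeholders and the flag file has to be the one the application actually reads. Whether $job_pid is substituted there should be checked against the checkpoint(5) page for the installed version, since Reuti describes these variables as undocumented.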
