Re: [gridengine users] job restart - cannot run on host until clean up of an previous run has finished

Henrichs, Juryk Tue, 21 Aug 2012 15:35:43 -0700

Hello Reuti,

The checkpointing type is application_level. The migr_command script basically 
writes a single value into a file to tell the application to stop. Everything 
else is taken care of by the application itself.
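
As a rough illustration of that scheme (a sketch only, not the actual
migr_command script): the snippet below writes a stop value to a flag file and
then polls a list of process IDs until they have exited or a timeout expires.
The names request_stop, is_alive, wait_for_exit, the flag file path, and the
idea of passing PIDs explicitly are all assumptions made for this example. It
also assumes a POSIX host, where signal 0 can be used to probe a PID.

```python
import os
import time
from pathlib import Path


def request_stop(flag_file: Path, value: str = "1") -> None:
    """Write the stop value via a temp file and rename, so the
    application never reads a half-written flag file."""
    tmp = flag_file.with_name(flag_file.name + ".tmp")
    tmp.write_text(value + "\n")
    os.replace(tmp, flag_file)


def is_alive(pid: int) -> bool:
    """Signal 0 delivers nothing; it only checks whether the PID exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pids, timeout: float = 60.0, poll: float = 0.5):
    """Poll until all PIDs are gone or the timeout expires.

    Returns the PIDs that are still alive (an empty list means all exited)."""
    deadline = time.monotonic() + timeout
    remaining = [p for p in pids if is_alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = [p for p in remaining if is_alive(p)]
    return remaining
```

A script built along these lines could log whatever wait_for_exit returns,
which would at least show whether any processes outlive the stop request.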


That said, the qsub command is launched from a script that also starts some 
other processes in parallel to keep track of the computation. Those may not 
have finished by then. However, this should not be a problem, since the qsub 
command has not yet returned (it does not return as long as the job is only 
suspended and rescheduled rather than finished).

Juryk




        Reuti <mailto:[email protected]>
        Tuesday, 21 August 2012 23:47
        Hi,

        On 21.08.2012 at 22:44, Henrichs, Juryk wrote:


                We are running SGE 6.2u5. I am trying to restart jobs via
                checkpointing. On one of our clusters that works fine: the job
                is suspended via the suspend command, stopped, rescheduled in
                the queue, and restarted once resources are available.

                With apparently the same SGE setup on a second cluster, my jobs
                are rescheduled but do not get started. qstat -sj shows
                "cannot run on host XXX until clean up of an previous run has finished"

                If the job is deleted from the queue and restarted manually, it
                works perfectly.

                Is there a way to get a more detailed error message and to find
                out what exactly goes wrong with the cleanup?


        Depending on the checkpointing setup, it might be necessary to remove
        all processes of a job in the script defined as "migr_command". Which
        checkpointing type do you use, and how do you remove the processes
        therein?

        -- Reuti



                Juryk


                This e-mail and any attachment thereto may contain confidential 
information and/or information protected by intellectual property rights for 
the exclusive attention of the intended addressees named above. Any access of 
third parties to this e-mail is unauthorised. Any use of this e-mail by 
unintended recipients such as total or partial copying, distribution, 
disclosure etc. is prohibited and may be unlawful. When addressed to our 
clients the content of this e-mail is subject to the General Terms and 
Conditions of GL's Group of Companies applicable at the date of this e-mail.
                If you have received this e-mail in error, please notify the 
sender either by telephone or by e-mail and delete the material from any 
computer.
                GL's Group of Companies does not warrant and/or guarantee that 
this message at the moment of receipt is authentic, correct and its 
communication free of errors, interruption etc.
                FutureShip GmbH, HRB 106781 AG HH, VAT Reg. No. DE263937825
                Geschäftsführer (CEO): Volker Höppner, Henning Kinkhorst, 
Stefan Deucker


                _______________________________________________
                users mailing list
                [email protected]
                https://gridengine.org/mailman/listinfo/users





--
Juryk Henrichs,

Senior Project Engineer
Fluid Engineering
FutureShip GmbH -- A GL company

Office Potsdam
Behlertstr. 3a, Haus G
D-14467 Potsdam

Tel.: +49 331 9799 179-16
Fax.: +49 331 9799 179-9

http://www.futureship.net
http://www.gl-group.com

