Re: [OMPI users] Application hangs when checkpointing application (update)

2009-09-17 Thread Josh Hursey

Interesting. I'll try to take a look and see if I can reproduce today.

-- Josh

On Sep 14, 2009, at 4:54 PM, Jean Potsam wrote:


Hi Josh,
   Thanks for the response. I am actually testing it on a
single node (though in the near future I will run it on a set of
nodes). Therefore, my application is running on the same machine as
mpirun.
When I run the application and trigger the checkpointing mechanism
from a separate terminal, it checkpoints fine.


However, when I try to checkpoint it from within the main program as
shown below, it hangs.


kind regards,

Jean


--- On Mon, 14/9/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs when checkpointing  
application (update)

To: "Open MPI Users" <us...@open-mpi.org>
Date: Monday, 14 September, 2009, 1:27 PM

Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work
without the FT thread enabled, which would be one reason why it
would seem to hang (since it is waiting for the application to enter
the MPI library):

  --enable-ft-thread --enable-mpi-threads

I do not think the message that you saw is related. Often  
orte_checkpoint cannot figure out the jobid on first contact with  
the HNP/mpirun process, so this is displayed as an INVALID handle.


-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:

>
> Hi Everyone,
>   I noticed that it hangs just before displaying the
> following while trying to checkpoint the application.
>
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> ###
>
> Can it be related to the above?
>
> Thanks
>
>
> --
> Hi Everyone,
> I wrote a small program with a function to trigger the
> checkpointing mechanism as follows:
>
> #include <stdio.h>   /* printf */
> #include <stdlib.h>  /* system */
> #include <mpi.h>
> void trigger_checkpoint(void);
> int main(int argc, char **argv)
> {
>   int rank, size;
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   trigger_checkpoint();
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   printf("I am processor no %d of a total of %d procs \n", rank, size);
>   system("sleep 10");
>   printf("bye \n");
>   MPI_Finalize();
>   return 0;
> }
>
> void trigger_checkpoint(void)
> {
>   printf("hi\n");
>   /* system() blocks until ompi-checkpoint exits. */
>   system("ompi-checkpoint -v `pidof mpirun` ");
> }
> #
>
>
> The application works fine on my laptop with Ubuntu as the OS.
> However, when I tried running it on one of the machines at my uni,
> with SUSE Linux installed, the application hangs as soon as
> ompi-checkpoint is triggered. This is what I get:

>
>
>
> ##
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426]    PID 15411
> [sun06:15426]    Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411

> ###
>
> Does anyone have any ideas about this?
>
> Thanks a lot
>
> Jean.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] Application hangs when checkpointing application (update)

2009-09-14 Thread Josh Hursey

Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work
without the FT thread enabled, which would be one reason why it would
seem to hang (since it is waiting for the application to enter the MPI
library):

  --enable-ft-thread --enable-mpi-threads
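
To illustrate the point about waiting for the application to enter the MPI library: system() blocks its caller until the command exits, so a process that calls system("ompi-checkpoint ...") cannot re-enter the MPI library while the checkpoint request is pending. Below is a minimal sketch, in plain POSIX C, of starting a command without blocking the caller. The helper names spawn_detached and poll_child are hypothetical and are not Open MPI APIs. Whether this alone avoids the hang on a build without the FT thread is an assumption that has not been verified here.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an Open MPI API): start `cmd` through /bin/sh in
 * a child process and return at once with the child's PID, or -1 on error.
 * Unlike system(), the caller is not blocked while the command runs.
 */
pid_t spawn_detached(const char *cmd)
{
    assert(cmd != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace this process image with the shell command. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);  /* only reached if execl() failed */
    }
    return pid;      /* parent continues without waiting */
}

/*
 * Hypothetical helper: check on the child without blocking.
 * Returns 1 if it has finished (exit code stored in *status_out, or -1 if
 * it did not exit normally), 0 if it is still running, -1 on error.
 */
int poll_child(pid_t pid, int *status_out)
{
    assert(status_out != NULL);

    int status = 0;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == 0)
        return 0;
    if (r < 0)
        return -1;
    *status_out = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 1;
}
```

In the test program, trigger_checkpoint() could call spawn_detached() with the same command string it currently passes to system(), and the main loop could call poll_child() between MPI calls to reap the child later.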

I do not think the message that you saw is related. Often  
orte_checkpoint cannot figure out the jobid on first contact with the  
HNP/mpirun process, so this is displayed as an INVALID handle.


-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:



Hi Everyone,
  I noticed that it hangs just before displaying the  
following while trying to checkpoint the application.



[sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]

###

Can it be related to the above?

Thanks


--
Hi Everyone,
I wrote a small program with a function to  
trigger the checkpointing mechanism as follows:




#include <stdio.h>   /* printf */
#include <stdlib.h>  /* system */
#include <mpi.h>
void trigger_checkpoint(void);
int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  trigger_checkpoint();
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  printf("bye \n");
  MPI_Finalize();
  return 0;
}

void trigger_checkpoint(void)
{
  printf("hi\n");
  /* system() blocks until ompi-checkpoint exits. */
  system("ompi-checkpoint -v `pidof mpirun` ");
}
#


The application works fine on my laptop with Ubuntu as the OS.
However, when I tried running it on one of the machines at my uni,
with SUSE Linux installed, the application hangs as soon as
ompi-checkpoint is triggered. This is what I get:




##
I am processor no 0 of a total of 1 procs
hi
I am processor no 0 of a total of 1 procs
[sun06:15426] orte_checkpoint: Checkpointing...
[sun06:15426]    PID 15411
[sun06:15426]    Connected to Mpirun [[12727,0],0]
[sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411

###

Does anyone have any ideas about this?

Thanks a lot

Jean.





[OMPI users] Application hangs when checkpointing application (update)

2009-09-11 Thread Jean Potsam
 
Hi Everyone,
  I noticed that it hangs just before displaying the following 
while trying to checkpoint the application.
 

[sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
###
 
Can it be related to the above? 
 
Thanks
 
 
--
Hi Everyone,
    I wrote a small program with a function to trigger the 
checkpointing mechanism as follows:
 

 
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* system */
#include <mpi.h>
void trigger_checkpoint(void);
int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  trigger_checkpoint();
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  printf("I am processor no %d of a total of %d procs \n", rank, size);
  system("sleep 10");
  printf("bye \n");
  MPI_Finalize();
  return 0;
}

void trigger_checkpoint(void)
{
  printf("hi\n");
  /* system() blocks until ompi-checkpoint exits. */
  system("ompi-checkpoint -v `pidof mpirun` ");
}
#
   
 
The application works fine on my laptop with Ubuntu as the OS. However, when I
tried running it on one of the machines at my uni, with SUSE Linux installed,
the application hangs as soon as ompi-checkpoint is triggered. This is what
I get:
 
 
 
##
I am processor no 0 of a total of 1 procs 
hi
I am processor no 0 of a total of 1 procs 
[sun06:15426] orte_checkpoint: Checkpointing...
[sun06:15426]    PID 15411
[sun06:15426]    Connected to Mpirun [[12727,0],0]
[sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
###

 
Does anyone have any ideas about this?
 
Thanks a lot
 
Jean.
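
One detail in the program above that may be worth tightening: `pidof mpirun` prints the PID of every running process named mpirun, so on a shared machine the command line could end up with more than one PID. A minimal sketch of building the command for one explicit PID follows. The helper name build_checkpoint_cmd is hypothetical and not part of Open MPI, and how the caller obtains the right mpirun PID is left as an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of Open MPI): format the checkpoint command
 * for one explicit mpirun PID. Returns 0 on success, or -1 if the PID is
 * not positive or the buffer is too small to hold the whole command.
 */
int build_checkpoint_cmd(char *buf, size_t len, pid_t mpirun_pid)
{
    assert(buf != NULL);

    if (mpirun_pid <= 0)
        return -1;

    int n = snprintf(buf, len, "ompi-checkpoint -v %ld", (long)mpirun_pid);
    if (n < 0 || (size_t)n >= len)
        return -1;  /* encoding error or truncated output */
    return 0;
}
```

With the PID from the log above, 15411, the buffer would hold `ompi-checkpoint -v 15411`. If mpirun launched the process directly on the same node, getppid() might return that PID, but that is an assumption to check on the target system.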