Re: [OMPI users] Question about checkpoint/restart protocol

2009-11-06 Thread Josh Hursey


On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:


Dear Sergio,

Thank you for your reply. I've inserted the modules into the kernel
and it all worked fine. But there is still a weird issue. I use the
command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test"
to start an MPI job. I then use "ompi-checkpoint PID" to checkpoint
the job, but ompi-checkpoint didn't respond and mpirun produced the
following:


--
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

 Local host:  comp001.local (PID 23514)
 MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--
[login01.local:21425] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Notice: if the -n option has a value greater than 1, this error
occurs; but if -n is 1, then ompi-checkpoint succeeds, mpirun
produces the same message, and ompi-restart fails with the message:

[login01:21417] *** Process received signal ***
[login01:21417] Signal: Segmentation fault (11)
[login01:21417] Signal code: Address not mapped (1)
[login01:21417] Failing at address: (nil)
[login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
[login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so [0x2b093509dfee]
[login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]
[login01:21417] [ 3] opal-restart [0x401c3e]
[login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4]
[login01:21417] [ 5] opal-restart [0x401399]
[login01:21417] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 21417 on node  
login01.local exited on signal 11 (Segmentation fault).

--

Any help with that will be appreciated.


I have not seen this behavior before. The first error is Open MPI
warning you that one of your MPI processes is trying to use fork(), so
you may want to make sure that your application is not making any
system() or fork() calls. Open MPI internally should not be using any
of these functions from within the MPI library linked to the
application.
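(For readers of the archive: the warning text itself names the knob. A minimal sketch of silencing it, assuming you really are sure the application survives fork(); the job command line is taken from the thread above:)

```shell
# One-off, on the command line:
mpirun --mca mpi_warn_on_fork 0 -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test

# Or persistently, in $HOME/.openmpi/mca-params.conf:
#   mpi_warn_on_fork = 0
```

Note this only suppresses the warning; it does not make fork() safe if Open MPI is operating in a mode where child processes can corrupt state.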


When you reloaded the BLCR module, did you rebuild Open MPI and
install it into a clean directory (not over the top of the old installation)?


Have you tried to checkpoint/restart a non-MPI process with BLCR on
your system? This will help rule out installation problems with BLCR.
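(Editor's sketch of such a standalone BLCR sanity check, assuming BLCR's cr_run, cr_checkpoint, and cr_restart tools are on the PATH; ./my_long_job is a placeholder for any long-running binary:)

```shell
# Run any program under BLCR's checkpoint support library
cr_run ./my_long_job &
JOB=$!

# Take a checkpoint of that PID; by default this writes context.<PID>
# in the current directory
cr_checkpoint $JOB

# Kill the original process and restart it from the checkpoint file
kill $JOB
cr_restart context.$JOB
```

If this cycle fails, the problem is in the BLCR installation itself rather than in Open MPI.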


I suspect that Open MPI is not building correctly, or something in
your build environment is confusing/corrupting the build. Could you send
me your config.log? It may help me pinpoint the problem if it is build
related.


-- Josh




Re: [OMPI users] Question about checkpoint/restart protocol

2009-11-05 Thread Mohamed Adel
Dear Sergio,

Thank you for your reply. I've inserted the modules into the kernel and it all
worked fine. But there is still a weird issue. I use the command "mpirun -n 2
-am ft-enable-cr -H comp001 checkpoint-restart-test" to start an MPI job. I
then use "ompi-checkpoint PID" to checkpoint the job, but ompi-checkpoint
didn't respond and mpirun produced the following:

--
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:  comp001.local (PID 23514)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--
[login01.local:21425] 1 more process has sent help message help-mpi-runtime.txt 
/ mpi_init:warn-fork
[login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages

Notice: if the -n option has a value greater than 1, this error occurs; but
if -n is 1, then ompi-checkpoint succeeds, mpirun produces the same message,
and ompi-restart fails with the message:
[login01:21417] *** Process received signal ***
[login01:21417] Signal: Segmentation fault (11)
[login01:21417] Signal code: Address not mapped (1)
[login01:21417] Failing at address: (nil)
[login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
[login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so 
[0x2b093509dfee]
[login01:21417] [ 2] 
/home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) 
[0x2b093509d251]
[login01:21417] [ 3] opal-restart [0x401c3e]
[login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4]
[login01:21417] [ 5] opal-restart [0x401399]
[login01:21417] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 21417 on node login01.local exited 
on signal 11 (Segmentation fault).
--

Any help with that will be appreciated.

Thanks in advance,
Mohamed Adel
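(Editor's note: the checkpoint/restart cycle being attempted in this message is, in sketch form, the following; the commands are Open MPI 1.3's C/R tools, and the snapshot directory name is illustrative of the default naming scheme:)

```shell
# Terminal 1: start the job with the C/R aggregate MCA profile
mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test

# Terminal 2: checkpoint the job, passing the PID of mpirun itself
ompi-checkpoint -v <PID_of_mpirun>
# On success this prints a global snapshot reference

# Later: restart the whole job from the saved snapshot
ompi-restart ompi_global_snapshot_<PID>.ckpt
```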



Re: [OMPI users] Question about checkpoint/restart protocol

2009-11-05 Thread Sergio Díaz

Hi,

Did you load the BLCR modules before compiling OpenMPI?

Regards,
Sergio
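(Editor's sketch of checking for and loading the BLCR kernel modules referred to here; the module names blcr and blcr_imports are those of a standard BLCR 0.8.x install:)

```shell
# Check whether the BLCR kernel modules are currently loaded
lsmod | grep blcr

# If nothing is listed, load them (blcr_imports first, then blcr)
sudo modprobe blcr_imports
sudo modprobe blcr
```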




--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/




[OMPI users] Question about checkpoint/restart protocol

2009-11-04 Thread Mohamed Adel
Dear OMPI users,

I'm a new OpenMPI user. I've configured openmpi-1.3.3 with the options 
"./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --enable-ft-thread 
--with-ft=cr --enable-mpi-threads --enable-static --disable-shared 
--with-blcr=/home/mab/blcr-0.8.2/" then compiled and installed it successfully.
Now I'm trying to use the checkpoint/restart protocol. I run a program with the 
options "mpirun -n 2 -am ft-enable-cr -H localhost 
prime/checkpoint-restart-test" but I receive the following error:

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[madel:28896] Abort before MPI_INIT completed successfully; not able to 
guarantee that all other processes were killed!
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--
[madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 77
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--

I can't find the files mentioned in this post
"http://www.open-mpi.org/community/lists/users/2009/09/10641.php"
(mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with that error?

Thanks in advance
Mohamed Adel
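(Editor's note: a quick way to check whether the BLCR checkpoint/restart component in question was actually built into an Open MPI installation is to query ompi_info; the grep pattern is illustrative:)

```shell
# List the checkpoint/restart service (crs) components Open MPI knows about
ompi_info | grep crs
# A build with BLCR support should list a "MCA crs: blcr" line;
# if only "none"/"self" appear, the blcr component was not built.
```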