Re: [OMPI users] Question about restart

2009-04-27 Thread Josh Hursey

Thanks for the bug report.

I am having a difficult time reproducing the error. Are you running on
a single machine using shared memory or across multiple machines using
a high-speed network?


Based on your bug report, my suspicion is that an event is not being  
properly de-registered from the event engine. Typically this means,  
for C/R, that the finalization routine in one of the BTLs  
(interconnect drivers) is missing something. The patch you propose  
seems fine, but I agree that it may be masking another problem.
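
To illustrate the kind of bookkeeping problem suspected here, the following is a minimal sketch, not Open MPI or libevent code. The names event_registry, registry_add, registry_remove_all and registry_leaked are hypothetical. It assumes a fixed-size table in which each component registers events, and shows how a finalization routine that forgets to remove its entries leaves a stale registration behind.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8

/* Hypothetical stand-in for an event engine's registration table. */
struct event_registry {
    const void *owner[MAX_EVENTS];  /* component that registered each slot */
    int         active[MAX_EVENTS]; /* 1 while the slot is registered */
};

/* Register an event on behalf of `owner`; returns the slot, or -1 if full. */
int registry_add(struct event_registry *r, const void *owner)
{
    assert(r != NULL && owner != NULL);
    for (int i = 0; i < MAX_EVENTS; i++) {
        if (!r->active[i]) {
            r->active[i] = 1;
            r->owner[i] = owner;
            return i;
        }
    }
    return -1;
}

/* De-register every event that belongs to `owner`; returns the count removed.
 * A component's finalization routine is expected to call this. */
int registry_remove_all(struct event_registry *r, const void *owner)
{
    int removed = 0;
    assert(r != NULL);
    for (int i = 0; i < MAX_EVENTS; i++) {
        if (r->active[i] && r->owner[i] == owner) {
            r->active[i] = 0;
            r->owner[i] = NULL;
            removed++;
        }
    }
    return removed;
}

/* Count events that are still registered. A non-zero result after all
 * components have finalized points at a missing de-registration. */
int registry_leaked(const struct event_registry *r)
{
    int n = 0;
    assert(r != NULL);
    for (int i = 0; i < MAX_EVENTS; i++) {
        n += r->active[i] ? 1 : 0;
    }
    return n;
}
```

In this sketch, calling registry_leaked() after every component has finalized is the check that would expose the component whose cleanup is incomplete.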


I'll keep digging and let you know if I find something. In the
meantime I will attempt to push in a patch to protect the free() that
you cited.


Cheers,
Josh

On Apr 22, 2009, at 4:09 PM, Yaakoub El Khamra wrote:


Incidentally, if I add a check that base->sig.sh_old is not NULL and
recompile, everything works fine. I am concerned this is just fixing a
symptom rather than the root of the problem.

if(base->sig.sh_old != NULL)
 free(base->sig.sh_old);

is what I used.

Regards
Yaakoub El Khamra




On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra
 wrote:

Greetings
I am trying to get the checkpoint/restart to work on a single machine
with openmpi 1.3 (also tried an svn check-out) and ran into a few
problems. I am guessing I am doing something wrong, and would
appreciate some help.

I built openmpi with:
 ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
--enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
--enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
--enable-trace --enable-static=yes --enable-debug
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
--enable-ft-thread --with-blcr=/usr/local/blcr/
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

I am using blcr 0.8.1 configured with:
 ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
--enable-libcr-tracing=yes --enable-kernel-tracing=yes
--enable-testsuite=yes --enable-all-static=yes --enable-static=yes

Checkpointing works fine, without any problems. I run with:
 mpirun  -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1  -am
ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1  matmultf.exe

I am able to checkpoint without any problems using ompi-checkpoint
--status --term 
but when I try to restart, I get the following error:

[yye00@localhost FTOpenMPI]$ ompi-restart -v  ompi_global_snapshot_23858.ckpt
[localhost.localdomain:24394] Checking for the existence of
(/home/yye00/ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394] Restarting from file
(ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394]Exec in self
malloc debug: Invalid free (signal.c, 304)
malloc debug: Invalid free (signal.c, 304)
[localhost:23860] *** Process received signal ***
[localhost:23860] Signal: Bus error (7)
[localhost:23860] Signal code:  (2)
[localhost:23860] Failing at address: 0x7fcbb737ef88
[localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
[localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1eccae]
[localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed5ba]
[localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed745]
[localhost:23860] [ 4]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc)
[0x7fcbbcba2aa0]
[localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbdead1]
[localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbde8e2]
[localhost:23860] [ 7]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c)
[0x7fcbbcbde17c]
[localhost:23860] [ 8]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2)
[0x7fcbbcba45e9]
[localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0
[0x7fcbbced1d9d]
[localhost:23860] [10]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
[0x7fcbbcba4509]
[localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcba4bc2]
[localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
[localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
[localhost:23860] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 24396 on node
localhost.localdomain exited on signal 7 (Bus error).
--

Running strace on ompi-restart did not provide any useful
information. Any suggestions are greatly appreciated. Incidentally,
line 304 of signal.c is in a deallocation subroutine in opal,
evsignal_dealloc; the actual line is
"free(base->sig.sh_old);". I am about to add debug statements to
that subroutine to see if I can get further information, but was
hoping the problem is more user-related than code-related.

Re: [OMPI users] Question about restart

2009-04-22 Thread Yaakoub El Khamra
Incidentally, if I add a check that base->sig.sh_old is not NULL and
recompile, everything works fine. I am concerned this is just fixing a
symptom rather than the root of the problem.

if(base->sig.sh_old != NULL)
  free(base->sig.sh_old);

is what I used.
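
For readers who want to see the guard in context, here is a minimal sketch of the same idea. It is not the actual Open MPI or libevent source: struct sig_state, sig_state_alloc and sig_state_dealloc are hypothetical names, and only the field name sh_old is borrowed from the snippet above. Beyond the NULL check, the sketch also resets the pointer after freeing, so a second call to the cleanup routine cannot free the same block twice.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the signal state discussed above. */
struct sig_state {
    void **sh_old;     /* saved handlers; NULL when nothing is allocated */
    int    sh_old_max; /* number of entries in sh_old */
};

/* Allocate room for n saved handlers; returns 0 on success, -1 on failure. */
int sig_state_alloc(struct sig_state *s, int n)
{
    assert(s != NULL && n > 0);
    s->sh_old = calloc((size_t)n, sizeof(*s->sh_old));
    if (s->sh_old == NULL) {
        return -1;
    }
    s->sh_old_max = n;
    return 0;
}

/* Release the saved handlers. Safe to call when nothing was allocated,
 * and safe to call more than once. */
void sig_state_dealloc(struct sig_state *s)
{
    assert(s != NULL);
    if (s->sh_old != NULL) {
        free(s->sh_old);
        s->sh_old = NULL; /* prevents a later double free */
        s->sh_old_max = 0;
    }
}
```

Note that in standard C, free(NULL) is defined to do nothing, so a guard like this mainly matters when free is routed through a checking wrapper or when the pointer may already have been released. That is consistent with the concern that the check hides the real question of why the routine runs with nothing to free.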

Regards
Yaakoub El Khamra




On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra
 wrote:
> Greetings
> I am trying to get the checkpoint/restart to work on a single machine
> with openmpi 1.3 (also tried an svn check-out) and ran into a few
> problems. I am guessing I am doing something wrong, and would
> appreciate some help.
>
> I built openmpi with:
>  ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
> --enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
> --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
> --enable-trace --enable-static=yes --enable-debug
> --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
> --enable-ft-thread --with-blcr=/usr/local/blcr/
> --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
>
> I am using blcr 0.8.1 configured with:
>  ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
> --enable-libcr-tracing=yes --enable-kernel-tracing=yes
> --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
>
> Checkpointing works fine, without any problems. I run with:
>  mpirun  -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1  -am
> ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1  matmultf.exe
>
> I am able to checkpoint without any problems using ompi-checkpoint
> --status --term 
> but when I try to restart, I get the following error:
>
> [yye00@localhost FTOpenMPI]$ ompi-restart -v  ompi_global_snapshot_23858.ckpt
> [localhost.localdomain:24394] Checking for the existence of
> (/home/yye00/ompi_global_snapshot_23858.ckpt)
> [localhost.localdomain:24394] Restarting from file
> (ompi_global_snapshot_23858.ckpt)
> [localhost.localdomain:24394]    Exec in self
> malloc debug: Invalid free (signal.c, 304)
> malloc debug: Invalid free (signal.c, 304)
> [localhost:23860] *** Process received signal ***
> [localhost:23860] Signal: Bus error (7)
> [localhost:23860] Signal code:  (2)
> [localhost:23860] Failing at address: 0x7fcbb737ef88
> [localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
> [localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1eccae]
> [localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1ed5ba]
> [localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1ed745]
> [localhost:23860] [ 4]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc)
> [0x7fcbbcba2aa0]
> [localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcbdead1]
> [localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcbde8e2]
> [localhost:23860] [ 7]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c)
> [0x7fcbbcbde17c]
> [localhost:23860] [ 8]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2)
> [0x7fcbbcba45e9]
> [localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0
> [0x7fcbbced1d9d]
> [localhost:23860] [10]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
> [0x7fcbbcba4509]
> [localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcba4bc2]
> [localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
> [localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
> [localhost:23860] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 24396 on node
> localhost.localdomain exited on signal 7 (Bus error).
> --
>
> Running strace on ompi-restart did not provide any useful
> information. Any suggestions are greatly appreciated. Incidentally,
> line 304 of signal.c is in a deallocation subroutine in opal,
> evsignal_dealloc; the actual line is
> "free(base->sig.sh_old);". I am about to add debug statements to
> that subroutine to see if I can get further information, but was
> hoping the problem is more user-related than code-related.
>
>
> Regards
> Yaakoub El Khamra
>



[OMPI users] Question about restart

2009-04-22 Thread Yaakoub El Khamra
Greetings
I am trying to get the checkpoint/restart to work on a single machine
with openmpi 1.3 (also tried an svn check-out) and ran into a few
problems. I am guessing I am doing something wrong, and would
appreciate some help.

I built openmpi with:
 ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
--enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
--enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
--enable-trace --enable-static=yes --enable-debug
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
--enable-ft-thread --with-blcr=/usr/local/blcr/
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

I am using blcr 0.8.1 configured with:
 ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
--enable-libcr-tracing=yes --enable-kernel-tracing=yes
--enable-testsuite=yes --enable-all-static=yes --enable-static=yes

Checkpointing works fine, without any problems. I run with:
 mpirun  -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1  -am
ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1  matmultf.exe

I am able to checkpoint without any problems using ompi-checkpoint
--status --term 
but when I try to restart, I get the following error:

[yye00@localhost FTOpenMPI]$ ompi-restart -v  ompi_global_snapshot_23858.ckpt
[localhost.localdomain:24394] Checking for the existence of
(/home/yye00/ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394] Restarting from file
(ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394]Exec in self
malloc debug: Invalid free (signal.c, 304)
malloc debug: Invalid free (signal.c, 304)
[localhost:23860] *** Process received signal ***
[localhost:23860] Signal: Bus error (7)
[localhost:23860] Signal code:  (2)
[localhost:23860] Failing at address: 0x7fcbb737ef88
[localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
[localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1eccae]
[localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed5ba]
[localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed745]
[localhost:23860] [ 4]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc)
[0x7fcbbcba2aa0]
[localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbdead1]
[localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbde8e2]
[localhost:23860] [ 7]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c)
[0x7fcbbcbde17c]
[localhost:23860] [ 8]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2)
[0x7fcbbcba45e9]
[localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0
[0x7fcbbced1d9d]
[localhost:23860] [10]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
[0x7fcbbcba4509]
[localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcba4bc2]
[localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
[localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
[localhost:23860] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 24396 on node
localhost.localdomain exited on signal 7 (Bus error).
--

Running strace on ompi-restart did not provide any useful
information. Any suggestions are greatly appreciated. Incidentally,
line 304 of signal.c is in a deallocation subroutine in opal,
evsignal_dealloc; the actual line is
"free(base->sig.sh_old);". I am about to add debug statements to
that subroutine to see if I can get further information, but was
hoping the problem is more user-related than code-related.
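
As an example of the kind of debug statement that could help here, the sketch below wraps free() so that a NULL argument is reported together with the caller's file and line, similar in spirit to the "Invalid free (signal.c, 304)" message in the log. This is an assumption-laden illustration and not Open MPI's malloc debug implementation: debug_free_at, DEBUG_FREE and debug_free_null_count are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter so callers can tell how many suspicious frees occurred. */
int debug_free_null_count = 0;

/* Free `ptr`, reporting the call site on stderr when it is NULL. */
void debug_free_at(void *ptr, const char *file, int line)
{
    assert(file != NULL);
    if (ptr == NULL) {
        debug_free_null_count++;
        fprintf(stderr, "debug: free of NULL pointer (%s, %d)\n", file, line);
        return;
    }
    free(ptr);
}

/* Capture the caller's file and line automatically. */
#define DEBUG_FREE(ptr) debug_free_at((ptr), __FILE__, __LINE__)
```

Replacing the plain free() call in the routine under investigation with DEBUG_FREE(...) would show whether the pointer is already NULL when the routine runs, which helps separate a missing allocation from a double free.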


Regards
Yaakoub El Khamra