Re: [OMPI users] Question about restart
Thanks for the bug report. I am having a difficult time reproducing the
error. Are you running on a single machine using shared memory, or across
multiple machines using a high-speed network?

Based on your bug report, my suspicion is that an event is not being
properly de-registered from the event engine. Typically this means, for
C/R, that the finalization routine in one of the BTLs (interconnect
drivers) is missing something.

The patch you propose seems fine, but I agree that it may be masking
another problem. I'll keep digging and let you know if I find something.
In the meantime I will attempt to push in a patch to protect the free()
that you cited.

Cheers,
Josh

On Apr 22, 2009, at 4:09 PM, Yaakoub El Khamra wrote:

> Incidentally, if I add a check for the value base->sig.sh_old, that it
> is not NULL, and recompile, everything works fine. I am concerned this
> is just fixing a symptom rather than the root of the problem.
>
>     if (base->sig.sh_old != NULL)
>         free(base->sig.sh_old);
>
> is what I used.
>
> Regards
> Yaakoub El Khamra
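To illustrate what "properly de-registered" could mean for a signal event, here is a minimal sketch of the save/restore pattern: registering saves the previous signal disposition, and de-registering restores it, frees the saved copy, and clears the pointer so that a later cleanup pass finds nothing left to free. The names `sig_slot`, `sig_slot_register`, and `sig_slot_deregister` are hypothetical and are not taken from Open MPI or its event engine; only the POSIX `sigaction` call is real.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an event engine's per-signal bookkeeping. */
struct sig_slot {
    int signum;
    struct sigaction *old;   /* saved previous disposition, NULL if none */
};

/* Placeholder handler; a real engine would dispatch to its event loop. */
static void noop_handler(int signum)
{
    (void)signum;
}

/* Install a handler and save the previous disposition. Returns 0 on success. */
int sig_slot_register(struct sig_slot *slot, int signum)
{
    struct sigaction sa;

    assert(slot != NULL && slot->old == NULL);
    slot->old = malloc(sizeof(*slot->old));
    if (slot->old == NULL) {
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, slot->old) != 0) {
        free(slot->old);
        slot->old = NULL;
        return -1;
    }
    slot->signum = signum;
    return 0;
}

/* Restore the saved disposition and release it. Calling this a second
 * time is a no-op because the pointer is cleared after the free. */
int sig_slot_deregister(struct sig_slot *slot)
{
    int rc = 0;

    assert(slot != NULL);
    if (slot->old == NULL) {
        return 0;            /* nothing registered, nothing to undo */
    }
    if (sigaction(slot->signum, slot->old, NULL) != 0) {
        rc = -1;
    }
    free(slot->old);
    slot->old = NULL;
    return rc;
}
```

If a finalization path skips the de-register step, or runs it without clearing the pointer, a later deallocation routine can end up freeing a pointer that is stale or already NULL, which would be consistent with the suspicion above but is not confirmed by this thread.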
Re: [OMPI users] Question about restart
Incidentally, if I add a check that base->sig.sh_old is not NULL and
recompile, everything works fine. I am concerned this is just fixing a
symptom rather than the root of the problem.

    if (base->sig.sh_old != NULL)
        free(base->sig.sh_old);

is what I used.

Regards
Yaakoub El Khamra

On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra wrote:
> Greetings
> I am trying to get the checkpoint/restart to work on a single machine
> with openmpi 1.3 (also tried an svn check-out) and ran into a few
> problems. I am guessing I am doing something wrong, and would
> appreciate some help.
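A note on why the NULL check changes anything: in standard C, `free(NULL)` is defined to do nothing, so the guard only matters when `free` is routed through a debugging layer that treats a NULL argument as an error. The sketch below assumes such a layer. `debug_free`, `debug_free_invalid_count`, `event_base_like`, and `dealloc_guarded` are hypothetical names that merely mimic the "Invalid free (file, line)" message from the log; this is not Open MPI's actual implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int invalid_free_count = 0;

/* Hypothetical debugging wrapper: reports a NULL free with its call site
 * instead of silently ignoring it as plain free() would. */
void debug_free(void *ptr, const char *file, int line)
{
    if (ptr == NULL) {
        invalid_free_count++;
        fprintf(stderr, "malloc debug: Invalid free (%s, %d)\n", file, line);
        return;
    }
    free(ptr);
}

int debug_free_invalid_count(void)
{
    return invalid_free_count;
}

/* Stand-ins for the structures behind base->sig.sh_old (assumed layout). */
struct sig_info {
    void **sh_old;           /* saved handlers, NULL when nothing is saved */
};

struct event_base_like {
    struct sig_info sig;
};

/* Guarded deallocation: skip the free when nothing was saved, and clear
 * the pointer afterwards so a second call cannot free it again. */
void dealloc_guarded(struct event_base_like *base)
{
    assert(base != NULL);
    if (base->sig.sh_old != NULL) {
        debug_free(base->sig.sh_old, __FILE__, __LINE__);
        base->sig.sh_old = NULL;
    }
}
```

The guard silences the report, but it does not explain why the pointer was NULL when the deallocation routine ran, which is why it may be treating a symptom rather than the cause.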
[OMPI users] Question about restart
Greetings

I am trying to get the checkpoint/restart to work on a single machine
with openmpi 1.3 (also tried an svn check-out) and ran into a few
problems. I am guessing I am doing something wrong, and would
appreciate some help.

I built openmpi with:

    ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky \
        --enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile \
        --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries \
        --enable-trace --enable-static=yes --enable-debug \
        --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr \
        --enable-ft-thread --with-blcr=/usr/local/blcr/ \
        --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

I am using blcr 0.8.1 configured with:

    ./configure --prefix=/usr/local/blcr/ --enable-debug=yes \
        --enable-libcr-tracing=yes --enable-kernel-tracing=yes \
        --enable-testsuite=yes --enable-all-static=yes --enable-static=yes

Checkpoint works fine, without any problems. I run with:

    mpirun -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1 -am ft-enable-cr \
        -mca crs_verbose 1 -mca crs_blcr_verbose 1 matmultf.exe

I am able to checkpoint without any problems using

    ompi-checkpoint --status --term

but when I try to restart, I get the following error:

[yye00@localhost FTOpenMPI]$ ompi-restart -v ompi_global_snapshot_23858.ckpt
[localhost.localdomain:24394] Checking for the existence of (/home/yye00/ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394] Restarting from file (ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394] Exec in self
malloc debug: Invalid free (signal.c, 304)
malloc debug: Invalid free (signal.c, 304)
[localhost:23860] *** Process received signal ***
[localhost:23860] Signal: Bus error (7)
[localhost:23860] Signal code: (2)
[localhost:23860] Failing at address: 0x7fcbb737ef88
[localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
[localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1eccae]
[localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed5ba]
[localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed745]
[localhost:23860] [ 4] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc) [0x7fcbbcba2aa0]
[localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbdead1]
[localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbde8e2]
[localhost:23860] [ 7] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c) [0x7fcbbcbde17c]
[localhost:23860] [ 8] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2) [0x7fcbbcba45e9]
[localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0 [0x7fcbbced1d9d]
[localhost:23860] [10] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x7fcbbcba4509]
[localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcba4bc2]
[localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
[localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
[localhost:23860] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 24396 on node
localhost.localdomain exited on signal 7 (Bus error).
--

Running strace on ompi-restart did not provide any useful information.
Any suggestions are greatly appreciated.

Incidentally, line 304 of signal.c is in evsignal_dealloc, a
deallocation subroutine in opal; the actual line is
"free(base->sig.sh_old);". I am about to add debug statements to that
subroutine to see if I can get further information, but was hoping the
problem is more user-related than code-related.

Regards
Yaakoub El Khamra