Re: [OMPI devel] Unbelievable situation BUG
Mea culpa, I completely ignored the possible rollover of the sequence number. I will remove the commit ASAP.

Thanks,
george.

On Apr 27, 2008, at 12:33 PM, Gleb Natapov wrote:

> On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> > Hi, all
> >
> > I faced the "Unbelievable situation"
>
> The situation is believable, but commit r18274, which adds this output, is not, as it doesn't take sequence number wrap around into account.
>
> > while running the IMB benchmark.
> >
> > /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
> >
> > #
> > # Benchmarking Allreduce
> > # #processes = 96
> > #
> > #Benchmarking  #procs  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> > Allreduce      96      0       1000          0.02         0.03         0.02
> > Allreduce      96      4       1000          297.88       298.07       297.95
> > Allreduce      96      8       1000          296.15       296.32       296.24
> > Allreduce      96      16      1000          297.99       298.17       298.09
> > Allreduce      96      32      1000          296.97       297.20       297.04
> > Allreduce      96      64      1000          298.43       298.64       298.49
> > Allreduce      96      128     1000          296.86       297.07       296.93
> > Allreduce      96      256     1000          298.00       298.30       298.09
> > Allreduce      96      512     1000          296.79       296.96       296.85
> > Allreduce      96      1024    1000          299.23       299.39       299.31
> > Allreduce      96      2048    1000          295.51       295.64       295.57
> > Allreduce      96      4096    1000          246.02       246.13       246.08
> > Allreduce      96      8192    1000          492.52       492.74       492.63
> > Allreduce      96      16384   1000          5380.59      5381.47      5381.10
> > Allreduce      96      32768   1000          5372.86      5373.69      5373.36
> > Allreduce      96      65536   640           5470.41      5471.88      5471.16
> > Allreduce      96      131072  320           5554.52      5556.82      .75
> >
> > [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> > [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> > [witch24:15639] *** Process received signal ***
> > [witch24:15639] Signal: Segmentation fault (11)
> > [witch24:15639] Signal code: Address not mapped (1)
> > [witch24:15639] Failing at address: 0x632457d0
> > [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> > [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> > [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> > [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> > [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> > [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> > [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> > [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> > [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> > [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> > [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> > [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> > [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> > [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> > [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> > [witch24:15639] *** End of error message ***
> >
> > --
> > Best Regards,
> > Lenny.
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Gleb.
Re: [OMPI devel] Unbelievable situation BUG
On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all
>
> I faced the "Unbelievable situation"

The situation is believable, but commit r18274, which adds this output, is not, as it doesn't take sequence number wrap around into account.

> while running the IMB benchmark.
>
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>
> #
> # Benchmarking Allreduce
> # #processes = 96
> #
> #Benchmarking  #procs  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> Allreduce      96      0       1000          0.02         0.03         0.02
> Allreduce      96      4       1000          297.88       298.07       297.95
> Allreduce      96      8       1000          296.15       296.32       296.24
> Allreduce      96      16      1000          297.99       298.17       298.09
> Allreduce      96      32      1000          296.97       297.20       297.04
> Allreduce      96      64      1000          298.43       298.64       298.49
> Allreduce      96      128     1000          296.86       297.07       296.93
> Allreduce      96      256     1000          298.00       298.30       298.09
> Allreduce      96      512     1000          296.79       296.96       296.85
> Allreduce      96      1024    1000          299.23       299.39       299.31
> Allreduce      96      2048    1000          295.51       295.64       295.57
> Allreduce      96      4096    1000          246.02       246.13       246.08
> Allreduce      96      8192    1000          492.52       492.74       492.63
> Allreduce      96      16384   1000          5380.59      5381.47      5381.10
> Allreduce      96      32768   1000          5372.86      5373.69      5373.36
> Allreduce      96      65536   640           5470.41      5471.88      5471.16
> Allreduce      96      131072  320           5554.52      5556.82      .75
>
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> [witch24:15639] *** Process received signal ***
> [witch24:15639] Signal: Segmentation fault (11)
> [witch24:15639] Signal code: Address not mapped (1)
> [witch24:15639] Failing at address: 0x632457d0
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> [witch24:15639] *** End of error message ***
>
> --
> Best Regards,
> Lenny.

--
Gleb.
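
Note on the wrap around issue discussed in this thread: the following is a minimal C sketch, not the code from r18274 or from the ob1 PML. It assumes 16-bit sequence numbers, which the logged values (expected 65534, received 0) suggest but the thread does not state, and it assumes the common half-window convention for deciding which of two wrapping counters is older. The function names naive_is_behind, wrap_is_behind and seq_wrap_demo are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Naive ordering test: ignores wrap around, so a fragment numbered 0
 * that arrives while 65534 is expected looks "old". */
bool naive_is_behind(uint16_t seq, uint16_t expected)
{
    return seq < expected;
}

/* Wrap-aware ordering test (assumed half-window rule): compute the
 * distance from expected to seq modulo 65536. A distance in the upper
 * half of the range means seq lies behind expected; a distance in the
 * lower half means seq is equal to or ahead of expected. */
bool wrap_is_behind(uint16_t seq, uint16_t expected)
{
    uint16_t distance = (uint16_t)(seq - expected); /* reduced modulo 65536 */
    return distance >= 0x8000u;
}

/* Self-check using the first pair of values from the log above. */
void seq_wrap_demo(void)
{
    assert(naive_is_behind(0, 65534));    /* naive test reports a stale fragment */
    assert(!wrap_is_behind(0, 65534));    /* 0 is two steps ahead: 65534, 65535, 0 */
    assert(wrap_is_behind(65533, 65534)); /* hypothetical value, one step behind */
}
```

With 65534 expected, the naive comparison classifies sequence number 0 as already seen, while the modular comparison treats it as two steps ahead. The half-window rule only works while fewer than 32768 fragments can be outstanding at once; whether that holds for the real code is not something this thread establishes.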