Re: [OMPI devel] Unbelievable situation BUG

2008-04-27 Thread George Bosilca
Mea culpa, I completely ignored the possible rollover of the sequence
number. I will remove the commit asap.


  Thanks,
george.

On Apr 27, 2008, at 12:33 PM, Gleb Natapov wrote:

> On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
>> Hi, all
>>
>> I faced the "Unbelievable situation"
> The situation is believable, but commit r18274, that adds this output, is
> not, as it doesn't take into account sequence number wrap around.
>
>> during running IMB benchmark.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Re: [OMPI devel] Unbelievable situation BUG

2008-04-27 Thread Gleb Natapov
On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all 
> 
> I faced the "Unbelievable situation"
The situation is believable, but commit r18274, that adds this output, is
not, as it doesn't take into account sequence number wrap around.
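
To illustrate the wrap-around mentioned above, here is a minimal sketch in C. It assumes a 16-bit sequence counter, which the expected value 65534 in the log suggests, and it assumes that in-flight fragments span less than half of the sequence space. The helper names are hypothetical and are not taken from Open MPI or from commit r18274; the sketch only contrasts a plain comparison with a modulo-aware one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not taken from Open MPI. */

/* Plain comparison: correct only while the counter has not wrapped. */
static bool seq_older_naive(uint16_t seq, uint16_t expected)
{
    return seq < expected;
}

/* Wrap-aware comparison. The difference is reduced modulo 65536, and a
 * distance of 0x8000 or more is read as "seq lies behind expected".
 * Assumes in-flight fragments span less than half the sequence space. */
static bool seq_older_wrap(uint16_t seq, uint16_t expected)
{
    uint16_t dist = (uint16_t)(seq - expected);
    return dist >= 0x8000u;
}
```

For the first warning in the log, seq_older_naive(0, 65534) returns true, while seq_older_wrap(0, 65534) returns false, because 0 is two steps past 65534 modulo 65536. Whether the fragment with sequence number 65116 was a genuine duplicate cannot be decided from the log alone.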

> 
> during running IMB benchmark.
>
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile
> hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing
> Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>
> #
> # Benchmarking Allreduce
> # #processes = 96
> #
> #Benchmarking  #procs   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> Allreduce          96        0          1000         0.02         0.03         0.02
> Allreduce          96        4          1000       297.88       298.07       297.95
> Allreduce          96        8          1000       296.15       296.32       296.24
> Allreduce          96       16          1000       297.99       298.17       298.09
> Allreduce          96       32          1000       296.97       297.20       297.04
> Allreduce          96       64          1000       298.43       298.64       298.49
> Allreduce          96      128          1000       296.86       297.07       296.93
> Allreduce          96      256          1000       298.00       298.30       298.09
> Allreduce          96      512          1000       296.79       296.96       296.85
> Allreduce          96     1024          1000       299.23       299.39       299.31
> Allreduce          96     2048          1000       295.51       295.64       295.57
> Allreduce          96     4096          1000       246.02       246.13       246.08
> Allreduce          96     8192          1000       492.52       492.74       492.63
> Allreduce          96    16384          1000      5380.59      5381.47      5381.10
> Allreduce          96    32768          1000      5372.86      5373.69      5373.36
> Allreduce          96    65536           640      5470.41      5471.88      5471.16
> Allreduce          96   131072           320      5554.52      5556.82          .75
> 
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> [witch24:15639] *** Process received signal ***
> [witch24:15639] Signal: Segmentation fault (11)
> [witch24:15639] Signal code: Address not mapped (1)
> [witch24:15639] Failing at address: 0x632457d0
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> [witch24:15639] *** End of error message ***
> 
> 
> --
> 
> Best Regards,
> 
> Lenny.
> 
>  
> 


--
Gleb.