[OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Torgny Faxen

Hello,
we are seeing a large difference in performance for some applications 
depending on what MPI is being used.


Attached are performance numbers and oprofile output (first 30 lines) 
from one out of 14 nodes from one application run using OpenMPI, 
IntelMPI and Scali MPI respectively.


Scali MPI is faster than the other two MPIs by factors of 1.6 and 1.75:

ScaliMPI: walltime for the whole application is 214 seconds
OpenMPI: walltime for the whole application is 376 seconds
Intel MPI: walltime for the whole application is 346 seconds.

The application's main send/receive calls are:
MPI_Bsend
MPI_Iprobe followed by MPI_Recv (when a message is pending).
Quite often MPI_Iprobe is called just to check whether a certain message is pending.
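
For reference, a minimal, hypothetical Fortran sketch of that pattern (program name, tag, message and buffer sizes are invented for illustration; the actual rco2 code is not shown in this thread):

  program iprobe_pattern
    use mpi
    implicit none
    integer, parameter :: BUFBYTES = 4*1024*1024   ! Bsend attach buffer, size invented
    integer, parameter :: TAG = 17                 ! tag invented for illustration
    integer :: ierr, rank, nprocs, count, nleft
    logical :: flag
    integer :: status(MPI_STATUS_SIZE)
    double precision :: work(1000), incoming(1000)
    character, allocatable :: bsend_buf(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

    ! MPI_Bsend copies the message into a user-attached buffer and returns
    allocate(bsend_buf(BUFBYTES))
    call MPI_BUFFER_ATTACH(bsend_buf, BUFBYTES, ierr)

    if (rank /= 0) then
      work = rank
      call MPI_BSEND(work, size(work), MPI_DOUBLE_PRECISION, 0, TAG, &
                     MPI_COMM_WORLD, ierr)
    else
      nleft = nprocs - 1
      do while (nleft > 0)
        ! poll for a pending message; in Open MPI each MPI_Iprobe drives
        ! opal_progress and the BTL progress functions seen in the profile
        call MPI_IPROBE(MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, flag, status, ierr)
        if (flag) then
          call MPI_GET_COUNT(status, MPI_DOUBLE_PRECISION, count, ierr)
          call MPI_RECV(incoming, count, MPI_DOUBLE_PRECISION, &
                        status(MPI_SOURCE), TAG, MPI_COMM_WORLD, status, ierr)
          nleft = nleft - 1
        end if
        ! ... other work between probes ...
      end do
    end if

    call MPI_FINALIZE(ierr)
  end program iprobe_pattern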


Any ideas on tuning tips, performance analysis, or code modifications to 
improve the OpenMPI performance? A lot of time is being spent in 
"mca_btl_sm_component_progress", "btl_openib_component_progress" and 
other internal routines.


The code is running on a cluster with 140 HP ProLiant DL160 G5 compute 
servers. Infiniband interconnect. Intel Xeon E5462 processors. The 
profiled application is using 144 cores on 18 nodes over Infiniband.


Regards / Torgny
================================================================
OpenMPI 1.3b2
================================================================

Walltime: 376 seconds

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name            app name     symbol name
668288   22.2113  mca_btl_sm.so         rco2.24pe    mca_btl_sm_component_progress
441828   14.6846  rco2.24pe             rco2.24pe    step_
335929   11.1650  libmlx4-rdmav2.so     rco2.24pe    (no symbols)
301446   10.0189  mca_btl_openib.so     rco2.24pe    btl_openib_component_progress
161033    5.3521  libopen-pal.so.0.0.0  rco2.24pe    opal_progress
157024    5.2189  libpthread-2.5.so     rco2.24pe    pthread_spin_lock
99526     3.3079  no-vmlinux            no-vmlinux   (no symbols)
93887     3.1204  mca_btl_sm.so         rco2.24pe    opal_using_threads
69979     2.3258  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_iprobe
58895     1.9574  mca_bml_r2.so         rco2.24pe    mca_bml_r2_progress
55095     1.8311  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_recv_request_match_wild
49286     1.6381  rco2.24pe             rco2.24pe    tracer_
41946     1.3941  libintlc.so.5         rco2.24pe    __intel_new_memcpy
40730     1.3537  rco2.24pe             rco2.24pe    scobi_
36586     1.2160  rco2.24pe             rco2.24pe    state_
20986     0.6975  rco2.24pe             rco2.24pe    diag_
19321     0.6422  libmpi.so.0.0.0       rco2.24pe    PMPI_Unpack
18552     0.6166  libmpi.so.0.0.0       rco2.24pe    PMPI_Iprobe
17323     0.5757  rco2.24pe             rco2.24pe    clinic_
16194     0.5382  rco2.24pe             rco2.24pe    k_epsi_
15330     0.5095  libmpi.so.0.0.0       rco2.24pe    PMPI_Comm_f2c
13778     0.4579  libmpi_f77.so.0.0.0   rco2.24pe    mpi_iprobe_f
13241     0.4401  rco2.24pe             rco2.24pe    s_recv_
12386     0.4117  rco2.24pe             rco2.24pe    growth_
11699     0.3888  rco2.24pe             rco2.24pe    testnrecv_
11268     0.3745  libmpi.so.0.0.0       rco2.24pe    mca_pml_base_recv_request_construct
10971     0.3646  libmpi.so.0.0.0       rco2.24pe    ompi_convertor_unpack
10034     0.3335  mca_pml_ob1.so        rco2.24pe    mca_pml_ob1_recv_request_match_specific
10003     0.3325  libimf.so             rco2.24pe    exp.L
9375      0.3116  rco2.24pe             rco2.24pe    subbasin_
8912      0.2962  libmpi_f77.so.0.0.0   rco2.24pe    mpi_unpack_f




================================================================
Intel MPI, version 3.2.0.011
================================================================

Walltime: 346 seconds

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name            app name     symbol name
486712   17.7537  rco2                  rco2         step_
431941   15.7558  no-vmlinux            no-vmlinux   (no symbols)
212425    7.7486  libmpi.so.3.2         rco2
[remainder of the Intel MPI profile truncated in the archive]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)

Do you know if the application uses any collective operations?

Thanks

Pasha

Torgny Faxen wrote:

[quoted original message and oprofile output trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Torgny Faxen

Pasha,
no collectives are being used.

A simple grep in the code reveals the following MPI functions being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE

where MPI_IPROBE is the clear winner in terms of number of calls.

/Torgny

Pavel Shamis (Pasha) wrote:

Do you know if the application uses any collective operations?

Thanks

Pasha

Torgny Faxen wrote:

[quoted original message and oprofile output trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Ralph Castain
Could you send us the mpirun cmd line? I wonder if you are missing some
options that could help. Also, you might:

(a) upgrade to 1.3.3 - it looks like you are using some kind of pre-release
version

(b) add -mca mpi_show_mca_params env,file - this will cause rank=0 to output
what mca params it sees, and where they came from

(c) check that you built a non-debug version, and remembered to compile your
application with a -O3 flag - i.e., "mpicc -O3 ...". Remember, OMPI does not
automatically add optimization flags to mpicc!
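
For illustration, a hypothetical build-and-run line combining (b) and (c); the source files, binary name and process count are placeholders, not taken from this thread:

  mpif90 -O3 -o rco2.24pe *.f90
  mpirun -np 144 -mca mpi_show_mca_params env,file ./rco2.24pe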

Thanks
Ralph


On Wed, Aug 5, 2009 at 7:15 AM, Torgny Faxen  wrote:

> [earlier messages quoted in full; trimmed]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)

Torgny,
We have one known issue in the openib BTL related to MPI_Iprobe:
https://svn.open-mpi.org/trac/ompi/ticket/1362
In theory it could be the cause of the performance degradation, but to me 
the performance difference sounds too big.


* Do you know what the typical message size is for this application?
* Did you enable leave_pinned (--mca mpi_leave_pinned 1)?
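
For example, the leave_pinned parameter could be passed on the command line roughly like this (a sketch; the binary name and process count are placeholders):

  mpirun -np 144 --mca mpi_leave_pinned 1 ./rco2.24pe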

I also recommend reading this FAQ: 
http://netmirror.org/mirror/open-mpi.org/faq/?category=tuning#running-perf-numbers


Pasha


Torgny Faxen wrote:

[earlier messages quoted in full; trimmed]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Terry Dontje
We've found that on certain applications, binding to processors can make up 
to a 2x difference. Scali MPI automatically binds processes by socket, so if 
you are not running a one-process-per-CPU job, each process will land on a 
different socket.

OMPI defaults to not binding at all. You may want to try the rankfile option 
(see the mpirun man page) and see if that helps any.
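
As a rough sketch (hostnames and slot numbers are invented; check the rankfile section of the mpirun man page for the exact syntax in your version), a rankfile pinning eight ranks to the two quad-core sockets of one node might look like:

  rank 0=n01 slot=0:0
  rank 1=n01 slot=0:1
  rank 2=n01 slot=0:2
  rank 3=n01 slot=0:3
  rank 4=n01 slot=1:0
  rank 5=n01 slot=1:1
  rank 6=n01 slot=1:2
  rank 7=n01 slot=1:3

and be passed to mpirun with something like "mpirun -np 144 -rf my_rankfile ./rco2.24pe".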


If the above doesn't improve anything, the next question is: do you know 
what the sizes of the messages are? For very small messages I believe Scali 
shows 2x better performance than Intel MPI and OMPI (I think this is due to 
a fast-path optimization).


--td

Message: 1
Date: Wed, 05 Aug 2009 15:15:52 +0200
From: Torgny Faxen
Subject: Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

[quoted message trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)




If the above doesn't improve anything, the next question is: do you know 
what the sizes of the messages are? For very small messages I believe Scali 
shows 2x better performance than Intel MPI and OMPI (I think this is due to 
a fast-path optimization).


I remember that MVAPICH was faster than Scali for small messages (I'm 
talking only about IB, not sm).
Open MPI 1.3 latency is very close to MVAPICH latency, so I do not see how 
Scali's latency could be better than OMPI's.


Pasha


Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Terry Dontje
A comment on my message quoted below: the 2x performance figure I mentioned 
was for shared-memory communications.


--td

Message: 3
Date: Wed, 05 Aug 2009 09:55:42 -0400
From: Terry Dontje
Subject: Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

[previous message quoted in full; trimmed]




Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Torgny Faxen

Ralph,
I am running through a locally provided wrapper but it translates to:
/software/mpi/openmpi/1.3b2/i101017/bin/mpirun -np 144 -npernode 8 -mca mpi_show_mca_params env,file /nobackup/rossby11/faxen/RCO_scobi/src_161.openmpi/rco2.24pe

a) Upgrade: this will take some time, as it has to go through the 
administrator; this is a production cluster.

b) -mca: see output below.
c) I used exactly the same optimization flags for all three versions 
(Scali MPI, OpenMPI and Intel MPI), and this is Fortran so I am using 
mpif90 :-)


Regards / Torgny

[n70:30299] ess=env (environment)
[n70:30299] orte_ess_jobid=482607105 (environment)
[n70:30299] orte_ess_vpid=0 (environment)
[n70:30299] mpi_yield_when_idle=0 (environment)
[n70:30299] mpi_show_mca_params=env,file (environment)


Ralph Castain wrote:
Could you send us the mpirun cmd line? I wonder if you are missing 
some options that could help. Also, you might:


(a) upgrade to 1.3.3 - it looks like you are using some kind of 
pre-release version


(b) add -mca mpi_show_mca_params env,file - this will cause rank=0 to 
output what mca params it sees, and where they came from


(c) check that you built a non-debug version, and remembered to 
compile your application with a -O3 flag - i.e., "mpicc -O3 ...". 
Remember, OMPI does not automatically add optimization flags to mpicc!


Thanks
Ralph


On Wed, Aug 5, 2009 at 7:15 AM, Torgny Faxen wrote:

[earlier quoted messages trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Ralph Castain
Okay, one problem is fairly clear. As Terry indicated, you have to tell us
to bind or else you lose a lot of performance. Set -mca opal_paffinity_alone
1 on your cmd line and it should make a significant difference.


On Wed, Aug 5, 2009 at 8:10 AM, Torgny Faxen  wrote:

> [earlier quoted messages trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Torgny Faxen

Ralph,
I can't get "opal_paffinity_alone" to work (see below). However, there 
is a "mpi_paffinity_alone" that I tried without any improvement.


However, setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement so OpenMPI is now down to 326 (from previous 376 
seconds). Still a lot more than ScaliMPI with 214 seconds.
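
For reference, a hypothetical direct mpirun line with that setting (the real runs go through the local wrapper shown earlier):

  mpirun -np 144 -npernode 8 -mca btl_openib_eager_limit 65536 ./rco2.24pe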


Looking at the profile data, my gut feeling is that the performance 
suffers due to the frequent calls to MPI_IPROBE. I will look at this and 
count the number of calls, but there could easily be 10 times more calls to 
MPI_IPROBE than to MPI_BSEND.
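
One possible way to count the calls without touching the application is to interpose the Fortran profiling (PMPI) layer; a sketch, assuming the MPI library provides the Fortran PMPI_ entry points (module and variable names are invented):

  module iprobe_stats
    integer :: n_iprobe = 0
  end module iprobe_stats

  subroutine MPI_IPROBE(source, tag, comm, flag, status, ierror)
    use iprobe_stats
    implicit none
    include 'mpif.h'
    integer :: source, tag, comm, ierror
    integer :: status(MPI_STATUS_SIZE)
    logical :: flag
    n_iprobe = n_iprobe + 1                        ! count every probe
    call PMPI_IPROBE(source, tag, comm, flag, status, ierror)
  end subroutine MPI_IPROBE

  subroutine MPI_FINALIZE(ierror)
    use iprobe_stats
    implicit none
    integer :: ierror
    print *, 'MPI_IPROBE calls on this rank:', n_iprobe
    call PMPI_FINALIZE(ierror)
  end subroutine MPI_FINALIZE

Compiling these into the application, so they are linked ahead of the MPI library's own wrappers, would print a per-rank call count at MPI_FINALIZE.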


/Torgny

n70 462% ompi_info --param all all | grep opal
               MCA opal: parameter "opal_signal" (current value: "6,7,8,11", data source: default value)
               MCA opal: parameter "opal_set_max_sys_limits" (current value: "0", data source: default value)
               MCA opal: parameter "opal_event_include" (current value: "poll", data source: default value)

n70 463% ompi_info --param all all | grep paffinity
                MCA mpi: parameter "mpi_paffinity_alone" (current value: "0", data source: default value)
          MCA paffinity: parameter "paffinity_base_verbose" (current value: "0", data source: default value)
                         Verbosity level of the paffinity framework
          MCA paffinity: parameter "paffinity" (current value: <none>, data source: default value)
                         Default selection set of components for the paffinity framework (<none> means use all components that can be found)
          MCA paffinity: parameter "paffinity_linux_priority" (current value: "10", data source: default value)
                         Priority of the linux paffinity component
          MCA paffinity: information "paffinity_linux_plpa_version" (value: "1.2rc2", data source: default value)





Ralph Castain wrote:
Okay, one problem is fairly clear. As Terry indicated, you have to 
tell us to bind or else you lose a lot of performance. Set -mca 
opal_paffinity_alone 1 on your cmd line and it should make a 
significant difference.



On Wed, Aug 5, 2009 at 8:10 AM, Torgny Faxen wrote:

[earlier quoted messages trimmed; see above]

Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-05 Thread Pavel Shamis (Pasha)



However, setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement so OpenMPI is now down to 326 (from previous 
376 seconds). Still a lot more than ScaliMPI with 214 seconds.
Can you please run ibv_devinfo on one of the compute nodes? It would be 
interesting to know what kind of IB hardware you have on your cluster.


Pasha


Re: [OMPI users] Performance difference on OpenMPI, IntelMPI and ScaliMPI

2009-08-06 Thread Torgny Faxen

Pasha,
see the attached file.

I have traced how MPI_IPROBE is called and also managed to significantly 
reduce the number of calls to MPI_IPROBE. Unfortunately, this only 
resulted in the program spending its time in other routines. Basically, the 
code runs through a number of timesteps, and after each timestep all 
slave nodes wait for the master to give the go-ahead for the next step; 
this is where a lot of time is being spent. Either a load imbalance, or 
just waiting for all MPI_BSENDs to complete, or something else.

I am kind of stuck right now and will have to do some more 
investigation. It is strange that this works so much better using Scali MPI.


Regards / Torgny

Pavel Shamis (Pasha) wrote:



However, setting:
-mca btl_openib_eager_limit 65536
gave a 15% improvement so OpenMPI is now down to 326 (from previous 
376 seconds). Still a lot more than ScaliMPI with 214 seconds.
Can you please run ibv_devinfo on one of the compute nodes? It would be 
interesting to know what kind of IB hardware you have on your cluster.


Pasha




--
-
  Torgny Faxén  
  National Supercomputer Center
  Linköping University  
  S-581 83 Linköping
  Sweden

  Email:fa...@nsc.liu.se
  Telephone: +46 13 285798 (office) +46 13 282535  (fax)
  http://www.nsc.liu.se
-


hca_id: mlx4_0
fw_ver: 2.5.000
node_guid:  001e:0bff:ff4c:1bf4
sys_image_guid: 001e:0bff:ff4c:1bf7
vendor_id:  0x02c9
vendor_part_id: 25418
hw_ver: 0xA0
board_id:   HP_09D001
phys_port_cnt:  2
port:   1
state:  active (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid:   132
port_lmc:   0x00

port:   2
state:  down (1)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid:   0
port_lmc:   0x00