We've found that on certain applications, binding to processors can make up to a 2x difference. Scali MPI automatically binds processes by socket, so if you are not running a one-process-per-CPU job each process will land on a different socket. Open MPI defaults to not binding at all. You may want to try the rankfile option (see the mpirun manpage) and see if that helps.
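
For example, a minimal sketch (untested; check the mpirun(1) manpage of your Open MPI release for the exact option names and rankfile syntax, and note that the hostnames below are just placeholders):

    $ cat myrankfile
    rank 0=node001 slot=0:0
    rank 1=node001 slot=0:1
    rank 2=node001 slot=1:0
    rank 3=node001 slot=1:1
    $ mpirun -np 4 -rf myrankfile ./rco2.24pe

Here "slot=socket:core" pins each rank to a specific socket/core pair, so you can force one rank per core (or spread ranks across sockets) instead of leaving placement to chance.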

If the above doesn't improve anything, the next question is: do you know the sizes of the messages? For very small messages I believe Scali shows 2x better performance than Intel MPI and Open MPI (I think this is due to a fast-path optimization).

--td
Message: 1
Date: Wed, 05 Aug 2009 15:15:52 +0200
From: Torgny Faxen <fa...@nsc.liu.se>
Subject: Re: [OMPI users] Performance difference on OpenMPI, IntelMPI
        and ScaliMPI
To: pa...@dev.mellanox.co.il, Open MPI Users <us...@open-mpi.org>
Message-ID: <4a798608.5030...@nsc.liu.se>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Pasha,
no collectives are being used.

A simple grep in the code reveals the following MPI functions being used:
MPI_Init
MPI_wtime
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_BUFFER_ATTACH
MPI_BSEND
MPI_PACK
MPI_UNPACK
MPI_PROBE
MPI_GET_COUNT
MPI_RECV
MPI_IPROBE
MPI_FINALIZE

where MPI_IPROBE is the clear winner in terms of number of calls.
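
For reference, the probe/receive part of that pattern looks roughly like the following (a minimal C sketch only; the real code is Fortran, and the tag, datatype and buffer size here are placeholders):

    #include <mpi.h>

    /* Poll for a pending message with MPI_Iprobe and receive it only if
       one is actually there; otherwise return immediately. */
    void poll_and_receive(int tag, double *buf, int maxcount)
    {
        int flag = 0;
        MPI_Status status;

        MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            int count;
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            if (count <= maxcount)
                MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE,
                         status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

The probe is often called just to check whether a certain message is pending, which is why MPI_IPROBE dominates the call counts.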

/Torgny

Pavel Shamis (Pasha) wrote:
> Do you know if the application use some collective operations ?
>
> Thanks
>
> Pasha
>
> Torgny Faxen wrote:
>> Hello,
>> we are seeing a large difference in performance for some applications
>> depending on what MPI is being used.
>>
>> Attached are performance numbers and oprofile output (first 30 lines)
>> from one out of 14 nodes from one application run using OpenMPI,
>> IntelMPI and Scali MPI respectively.
>>
>> Scali MPI is faster than the other two MPIs by factors of 1.6 and 1.75:
>>
>> ScaliMPI: walltime for the whole application is 214 seconds
>> OpenMPI: walltime for the whole application is 376 seconds
>> Intel MPI: walltime for the whole application is 346 seconds.
>>
>> The application is running with the main send receive commands being:
>> MPI_Bsend
>> MPI_Iprobe followed by MPI_Recv (in case there is a message).
>> Quite often MPI_Iprobe is being called just to check whether there is
>> a certain message pending.
>>
>> Any ideas on tuning tips, performance analysis, or code modifications
>> to improve the OpenMPI performance? A lot of time is being spent in
>> "mca_btl_sm_component_progress", "btl_openib_component_progress" and
>> other internal routines.
>>
>> The code is running on a cluster with 140 HP ProLiant DL160 G5
>> compute servers, Infiniband interconnect, and Intel Xeon E5462
>> processors. The profiled application is using 144 cores on 18 nodes
>> over Infiniband.
>>
>> Regards / Torgny
>> =====================================================================================
>> OpenMPI  1.3b2
>> =====================================================================================
>>
>> Walltime: 376 seconds
>>
>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>> Profiling through timer interrupt
>> samples  %        image name             app name     symbol name
>> 668288   22.2113  mca_btl_sm.so          rco2.24pe    mca_btl_sm_component_progress
>> 441828   14.6846  rco2.24pe              rco2.24pe    step_
>> 335929   11.1650  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
>> 301446   10.0189  mca_btl_openib.so      rco2.24pe    btl_openib_component_progress
>> 161033    5.3521  libopen-pal.so.0.0.0   rco2.24pe    opal_progress
>> 157024    5.2189  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
>> 99526     3.3079  no-vmlinux             no-vmlinux   (no symbols)
>> 93887     3.1204  mca_btl_sm.so          rco2.24pe    opal_using_threads
>> 69979     2.3258  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_iprobe
>> 58895     1.9574  mca_bml_r2.so          rco2.24pe    mca_bml_r2_progress
>> 55095     1.8311  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_wild
>> 49286     1.6381  rco2.24pe              rco2.24pe    tracer_
>> 41946     1.3941  libintlc.so.5          rco2.24pe    __intel_new_memcpy
>> 40730     1.3537  rco2.24pe              rco2.24pe    scobi_
>> 36586     1.2160  rco2.24pe              rco2.24pe    state_
>> 20986     0.6975  rco2.24pe              rco2.24pe    diag_
>> 19321     0.6422  libmpi.so.0.0.0        rco2.24pe    PMPI_Unpack
>> 18552     0.6166  libmpi.so.0.0.0        rco2.24pe    PMPI_Iprobe
>> 17323     0.5757  rco2.24pe              rco2.24pe    clinic_
>> 16194     0.5382  rco2.24pe              rco2.24pe    k_epsi_
>> 15330     0.5095  libmpi.so.0.0.0        rco2.24pe    PMPI_Comm_f2c
>> 13778     0.4579  libmpi_f77.so.0.0.0    rco2.24pe    mpi_iprobe_f
>> 13241     0.4401  rco2.24pe              rco2.24pe    s_recv_
>> 12386     0.4117  rco2.24pe              rco2.24pe    growth_
>> 11699     0.3888  rco2.24pe              rco2.24pe    testnrecv_
>> 11268     0.3745  libmpi.so.0.0.0        rco2.24pe    mca_pml_base_recv_request_construct
>> 10971     0.3646  libmpi.so.0.0.0        rco2.24pe    ompi_convertor_unpack
>> 10034     0.3335  mca_pml_ob1.so         rco2.24pe    mca_pml_ob1_recv_request_match_specific
>> 10003     0.3325  libimf.so              rco2.24pe    exp.L
>> 9375      0.3116  rco2.24pe              rco2.24pe    subbasin_
>> 8912      0.2962  libmpi_f77.so.0.0.0    rco2.24pe    mpi_unpack_f
>>
>>
>>
>> =====================================================================================
>> Intel MPI, version 3.2.0.011
>> =====================================================================================
>>
>> Walltime: 346 seconds
>>
>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>> Profiling through timer interrupt
>> samples  %        image name             app name     symbol name
>> 486712   17.7537  rco2                   rco2         step_
>> 431941   15.7558  no-vmlinux             no-vmlinux   (no symbols)
>> 212425    7.7486  libmpi.so.3.2          rco2         MPIDI_CH3U_Recvq_FU
>> 188975    6.8932  libmpi.so.3.2          rco2         MPIDI_CH3I_RDSSM_Progress
>> 172855    6.3052  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress
>> 121472    4.4309  libmpi.so.3.2          rco2         MPIDI_CH3I_SHM_read_progress
>> 64492     2.3525  libc-2.5.so            rco2         sched_yield
>> 52372     1.9104  rco2                   rco2         tracer_
>> 48621     1.7735  libmpi.so.3.2          rco2         .plt
>> 45475     1.6588  libmpiif.so.3.2        rco2         pmpi_iprobe__
>> 44082     1.6080  libmpi.so.3.2          rco2         MPID_Iprobe
>> 42788     1.5608  libmpi.so.3.2          rco2         MPIDI_CH3_Stop_recv
>> 42754     1.5595  libpthread-2.5.so      rco2         pthread_mutex_lock
>> 42190     1.5390  libmpi.so.3.2          rco2         PMPI_Iprobe
>> 41577     1.5166  rco2                   rco2         scobi_
>> 40356     1.4721  libmpi.so.3.2          rco2         MPIDI_CH3_Start_recv
>> 38582     1.4073  libdaplcma.so.1.0.2    rco2         (no symbols)
>> 37545     1.3695  rco2                   rco2         state_
>> 35597     1.2985  libc-2.5.so            rco2         free
>> 34019     1.2409  libc-2.5.so            rco2         malloc
>> 31841     1.1615  rco2                   rco2         s_recv_
>> 30955     1.1291  libmpi.so.3.2          rco2         __I_MPI___intel_new_memcpy
>> 27876     1.0168  libc-2.5.so            rco2         _int_malloc
>> 26963     0.9835  rco2                   rco2         testnrecv_
>> 23005     0.8391  libpthread-2.5.so      rco2         __pthread_mutex_unlock_usercnt
>> 22290     0.8131  libmpi.so.3.2          rco2         MPID_Segment_manipulate
>> 22086     0.8056  libmpi.so.3.2          rco2         MPIDI_CH3I_read_progress_expected
>> 19146     0.6984  rco2                   rco2         diag_
>> 18250     0.6657  rco2                   rco2         clinic_
>> =====================================================================================
>> Scali MPI, version 3.13.10-59413
>> =====================================================================================
>>
>> Walltime: 214 seconds
>>
>> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
>> Profiling through timer interrupt
>> samples  %        image name             app name     symbol name
>> 484267   30.0664  rco2.24pe              rco2.24pe    step_
>> 111949    6.9505  libmlx4-rdmav2.so      rco2.24pe    (no symbols)
>> 73930     4.5900  libmpi.so              rco2.24pe    scafun_rq_handle_body
>> 57846     3.5914  libmpi.so              rco2.24pe    invert_decode_header
>> 55836     3.4667  libpthread-2.5.so      rco2.24pe    pthread_spin_lock
>> 53703     3.3342  rco2.24pe              rco2.24pe    tracer_
>> 40934     2.5414  rco2.24pe              rco2.24pe    scobi_
>> 40244     2.4986  libmpi.so              rco2.24pe    scafun_request_probe_handler
>> 37399     2.3220  rco2.24pe              rco2.24pe    state_
>> 30455     1.8908  libmpi.so              rco2.24pe    invert_matchandprobe
>> 29707     1.8444  no-vmlinux             no-vmlinux   (no symbols)
>> 29147     1.8096  libmpi.so              rco2.24pe    FMPI_scafun_Iprobe
>> 27969     1.7365  libmpi.so              rco2.24pe    decode_8_u_64
>> 27475     1.7058  libmpi.so              rco2.24pe    scafun_rq_anysrc_fair_one
>> 25966     1.6121  libmpi.so              rco2.24pe    scafun_uxq_probe
>> 24380     1.5137  libc-2.5.so            rco2.24pe    memcpy
>> 22615     1.4041  libmpi.so              rco2.24pe    .plt
>> 21172     1.3145  rco2.24pe              rco2.24pe    diag_
>> 20716     1.2862  libc-2.5.so            rco2.24pe    memset
>> 18565     1.1526  libmpi.so              rco2.24pe    openib_wrapper_poll_cq
>> 18192     1.1295  rco2.24pe              rco2.24pe    clinic_
>> 17135     1.0638  libmpi.so              rco2.24pe    PMPI_Iprobe
>> 16685     1.0359  rco2.24pe              rco2.24pe    k_epsi_
>> 16236     1.0080  libmpi.so              rco2.24pe    PMPI_Unpack
>> 15563     0.9662  libmpi.so              rco2.24pe    scafun_r_rq_append
>> 14829     0.9207  libmpi.so              rco2.24pe    scafun_rq_test_finished
>> 13349     0.8288  rco2.24pe              rco2.24pe    s_recv_
>> 12490     0.7755  libmpi.so              rco2.24pe    flop_matchandprobe
>> 12427     0.7715  libibverbs.so.1.0.0    rco2.24pe    (no symbols)
>> 12272     0.7619  libmpi.so              rco2.24pe    scafun_rq_handle
>> 12146     0.7541  rco2.24pe              rco2.24pe    growth_
>> 10175     0.6317  libmpi.so              rco2.24pe    wrp2p_test_finished
>> 9888      0.6139  libimf.so              rco2.24pe    exp.L
>> 9179      0.5699  rco2.24pe              rco2.24pe    subbasin_
>> 9082      0.5639  rco2.24pe              rco2.24pe    testnrecv_
>> 8901      0.5526  libmpi.so              rco2.24pe    openib_wrapper_purge_requests
>> 7425      0.4610  rco2.24pe              rco2.24pe    scobimain_
>> 7378      0.4581  rco2.24pe              rco2.24pe    scobi_interface_
>> 6530      0.4054  rco2.24pe              rco2.24pe    setvbc_
>> 6471      0.4018  libfmpi.so             rco2.24pe    pmpi_iprobe
>> 6341      0.3937  rco2.24pe              rco2.24pe    snap_
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


--
---------------------------------------------------------
Torgny Faxén
National Supercomputer Center
Linköping University
S-581 83 Linköping
Sweden
