Never mind; Nathan just clarified that the results are not comparable.

-Paul [Sent from my phone]
On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:

> Interestingly enough, the 4MB latency actually improved significantly
> relative to the initial numbers.
>
> -Paul [Sent from my phone]
> On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
>
>> These results are much worse than the ones you sent in your previous email.
>> What is the reason?
>>
>>   George.
>>
>> On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
>>
>> > Ah, good catch. A new version is attached that should eliminate the race
>> > window for the multi-threaded case. Performance numbers are still looking
>> > really good. We beat mvapich2 in the small-message ping-pong by a good
>> > margin; see the results below. The latency difference for large messages
>> > is probably due to a difference in the max send size for vader vs.
>> > mvapich.
>> >
>> > To answer Pasha's question: I don't see a noticeable difference in
>> > performance for BTLs with no sendi function (this includes
>> > ugni). OpenIB should get a boost. I will test that once I get an
>> > allocation.
>> >
>> > CPU: Xeon E5-2670 @ 2.60 GHz
>> >
>> > Open MPI (-mca btl vader,self):
>> > # OSU MPI Latency Test v4.1
>> > # Size          Latency (us)
>> > 0                       0.17
>> > 1                       0.19
>> > 2                       0.19
>> > 4                       0.19
>> > 8                       0.19
>> > 16                      0.19
>> > 32                      0.19
>> > 64                      0.40
>> > 128                     0.40
>> > 256                     0.43
>> > 512                     0.52
>> > 1024                    0.67
>> > 2048                    0.94
>> > 4096                    1.44
>> > 8192                    2.04
>> > 16384                   3.47
>> > 32768                   6.10
>> > 65536                   9.38
>> > 131072                 16.47
>> > 262144                 29.63
>> > 524288                 54.81
>> > 1048576               106.63
>> > 2097152               206.84
>> > 4194304               421.26
>> >
>> >
>> > mvapich2 1.9:
>> > # OSU MPI Latency Test
>> > # Size            Latency (us)
>> > 0                         0.23
>> > 1                         0.23
>> > 2                         0.23
>> > 4                         0.23
>> > 8                         0.23
>> > 16                        0.28
>> > 32                        0.28
>> > 64                        0.39
>> > 128                       0.40
>> > 256                       0.40
>> > 512                       0.42
>> > 1024                      0.51
>> > 2048                      0.71
>> > 4096                      1.02
>> > 8192                      1.60
>> > 16384                     3.47
>> > 32768                     5.05
>> > 65536                     8.06
>> > 131072                   14.82
>> > 262144                   28.15
>> > 524288                   53.69
>> > 1048576                 127.47
>> > 2097152                 235.58
>> > 4194304                 683.90
>> >
>> >
>> > -Nathan
>> >
>> > On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>> >>   The local request is not correctly released, leading to an assert in
>> >>   debug mode. This is because you avoid calling
>> >>   MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an ACTIVE
>> >>   state, a condition carefully checked during the call to the destructor.
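The fix amounts to returning the locally managed request to an inactive state before its destructor runs. A minimal sketch of that idea, with illustrative structure and field usage rather than the exact contents of either attached patch:

    /* Blocking receive using a locally managed request (sketch only; the
     * real ob1 code differs in structure and field names). */
    mca_pml_ob1_recv_request_t recvreq;

    OBJ_CONSTRUCT(&recvreq, mca_pml_ob1_recv_request_t);
    /* ... initialize the request, post it, wait for completion ... */

    /* Without this call the request stays in the ACTIVE state and the
     * destructor's debug-mode assertion fires. */
    MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq.req_recv);

    OBJ_DESTRUCT(&recvreq);  /* safe now: the request is no longer ACTIVE */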
>> >>
>> >>   I attached a second patch that fixes the issue above and implements a
>> >>   similar optimization for the blocking send.
>> >>
>> >>   Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>> >>   optimization is horribly wrong in the multithreaded case, as it alters
>> >>   the send_sequence without storing it. If you create a gap in the
>> >>   send_sequence, a deadlock will definitely occur. I strongly suggest you
>> >>   turn off the mca_pml_ob1_send_inline optimization in the multithreaded
>> >>   case. All the other optimizations should be safe in all cases.
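For context, the reason a sequence gap deadlocks: ob1 stamps each match header with the peer's next send_sequence value, and the receiver delivers matched messages strictly in sequence order. A rough sketch of the hazard, using illustrative names rather than the actual patch code:

    /* Inline-send fast path (sketch; names and arguments are illustrative). */
    seq = OPAL_THREAD_ADD32(&ob1_proc->send_sequence, 1);  /* sequence claimed */

    rc = btl->btl_sendi(btl, endpoint, /* convertor, match header, sizes,
                                          order, flags, tag, */ &descriptor);
    if (OMPI_SUCCESS != rc) {
        /* With threads enabled another sender may already have claimed
         * seq + 1, so the counter cannot simply be rolled back.  If "seq"
         * never goes out on the wire the receiver waits for it forever, and
         * every later message from this process is stuck behind the gap.
         * The fallback path must reuse this same seq, or the inline
         * optimization must be disabled when thread support is enabled. */
    }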
>> >>
>> >>     George.
>> >>
>> >>   On Jan 8, 2014, at 01:15 , Shamis, Pavel <sham...@ornl.gov> wrote:
>> >>
>> >>> Overall it looks good. It would be helpful to validate performance
>> >>> numbers for other interconnects as well.
>> >>> -Pasha
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>> >>>> Hjelm
>> >>>> Sent: Tuesday, January 07, 2014 6:45 PM
>> >>>> To: Open MPI Developers List
>> >>>> Subject: [OMPI devel] RFC: OB1 optimizations
>> >>>>
>> >>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>> >>>>
>> >>>> What: This patch contains two optimizations:
>> >>>>
>> >>>> - Introduce a fast send path for blocking send calls. This path uses
>> >>>>   the btl sendi function to put the data on the wire without the need
>> >>>>   for setting up a send request. In the case of btl/vader this can
>> >>>>   also avoid allocating/initializing a new fragment. With btl/vader
>> >>>>   this optimization improves small message latency by 50-200ns in
>> >>>>   ping-pong type benchmarks. Larger messages may take a small hit in
>> >>>>   the range of 10-20ns.
>> >>>>
>> >>>> - Use a stack-allocated receive request for blocking receives. This
>> >>>>   optimization saves the extra instructions associated with accessing
>> >>>>   the receive request free list. I was able to get another 50-200ns
>> >>>>   improvement in the small-message ping-pong with this optimization. I
>> >>>>   see no hit for larger messages.
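To make the first optimization concrete, here is a minimal sketch of a sendi-based fast path for a blocking send. The names below are hypothetical placeholders for illustration, not the symbols used in the attached patch:

    /* Sketch of a blocking-send fast path (hypothetical helper names). */
    static int blocking_send_fast_path(mca_btl_base_module_t *btl,
                                       struct mca_btl_base_endpoint_t *ep,
                                       const void *buf, size_t size)
    {
        if (NULL != btl->btl_sendi) {
            /* Send-immediate: the BTL injects the data right away, so no
             * send request (and, for vader, no fragment) is set up at all. */
            int rc = btl->btl_sendi(btl, ep, /* convertor, header, sizes, ... */);
            if (OMPI_SUCCESS == rc) {
                return OMPI_SUCCESS;       /* message is already on the wire */
            }
        }

        /* No sendi, or the BTL could not take the message immediately:
         * fall back to the normal request-based send path. */
        return request_based_send(buf, size /* , ... */);
    }

The second optimization is analogous on the receive side: the blocking receive builds its request in a stack variable instead of pulling one from the free list, which is why the request must be returned to an inactive state before destruction, as George notes above.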
>> >>>>
>> >>>> When: These changes touch the critical path in ob1 and are targeted for
>> >>>> 1.7.5. As such I will set a moderately long timeout. Timeout set for
>> >>>> next Friday (Jan 17).
>> >>>>
>> >>>> Some results from osu_latency on haswell:
>> >>>>
>> >>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self
>> >>>> ./osu_latency
>> >>>> # OSU MPI Latency Test v4.0.1
>> >>>> # Size          Latency (us)
>> >>>> 0                       0.11
>> >>>> 1                       0.14
>> >>>> 2                       0.14
>> >>>> 4                       0.14
>> >>>> 8                       0.14
>> >>>> 16                      0.14
>> >>>> 32                      0.15
>> >>>> 64                      0.18
>> >>>> 128                     0.36
>> >>>> 256                     0.37
>> >>>> 512                     0.46
>> >>>> 1024                    0.56
>> >>>> 2048                    0.80
>> >>>> 4096                    1.12
>> >>>> 8192                    1.68
>> >>>> 16384                   2.98
>> >>>> 32768                   5.10
>> >>>> 65536                   8.12
>> >>>> 131072                 14.07
>> >>>> 262144                 25.30
>> >>>> 524288                 47.40
>> >>>> 1048576                91.71
>> >>>> 2097152               195.56
>> >>>> 4194304               487.05
>> >>>>
>> >>>>
>> >>>> Patch Attached.
>> >>>>
>> >>>> -Nathan
>> >>
>> >
>> >
>> >
>> > <ob1_optimization_take3.patch>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
