Yeah. It's hard to say what the results will look like on Haswell. I
expect they should show some improvement from George's change, but we
won't know until I can get to a Haswell node. Hopefully one becomes
available today.

-Nathan

On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
>    Nevermind, since Nathan just clarified that the results are not
>    comparable.
> 
>    -Paul [Sent from my phone]
> 
>    On Jan 8, 2014 8:58 AM, "Paul Hargrove" <phhargr...@lbl.gov> wrote:
> 
>      Interestingly enough the 4MB latency actually improved significantly
>      relative to the initial numbers.
> 
>      -Paul [Sent from my phone]
> 
>      On Jan 8, 2014 8:50 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:
> 
>        These results are much worse than the ones you sent in your previous
>        email. What is the reason?
> 
>          George.
> 
>        On Jan 8, 2014, at 17:33 , Nathan Hjelm <hje...@lanl.gov> wrote:
> 
>        > Ah, good catch. A new version is attached that should eliminate the
>        > race window for the multi-threaded case. Performance numbers are
>        > still looking really good. We beat mvapich2 in the small message
>        > ping-pong by a good margin. See the results below. The latency
>        > difference for large messages is probably due to a difference in
>        > the max send size for vader vs. mvapich.
>        >
>        > To answer Pasha's question: I don't see a noticeable difference in
>        > performance for BTLs with no sendi function (this includes
>        > ugni). OpenIB should get a boost. I will test that once I get an
>        > allocation.
>        >
>        > CPU: Xeon E5-2670 @ 2.60 GHz
>        >
>        > Open MPI (-mca btl vader,self):
>        > # OSU MPI Latency Test v4.1
>        > # Size          Latency (us)
>        > 0                       0.17
>        > 1                       0.19
>        > 2                       0.19
>        > 4                       0.19
>        > 8                       0.19
>        > 16                      0.19
>        > 32                      0.19
>        > 64                      0.40
>        > 128                     0.40
>        > 256                     0.43
>        > 512                     0.52
>        > 1024                    0.67
>        > 2048                    0.94
>        > 4096                    1.44
>        > 8192                    2.04
>        > 16384                   3.47
>        > 32768                   6.10
>        > 65536                   9.38
>        > 131072                 16.47
>        > 262144                 29.63
>        > 524288                 54.81
>        > 1048576               106.63
>        > 2097152               206.84
>        > 4194304               421.26
>        >
>        >
>        > mvapich2 1.9:
>        > # OSU MPI Latency Test
>        > # Size            Latency (us)
>        > 0                         0.23
>        > 1                         0.23
>        > 2                         0.23
>        > 4                         0.23
>        > 8                         0.23
>        > 16                        0.28
>        > 32                        0.28
>        > 64                        0.39
>        > 128                       0.40
>        > 256                       0.40
>        > 512                       0.42
>        > 1024                      0.51
>        > 2048                      0.71
>        > 4096                      1.02
>        > 8192                      1.60
>        > 16384                     3.47
>        > 32768                     5.05
>        > 65536                     8.06
>        > 131072                   14.82
>        > 262144                   28.15
>        > 524288                   53.69
>        > 1048576                 127.47
>        > 2097152                 235.58
>        > 4194304                 683.90
>        >
>        >
>        > -Nathan
>        >
>        > On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>        >>   The local request is not correctly released, leading to an assert
>        >>   in debug mode. This is because you avoid calling
>        >>   MCA_PML_BASE_RECV_REQUEST_FINI, which leaves the request in an
>        >>   ACTIVE state, a condition that is carefully checked during the
>        >>   call to the destructor.
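>        >>
>        >>   To make the failure mode concrete, here is a schematic sketch (the
>        >>   names are invented for illustration; this is not the actual ob1/pml
>        >>   code): skipping the FINI step leaves the request ACTIVE, and the
>        >>   debug-build destructor asserts on exactly that.
>        >>
>        >>   #include <assert.h>
>        >>   #include <stdio.h>
>        >>
>        >>   /* Invented, simplified stand-ins for the real request machinery. */
>        >>   typedef enum { REQ_INACTIVE, REQ_ACTIVE } req_state_t;
>        >>   typedef struct { req_state_t state; } recv_request_t;
>        >>
>        >>   static void request_construct(recv_request_t *req) { req->state = REQ_INACTIVE; }
>        >>   static void request_start(recv_request_t *req)     { req->state = REQ_ACTIVE; }
>        >>
>        >>   /* The FINI step returns a completed request to INACTIVE. */
>        >>   static void request_fini(recv_request_t *req)      { req->state = REQ_INACTIVE; }
>        >>
>        >>   /* The destructor checks that FINI was not skipped. */
>        >>   static void request_destruct(recv_request_t *req)
>        >>   {
>        >>       assert(REQ_ACTIVE != req->state && "request destroyed while ACTIVE");
>        >>   }
>        >>
>        >>   int main(void)
>        >>   {
>        >>       recv_request_t req;
>        >>       request_construct(&req);
>        >>       request_start(&req);
>        >>       /* ... wait for completion ... */
>        >>       request_fini(&req);     /* omit this call and the assert fires */
>        >>       request_destruct(&req);
>        >>       printf("request released cleanly\n");
>        >>       return 0;
>        >>   }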
>        >>
>        >>   I attached a second patch that fixes the issue above and
>        >>   implements a similar optimization for the blocking send.
>        >>
>        >>   Unfortunately, this is not enough. The mca_pml_ob1_send_inline
>        >>   optimization is horribly wrong in the multithreaded case, as it
>        >>   alters the send_sequence without storing it. If you create a gap
>        >>   in the send_sequence, a deadlock will __definitely__ occur. I
>        >>   strongly suggest you turn off the mca_pml_ob1_send_inline
>        >>   optimization in the multithreaded case. All the other
>        >>   optimizations should be safe in all cases.
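>        >>
>        >>   To illustrate the hazard (again only a sketch with invented names,
>        >>   not the actual ob1 code): once the fast path claims a sequence
>        >>   number and then bails out without sending anything that carries
>        >>   it, the receiver reorders on that counter and stalls forever.
>        >>
>        >>   #include <stdatomic.h>
>        >>   #include <stdbool.h>
>        >>   #include <stdio.h>
>        >>
>        >>   /* Invented stand-in for the per-peer send sequence counter. */
>        >>   static _Atomic unsigned int send_sequence;
>        >>
>        >>   /* Fast path that claims a sequence number, then may give up
>        >>    * (e.g. no inline resources) without putting it on the wire. */
>        >>   static bool send_inline_broken(bool resources_available)
>        >>   {
>        >>       unsigned int seq = atomic_fetch_add(&send_sequence, 1);
>        >>       if (!resources_available) {
>        >>           /* BUG: seq was consumed but never sent. Every later
>        >>            * message now waits behind the missing one: deadlock. */
>        >>           return false;
>        >>       }
>        >>       printf("sent fragment with sequence %u\n", seq);
>        >>       return true;
>        >>   }
>        >>
>        >>   int main(void)
>        >>   {
>        >>       send_inline_broken(true);   /* sequence 0 goes out */
>        >>       send_inline_broken(false);  /* sequence 1 is lost: a gap */
>        >>       send_inline_broken(true);   /* receiver holds sequence 2 waiting for 1 */
>        >>       return 0;
>        >>   }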
>        >>
>        >>     George.
>        >>
>        >>   On Jan 8, 2014, at 01:15, Shamis, Pavel <sham...@ornl.gov> wrote:
>        >>
>        >>> Overall it looks good. It would be helpful to validate performance
>        >>> numbers for other interconnects as well.
>        >>> -Pasha
>        >>>
>        >>>> -----Original Message-----
>        >>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
>        >>>> Sent: Tuesday, January 07, 2014 6:45 PM
>        >>>> To: Open MPI Developers List
>        >>>> Subject: [OMPI devel] RFC: OB1 optimizations
>        >>>>
>        >>>> What: Push some ob1 optimizations to the trunk and 1.7.5.
>        >>>>
>        >>>> What: This patch contains two optimizations:
>        >>>>
>        >>>> - Introduce a fast send path for blocking send calls. This path
>        >>>>   uses the btl sendi function to put the data on the wire without
>        >>>>   the need for setting up a send request (a rough sketch of the
>        >>>>   idea follows below). In the case of btl/vader this can also
>        >>>>   avoid allocating/initializing a new fragment. With btl/vader
>        >>>>   this optimization improves small message latency by 50-200ns in
>        >>>>   ping-pong type benchmarks. Larger messages may take a small hit
>        >>>>   in the range of 10-20ns.
>        >>>>
>        >>>> - Use a stack-allocated receive request for blocking receives.
>        >>>>   This optimization saves the extra instructions associated with
>        >>>>   accessing the receive request free list. I was able to get
>        >>>>   another 50-200ns improvement in the small-message ping-pong with
>        >>>>   this optimization. I see no hit for larger messages.
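>        >>>>
>        >>>> Roughly, the blocking-send fast path has the following shape. This
>        >>>> is only an illustrative, self-contained sketch with made-up names
>        >>>> (btl_t, blocking_send, etc.), not the code in the attached patch:
>        >>>>
>        >>>> #include <stdbool.h>
>        >>>> #include <stddef.h>
>        >>>> #include <stdio.h>
>        >>>>
>        >>>> /* Made-up stand-in for the BTL interface. */
>        >>>> typedef struct btl {
>        >>>>     /* Optional "send immediate": returns true if the data went out
>        >>>>      * without a send request/fragment having to be set up. */
>        >>>>     bool (*sendi)(struct btl *btl, const void *buf, size_t len);
>        >>>> } btl_t;
>        >>>>
>        >>>> /* Regular path: stands in for the request-based send. */
>        >>>> static bool regular_send(btl_t *btl, const void *buf, size_t len)
>        >>>> {
>        >>>>     (void) btl; (void) buf;
>        >>>>     printf("slow path: set up a send request for %zu bytes\n", len);
>        >>>>     return true;
>        >>>> }
>        >>>>
>        >>>> /* Blocking send: try the inline path, fall back to the request path. */
>        >>>> static bool blocking_send(btl_t *btl, const void *buf, size_t len)
>        >>>> {
>        >>>>     if (NULL != btl->sendi && btl->sendi(btl, buf, len)) {
>        >>>>         return true;                    /* fast path: no request created */
>        >>>>     }
>        >>>>     return regular_send(btl, buf, len); /* no sendi or out of resources */
>        >>>> }
>        >>>>
>        >>>> static bool demo_sendi(struct btl *btl, const void *buf, size_t len)
>        >>>> {
>        >>>>     (void) btl; (void) buf;
>        >>>>     printf("fast path: %zu bytes sent inline\n", len);
>        >>>>     return true;
>        >>>> }
>        >>>>
>        >>>> int main(void)
>        >>>> {
>        >>>>     char payload[8] = {0};
>        >>>>     btl_t with_sendi    = { demo_sendi };
>        >>>>     btl_t without_sendi = { NULL };
>        >>>>     blocking_send(&with_sendi, payload, sizeof payload);    /* inline */
>        >>>>     blocking_send(&without_sendi, payload, sizeof payload); /* fallback */
>        >>>>     return 0;
>        >>>> }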
>        >>>>
>        >>>> When: These changes touch the critical path in ob1 and are
>        >>>> targeted for 1.7.5. As such I will set a moderately long timeout.
>        >>>> Timeout set for next Friday (Jan 17).
>        >>>>
>        >>>> Some results from osu_latency on Haswell:
>        >>>>
>        >>>> [hjelmn@cn143 pt2pt]$ mpirun -n 2 --bind-to core -mca btl vader,self ./osu_latency
>        >>>> # OSU MPI Latency Test v4.0.1
>        >>>> # Size          Latency (us)
>        >>>> 0                       0.11
>        >>>> 1                       0.14
>        >>>> 2                       0.14
>        >>>> 4                       0.14
>        >>>> 8                       0.14
>        >>>> 16                      0.14
>        >>>> 32                      0.15
>        >>>> 64                      0.18
>        >>>> 128                     0.36
>        >>>> 256                     0.37
>        >>>> 512                     0.46
>        >>>> 1024                    0.56
>        >>>> 2048                    0.80
>        >>>> 4096                    1.12
>        >>>> 8192                    1.68
>        >>>> 16384                   2.98
>        >>>> 32768                   5.10
>        >>>> 65536                   8.12
>        >>>> 131072                 14.07
>        >>>> 262144                 25.30
>        >>>> 524288                 47.40
>        >>>> 1048576                91.71
>        >>>> 2097152               195.56
>        >>>> 4194304               487.05
>        >>>>
>        >>>>
>        >>>> Patch Attached.
>        >>>>
>        >>>> -Nathan
>        >>
>        >
>        >
>        >
>        
> <ob1_optimization_take3.patch>
> 

> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

