Re: [OMPI devel] memcpy MCA framework

2008-08-17 Thread Jeff Squyres
Let's talk about this in Dublin.  I can probably help with the m4  
magic, but I need to understand exactly what needs to be done first.



On Aug 16, 2008, at 11:51 AM, Terry Dontje wrote:


George Bosilca wrote:
The intent of the memcpy framework is to allow a selection between  
several memcpy at runtime. Of course, there will be a preselection  
at compile time, but all versions that can compile on a given  
architecture will be benchmarked at runtime and the best one will  
be selected. There is a file with several versions of memcpy for  
x86 (32 and 64) somewhere around (I should have one if interested),  
that can be used as a starting point.
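
As an illustration of the runtime selection described above, here is a minimal sketch in C. It is not the actual opal_memcpy code: the names `copy_fn_t`, `copy_libc`, `copy_bytewise`, and `select_fastest_copy` are hypothetical, and the byte loop merely stands in for a hand-tuned variant. Each candidate that compiled is timed once, checked for correctness, and the fastest correct one is returned as a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every candidate copy routine (hypothetical name). */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t len);

/* Candidate 1: the C library memcpy. */
static void *copy_libc(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* Candidate 2: a plain byte loop, standing in for a hand-tuned variant. */
static void *copy_bytewise(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; ++i) {
        d[i] = s[i];
    }
    return dst;
}

/* Run `reps` copies of `len` bytes and return the elapsed CPU seconds. */
static double time_copy(copy_fn_t fn, void *dst, const void *src,
                        size_t len, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; ++i) {
        fn(dst, src, len);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Benchmark every candidate and return the fastest one that copies
 * correctly. Falls back to the libc copy if buffers cannot be allocated. */
static copy_fn_t select_fastest_copy(size_t len, int reps)
{
    static const copy_fn_t candidates[] = { copy_libc, copy_bytewise };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);
    copy_fn_t best = copy_libc; /* safe default */
    double best_time = -1.0;

    assert(len > 0 && reps > 0);
    unsigned char *src = malloc(len);
    unsigned char *dst = malloc(len);
    if (src != NULL && dst != NULL) {
        memset(src, 0xA5, len);
        for (size_t i = 0; i < count; ++i) {
            memset(dst, 0, len);
            double elapsed = time_copy(candidates[i], dst, src, len, reps);
            if (memcmp(dst, src, len) != 0) {
                continue; /* skip a candidate that copied incorrectly */
            }
            if (best_time < 0.0 || elapsed < best_time) {
                best_time = elapsed;
                best = candidates[i];
            }
        }
    }
    free(src);
    free(dst);
    return best;
}
```

A caller would run `select_fastest_copy(64 * 1024, 1000)` once at startup and keep the returned pointer. A real benchmark would probably need several message sizes and a finer clock than `clock()`, which is part of why a benchmark could pick the wrong routine on some machines.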


Ok, I guess I need to look at this code.  I wonder if there may be
cases for Sun's machines in which this benchmark could end up
picking the wrong memcpy?

The only thing we need is a volunteer to build the m4 magic.
Figuring out what we can compile is kind of tricky, as some of the
functions are in assembly, some others in C, and some others a
mixture (the MMX headers).


Isn't the atomic code very similar?  If I get to this point before  
anyone else I probably will volunteer.


--td

 george.

On Aug 16, 2008, at 3:19 PM, Terry Dontje wrote:


Hi Tim,
Thanks for bringing the below up and asking for a redirection to
the devel list.  I think looking at/using the MCA memcpy framework
would be a good thing to do, and maybe we can work on this together
once I get out from under some commitments.  However, one of the
challenges that originally scared me away from looking at the
memcpy MCA is whether we really want all the OMPI memcpy's to be
replaced or just specific ones.  Also, I was concerned about trying
to figure out which version of memcpy I should be using.  I
believe currently things are done such that you get one version
based on which system you compile on.  For Sun there may be
several different SPARC platforms that would need to use different
memcpy code, but we would like to just ship one set of bits.

I'm not saying the above is not doable under the memcpy MCA framework,
just that it somewhat scared me away from thinking about it at
first glance.


--td

Date: Fri, 15 Aug 2008 12:08:18 -0400
From: "Tim Mattox"
Subject: Re: [OMPI users] SM btl slows down bandwidth?
To: "Open MPI Users"
Content-Type: text/plain; charset=ISO-8859-1

Hi Terry (and others),

I have previously explored this some on Linux/X86-64 and concluded
that Open MPI needs to supply its own memcpy routine to get good sm
performance, since the memcpy supplied by glibc is not even close to
optimal. We have an unused MCA framework already set up to supply an
opal_memcpy. AFAIK, George and Brian did the original work to set up
that framework. It has been on my to-do list for a while to start
implementing opal_memcpy components for the architectures I have
access to, and to modify OMPI to actually use opal_memcpy where it
makes sense.

Terry, I presume what you suggest could be dealt with similarly when
we are running/building on SPARC. Any followup discussion on this
should probably happen on the developer mailing list.

On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje wrote:
> Interestingly enough on the SPARC platform the Solaris memcpy's
> actually use non-temporal stores for copies >= 64KB.  By default some
> of the mca parameters to the sm BTL stop at 32KB.  I've done
> experimentations of bumping the sm segment sizes to above 64K and seen
> incredible speedup on our M9000 platforms.  I am looking for some nice
> way to integrate a memcpy that lowers this boundary to 32KB or lower
> into Open MPI.
> I have not looked into whether Solaris x86/x64 memcpy's use the
> non-temporal stores or not.
>
> --td
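
To make the idea of a size threshold for non-temporal stores concrete, here is a rough sketch in C for x86 with SSE2. It is not the Solaris/SPARC memcpy discussed above and not Open MPI code. The name `copy_with_nt_stores` and the `NT_THRESHOLD` value of 32KB are assumptions chosen only to mirror the boundary mentioned in the message. On builds without SSE2 the function simply calls memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff: copies at or above this size bypass the cache. */
#define NT_THRESHOLD ((size_t)32 * 1024)

/* Copy len bytes, using non-temporal (streaming) stores for large copies
 * when SSE2 is available and plain memcpy otherwise. */
static void *copy_with_nt_stores(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    if (len >= NT_THRESHOLD) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Copy a short head normally so that d becomes 16-byte aligned,
         * which _mm_stream_si128 requires. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        memcpy(d, s, head);
        d += head;
        s += head;
        len -= head;

        /* Stream the aligned body 16 bytes at a time. */
        size_t blocks = len / 16;
        for (size_t i = 0; i < blocks; ++i) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
            _mm_stream_si128((__m128i *)(d + 16 * i), v);
        }
        /* Make the streaming stores visible before returning. */
        _mm_sfence();

        /* Copy whatever is left after the last full block. */
        memcpy(d + 16 * blocks, s + 16 * blocks, len - 16 * blocks);
        return dst;
    }
#endif
    return memcpy(dst, src, len);
}
```

Whether streaming stores help depends on whether the receiver touches the buffer soon afterwards, so the right threshold is something to measure per platform rather than assume.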


>>
>> Message: 1
>> Date: Thu, 14 Aug 2008 09:28:59 -0400
>> From: Jeff Squyres 
>> Subject: Re: [OMPI users] SM btl slows down bandwidth?
>> To: rbbr...@sandia.gov, Open MPI Users 
>> Message-ID: <562557eb-857c-4ca8-97ad-f294c7fed...@cisco.com>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> At this time, we are not using non-temporal stores for shared memory
>> operations.
>>
>>
>> On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
>>
 >> [...]
 >>
 >> MPICH2 manages to get about 5GB/s in shared memory performance on the
 >> Xeon 5420 system.
>>> >
>>> > Does the sm btl use a memcpy with non-temporal stores like MPICH2?
>>> > This can be a big win for bandwidth benchmarks that don't actually
>>> > touch their receive buffers at all...
>>> >
>>> > -Ron
>>> >
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> -- Jeff Squyres Cisco Systems


-- T

Re: [OMPI devel] memcpy MCA framework

2008-08-17 Thread Brian Barrett
I obviously won't be in Dublin (I'll be in a fishing boat in the  
middle of nowhere Canada -- much better), so I'm going to chime in now.


The m4 part actually isn't too bad and is pretty simple.  Other than
looking at some variables set by ompi_config_asm, I'm not sure there
is much to check.  The hard part is dealing with the finer-grained
instruction set requirements.


On x86 in particular, many of the operations in the memcpy are part of  
SSE, SSE2, or SSE3.  Currently, we don't have any finer concept of a  
processor than x86 and most compilers target an instruction set that  
will run on anything considered 686, which is almost everything out  
there.  We'd have to decide how to handle instruction streams which  
are no longer going to work on every chip.  Since we know we have a  
number of users with heterogeneous x86 clusters, this is something to  
think about.
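
One possible way to handle this is to check the CPU at run time and only then pick a variant. The sketch below is an assumption, not anything Open MPI does today: it relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, and the names `isa_level_t`, `detect_isa_level`, and `variant_for_level` are hypothetical. On other compilers or architectures it reports the generic level.

```c
#include <assert.h>
#include <stddef.h>

/* Coarse instruction-set levels a memcpy variant might require. */
typedef enum { ISA_GENERIC = 0, ISA_SSE2 = 1, ISA_SSE3 = 2 } isa_level_t;

/* Ask the running CPU what it supports. Only GCC/Clang on x86 provide
 * the builtins used here; everything else is treated as generic. */
static isa_level_t detect_isa_level(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse3")) {
        return ISA_SSE3;
    }
    if (__builtin_cpu_supports("sse2")) {
        return ISA_SSE2;
    }
#endif
    return ISA_GENERIC;
}

/* Map a level to the (hypothetical) variant that would be selected. */
static const char *variant_for_level(isa_level_t level)
{
    static const char *const names[] = { "generic C", "SSE2", "SSE3" };
    assert((int)level >= ISA_GENERIC && (int)level <= ISA_SSE3);
    return names[level];
}
```

Because the check runs on each node, a heterogeneous cluster sharing one build could end up with different variants on different hosts. A variant that needs more than the detected level would simply not be eligible there.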


Brian

On Aug 17, 2008, at 7:57 AM, Jeff Squyres wrote:

Let's talk about this in Dublin.  I can probably help with the m4  
magic, but I need to understand exactly what needs to be done first.


