Re: [openib-general] A critique of RDMA PUT/GET in HPC

2006-08-29 Thread Michael Krause


At 08:56 AM 8/25/2006, Greg Lindahl wrote:
 On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote:
  He does say this, but his analysis does not support this conclusion. His
  analysis revolves around MPI send/recv, not the MPI 2.0 get/put
  services.
 Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't
 change reality much.
Is this due to legacy or other reasons? One reason cited from Winsock
Direct for using bcopy vs. the RDMA zcopy operations was the cost of
registering memory when done on a per-operation basis, i.e. single use.
The bcopy threshold was ~9KB. With the new verbs developed for iWARP and
then added to IB v1.2, the bcopy threshold was reduced to ~1KB.

Now, if I recall correctly, many MPI implementations split their buffer
usage between what are often 1KB envelopes and what are large regions.
One can persistently register the envelopes, so their size does not
really matter, and then use send / receive or RDMA semantics for their
update depending upon how the completions are managed. The larger data
movements can use RDMA semantics if desired, as these are typically
large.
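
As a rough sketch of that envelope / large-region split (illustrative
only - the threshold, pool, and transport hooks below are made-up names,
not any particular MPI's internals):

#include <stddef.h>
#include <string.h>

#define EAGER_THRESHOLD 1024    /* roughly the ~1KB bcopy cutoff above */

struct envelope {
    char payload[EAGER_THRESHOLD];  /* registered once at startup */
    size_t len;
};

/* Placeholder transport hooks; a real MPI would post verbs work requests
 * here. Defined as no-ops so the sketch stands alone. */
static struct envelope pool[64];
static size_t next_env;
static struct envelope *envelope_get(void) { return &pool[next_env++ % 64]; }
static void send_envelope(struct envelope *env) { (void)env; }
static void register_and_rdma_write(const void *buf, size_t len)
{ (void)buf; (void)len; }

/* Small payloads are copied (bcopy) into a persistently registered
 * envelope; large payloads are registered on the fly (or found in a
 * registration cache) and moved with RDMA (zcopy). */
void transport_send(const void *buf, size_t len)
{
    if (len <= EAGER_THRESHOLD) {
        struct envelope *env = envelope_get();
        memcpy(env->payload, buf, len);
        env->len = len;
        send_envelope(env);
    } else {
        register_and_rdma_write(buf, len);
    }
}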

  A valid conclusion IMO is that MPI send/recv can be most efficiently
  implemented over an unconnected reliable datagram protocol that
  supports 64bit tag matching at the data sink. And not coincidentally,
  Myricom has this ;-)
 As do all of the non-VIA-family interconnects he mentions. Since we all
 landed on the same conclusion, you might think we're on to something.
 Or not.
We've had this argument multiple times and examined all of the known and
relatively high-volume usage models, which include the suite of MPI
benchmarks used to evaluate and drive implementations. Any interconnect
architecture is one of compromise if it is to be used in a volume
environment - the goal for the architects is to ensure the compromises do
not result in a brain-dead or too diminished technology that will not
meet customer requirements.
With respect to reliable datagram, unless one does software multiplexing
over what amounts to a reliable connection - which comes with a
performance penalty as well as complexity in terms of error recovery and
similar logic - it really does not buy one anything better than the RC
model used today. Given the application mix and the customer usage model,
IB provided four transport types to meet different application needs and
allow people to make choices. iWARP reduced this to one, since the target
applications really were met with RC, and reliable datagram as defined in
IB simply was not being picked up or demanded by the targeted ISVs.
While some of us had argued for the software multiplex model, others
wanted everything to be implemented in hardware, so IB is what it is
today. In any case, it is one of a set of reasonable compromises and, for
the most part, I contend it is difficult to argue that these interconnect
technologies are so compromised that they are brain dead or broken.
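
To make "software multiplexing" concrete, a sketch of the idea - purely
illustrative, with made-up names and header layout, not an IB or iWARP
wire format: each message on the single reliable connection carries a
small header naming a logical endpoint, and the library demultiplexes in
software.

#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPOINTS 256

struct mux_header {
    uint16_t src_endpoint;   /* logical sender within the peer process */
    uint16_t dst_endpoint;   /* logical receiver within this process   */
    uint32_t payload_len;    /* bytes following the header             */
};

/* Per-endpoint receive state kept by the library, not by the adapter. */
struct endpoint {
    void (*deliver)(uint16_t src, const void *buf, size_t len);
};

static struct endpoint endpoints[MAX_ENDPOINTS];

/* Called for every message that arrives on the one RC channel. */
static void demux(const void *msg, size_t len)
{
    const struct mux_header *h = msg;

    if (len < sizeof(*h) || h->dst_endpoint >= MAX_ENDPOINTS ||
        !endpoints[h->dst_endpoint].deliver)
        return;                               /* drop malformed traffic */

    endpoints[h->dst_endpoint].deliver(h->src_endpoint,
                                       (const char *)msg + sizeof(*h),
                                       h->payload_len);
}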
 However, that's only part of the argument. Another part is that the
 buffer space needed to use RDMA put/get for all data links is huge.
 And there are some other interesting points.
The buffer and context differences to track RDMA vs. Send are not
significant in terms of hardware. In terms of software, memory needs to
be registered in some capacity to perform DMA to it and hence there is a
cost from the OS / application perspective. Our goals were to be able to
use application buffers to provide zero copy data movements as well as
OS bypass. RDMA vs. Send does not incrementally differ in terms of
resource costs in the end.
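
For reference, the registration cost in question is the one-time
ibv_reg_mr() call in this minimal libibverbs example (error handling
trimmed); the resulting memory region can back sends, receives, or RDMA
operations alike.

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;            /* a 1MB application buffer */
    void *buf = malloc(len);

    /* One registration pins the pages and covers local DMA plus remote
     * read/write access; the same MR serves Send/Recv and RDMA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, (unsigned) mr->lkey, (unsigned) mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}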

  I DO agree that it is interesting reading. :-), it's definitely got
  people fired up.
 Heh. Glad you found it interesting.
The article is somewhat interesting but does not really present anything
novel in this ongoing debate on how interconnects should be designed.
There will always be someone pointing out a particular issue here and
there, and in the end many of these amount to mouse nuts when placed into
the larger context. When they don't, a new interconnect is defined or
extensions are made to compensate, as nothing is ever permanent or
perfect.
Mike


Re: [openib-general] A critique of RDMA PUT/GET in HPC

2006-08-25 Thread Tom Tucker
On Thu, 2006-08-24 at 15:53 -0700, Greg Lindahl wrote:
 For those of you interested in this topic, there's an interesting
 article by Patrick Geoffray in HPCwire entitled "A Critique of RDMA".
 
 http://www.hpcwire.com/hpc/815242.html
 
 (you might have to be a subscriber, but I'm sure Patrick would send
 you a copy if you ask.)
 
 It's basically a critique of why SEND/RECV is better for MPI
 implementations than PUT/GET.

He does say this, but his analysis does not support this conclusion. His
analysis revolves around MPI send/recv, not the MPI 2.0 get/put
services. He makes the point (true in my opinion) that the MPI_RECV
64bit (tag, communicator) filter makes MPI_RECV prickly to implement on
IB/iWARP SEND/RECV and IB/iWARP RDMA. His data are drawn from
observations of MPI applications that use MPI send/recv mapped to an
RDMA transport. However, his conclusion covers a programming model (MPI
get/put) that is not observed in the data. In other words, he doesn't
compare the performance of an algorithm implemented using MPI send/recv
vs. the same algorithm implemented using MPI get/put. He evaluates the
performance of an algorithm implemented using MPI send/recv mapped to an
RDMA transport, and then argues that because this mapping has problems, the
RDMA programming model is bad. That conclusion is not supported by his
analysis or his data. A valid conclusion IMO is that MPI send/recv can
be most efficiently implemented over an unconnected reliable datagram
protocol that supports 64bit tag matching at the data sink. And not
coincidentally, Myricom has this ;-)
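
To make the matching requirement concrete, here is a small sketch of the
kind of 64bit match a receive path has to perform; the bit layout and
field names are illustrative only.

#include <stdbool.h>
#include <stdint.h>

struct posted_recv {
    uint64_t match_bits;    /* expected context id | source rank | tag     */
    uint64_t ignore_bits;   /* set bits are wildcards (MPI_ANY_SOURCE/TAG) */
};

/* True when every non-wildcarded bit of the incoming header matches. */
static bool matches(const struct posted_recv *r, uint64_t incoming_bits)
{
    return ((r->match_bits ^ incoming_bits) & ~r->ignore_bits) == 0;
}

int main(void)
{
    /* Receive posted on context 0x12 for tag 7 from any source, where the
     * source rank occupies bits 16..31. */
    struct posted_recv r = { .match_bits  = (0x12ULL << 32) | 7,
                             .ignore_bits = 0xffffULL << 16 };
    uint64_t incoming = (0x12ULL << 32) | (3ULL << 16) | 7;  /* from rank 3 */
    return matches(&r, incoming) ? 0 : 1;   /* exits 0: the receive matches */
}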

I DO agree that it is interesting reading. :-), it's definitely got
people fired up.

My 2 cents.


 
 Even if you don't agree with him, it's good reading. For motivation,
 you might want to note that most of the SEND/RECV-based products
 mentioned achieve better MPI 0-byte latency than IB Verbs-based MPI
 implementations.
 
 While I don't agree with everything Patrick says, this does get back
 to my point that I've run into many people who assume that PUT/GET is
 always the right way to do things. And it isn't.
 
 -- greg
 
 
 
 





Re: [openib-general] A critique of RDMA PUT/GET in HPC

2006-08-25 Thread Greg Lindahl
On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote:

 He does say this, but his analysis does not support this conclusion. His
 analysis revolves around MPI send/recv, not the MPI 2.0 get/put
 services.

Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't
change reality much.
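
For readers who have not seen the MPI-2 one-sided interface in question,
a minimal put looks like the following (standard MPI-2 calls only, run
with at least two ranks; nothing here is specific to any implementation
discussed in this thread):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one int; the memory behind the window is what an
     * RDMA-capable transport would have to register. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int payload = 42;
        /* One-sided: rank 1 posts no matching receive. */
        MPI_Put(&payload, 1, MPI_INT, 1 /* target rank */,
                0 /* displacement */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}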

 A valid conclusion IMO is that MPI send/recv can
 be most efficiently implemented over an unconnected reliable datagram
 protocol that supports 64bit tag matching at the data sink. And not
 coincidentally, Myricom has this ;-)

As do all of the non-VIA-family interconnects he mentions.  Since we
all landed on the same conclusion, you might think we're on to
something. Or not.

However, that's only part of the argument.  Another part is that the
buffer space needed to use RDMA put/get for all data links is huge.
And there are some other interesting points.
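
(A back-of-the-envelope illustration with made-up numbers: if each of N
ranks pre-posts B receive buffers of S bytes for every other rank, each
rank pins (N - 1) * B * S bytes. With N = 1024, B = 8 and S = 64KB that
is already about 512MB of registered memory per rank, before the
application allocates anything.)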

 I DO agree that it is interesting reading. :-), it's definitely got
 people fired up.

Heh. Glad you found it interesting.

-- greg

