At 08:56 AM 8/25/2006, Greg Lindahl wrote:
> On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote:
> > He does say this, but his analysis does not support this conclusion.
> > His analysis revolves around MPI send/recv, not the MPI 2.0 get/put
> > services.
> Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't
> change reality much.
Is this due to legacy or other reasons? One reason cited by Winsock
Direct for using bcopy rather than the RDMA zcopy operations was the
cost of registering memory when done on a per-operation basis, i.e. for
single use. The bcopy threshold was ~9KB. With the new verbs developed
for iWARP and then added to IB v1.2, the bcopy threshold was reduced to
~1KB.
Now, if I recall correctly, many MPI implementations split their buffer
usage between what are often 1KB envelopes and large regions. One can
persistently register the envelopes, so their size does not really
matter, and thus could use send/receive or RDMA semantics for their
updates depending upon how the completions are managed. The larger data
movements can use RDMA semantics if desired, as these are typically
large in size.
> > A valid conclusion IMO is that MPI send/recv can be most efficiently
> > implemented over an unconnected reliable datagram protocol that
> > supports 64-bit tag matching at the data sink. And not
> > coincidentally, Myricom has this ;-)
> As do all of the non-VIA-family interconnects he mentions. Since we
> all landed on the same conclusion, you might think we're on to
> something. Or not.
We've had this argument multiple times and examined all of the known and
relatively high-volume usage models, including the suite of MPI
benchmarks used to evaluate and drive implementations. Any interconnect
architecture is one of compromise if it is to be used in a volume
environment; the goal for the architects is to ensure the compromises do
not result in a brain-dead or too-diminished technology that will not
meet customer requirements.
With respect to reliable datagram: unless one does software multiplexing
over what amounts to a reliable connection (which comes with a
performance penalty as well as complexity in error-recovery logic,
etc.), it really does not buy one anything better than the RC model used
today. Given the application mix and the customer usage model, IB
provided four transport types to meet different application needs and
allow people to make choices. iWARP reduced this to one, since the
target applications really were met with RC, and reliable datagram as
defined in IB simply was not being picked up or demanded by the targeted
ISVs. While some of us had argued for the software multiplex model,
others wanted everything to be implemented in hardware, so IB is what it
is today. In any case, it is one of a set of reasonable compromises,
and for the most part I contend it is difficult to argue that these
interconnect technologies are so compromised that they are brain-dead or
broken.
> However, that's only part of the argument. Another part is that the
> buffer space needed to use RDMA put/get for all data links is huge.
> And there are some other interesting points.
The buffer and context differences to track RDMA vs. Send are not
significant in terms of hardware. In terms of software, memory needs to
be registered in some capacity to perform DMA to it, and hence there is
a cost from the OS/application perspective. Our goals were to be able
to use application buffers to provide zero-copy data movement as well as
OS bypass. RDMA vs. Send does not incrementally differ in terms of
resource costs in the end.
> > I DO agree that it is interesting reading. :-) It's definitely got
> > people fired up.
> Heh. Glad you found it interesting.
The article is somewhat interesting but does not really present anything
novel in this ongoing debate about how interconnects should be designed.
There will always be someone pointing out a particular issue here and
there, and in the end many of these amount to mouse nuts when placed
into the larger context. When they don't, a new interconnect is defined
or extensions are made to compensate, as nothing is ever permanent or
perfect.
Mike
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general