Gleb Natapov wrote:
On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
SSQ is needed for scalability, no need to explain this (by
the way, RD is needed for the same reason too. What is Mellanox's
plan to support it?
RD is not supported in hardware today. Implementing RD is extremely
complicated. To solve the scalability issues of MPI-like applications,
we believe that SRC and SSQ are the right solutions. They are much
simpler to implement in both software and hardware. By MPI-like I refer
to applications that have some level of trust between two processes of
the same application. RD also has some performance issues, as it only
supports one message in flight at a time. Those performance issues are
solved by design in SRC/SSQ.
I didn't know about the RD limitation. Is this a shortcoming of the IB
spec or a general limitation of reliable datagrams? RD looks much nicer
to me than SRC/SSQ.
The RD limitation is part of the IB spec.
It is part of the spec after all, so why invent new shiny
stuff when it is still possible to achieve better scalability
without it).
It's truly about complexity. And as I mentioned in the OFA meeting at
Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
We are discussing your implementation proposal, and in my
opinion it doesn't fit application needs. I may be wrong
here, so if there is somebody who thinks that sending random
completions to random processes is the best idea ever, and that
the absence of this "feature" is the only thing that stops him
from adopting IB, he may chime in here and voice his opinion.
Your input about how to demultiplex send completions on SSQ is
valuable. Unfortunately it is not supported in the current generation.
What I can suggest here is, not new on this thread, but:
1) all pollers see the same CQ; only the poller that sees a completion
that belongs to it takes it out of the CQ
The progress of one process depends on all other processes on the same
node. Not good at all.
In MPI, it often happens that all processes depend on each other
to make forward progress, one way or another. I am not saying that
this is the ideal solution, but there is some price involved in sharing
resources. You can always upgrade resources for a process that utilizes
them, e.g. if the communication pattern is that each process talks with 4
neighbors, then let it have dedicated unshared QPs.
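
To make option 1 and the objection to it concrete, here is a minimal
sketch in Python. It is a toy model, not verbs code: the SharedCQ class,
its poll() method, and the (owner_pid, wr_id) entry layout are all
hypothetical names, and the cross-process locking a real shared CQ would
need is left out. It assumes a poller can only inspect the head entry.

```python
from collections import deque


class SharedCQ:
    """Toy model of one CQ visible to every process on the node (hypothetical)."""

    def __init__(self):
        self._entries = deque()  # FIFO of (owner_pid, wr_id)

    def push(self, owner_pid, wr_id):
        # Stands in for the hardware writing a completion entry.
        self._entries.append((owner_pid, wr_id))

    def poll(self, pid):
        """Return the head wr_id if it belongs to pid, otherwise None."""
        if self._entries and self._entries[0][0] == pid:
            return self._entries.popleft()[1]
        return None  # CQ is empty, or the head belongs to another process


cq = SharedCQ()
cq.push(owner_pid=1, wr_id=100)
cq.push(owner_pid=2, wr_id=200)

blocked = cq.poll(2)    # process 2 is stuck behind process 1's entry
first = cq.poll(1)      # process 1 takes its own completion
unblocked = cq.poll(2)  # only now can process 2 make progress
```

In this model process 2 cannot reach its completion until process 1 has
polled, which is the dependency between processes discussed above.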
2) only one process polls the CQ; if a completion doesn't belong to the
poller, the poller puts it in a SW queue for the right process. The
other processes just poll their SW queue
Not good, for the same reason.
As a variant, each process can poll both the HW CQ and its SW CQ; if a
completion from the HW CQ belongs to another process, it puts it on the
appropriate SW CQ. I don't think that a reasonable API will require such
effort from applications (and I am not even talking about all the locking
overhead and cache bouncing that will result from such an implementation,
but latency will be bad, that's for sure).
I don't think that polling for SQ completions is in the latency path.
You usually need it in order to free networking buffers. In any case I
understand your point.
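
The variant described above (every process polls the HW CQ and forwards
completions that are not its own) can be sketched as follows. This is a
single-threaded toy model in Python under stated assumptions: the Node
class and its methods are hypothetical, and the locking and cache
effects mentioned above are not modeled at all.

```python
from collections import deque


class Node:
    """Toy model: one shared HW CQ plus one SW queue per process (hypothetical)."""

    def __init__(self, pids):
        self.hw_cq = deque()                         # (owner_pid, wr_id)
        self.sw_cq = {pid: deque() for pid in pids}  # forwarded wr_ids

    def hw_complete(self, owner_pid, wr_id):
        # Stands in for the hardware reporting a send completion.
        self.hw_cq.append((owner_pid, wr_id))

    def poll(self, pid):
        """Return the next wr_id for pid, forwarding other owners' entries."""
        # Entries forwarded earlier are older than anything still in the
        # HW CQ, so serve them first to keep per-process ordering.
        if self.sw_cq[pid]:
            return self.sw_cq[pid].popleft()
        while self.hw_cq:
            owner, wr_id = self.hw_cq.popleft()
            if owner == pid:
                return wr_id
            self.sw_cq[owner].append(wr_id)  # hand over to the right process
        return None


node = Node([1, 2])
node.hw_complete(1, 100)
node.hw_complete(2, 200)
node.hw_complete(1, 101)

got_p2 = node.poll(2)                  # forwards 100 to process 1, returns 200
got_p1 = [node.poll(1), node.poll(1)]  # 100 from the SW queue, then 101
```

Option 2 proper is the same loop with only one designated process
allowed to touch the HW CQ.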
3) the SQ will have a "completed WQE index" reported. Everybody can
look at it and determine how many WQEs completed. This one has
some cons because the CQ is not shared here... need to bake this
one more.
And where will the application get the WC? Or should it maintain its own
queue of WQEs?
In this method, each app should have its own queue.
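
A minimal sketch of what "its own queue" could look like for option 3,
in Python. Everything here is an assumption for illustration: the
SharedSQ and Process classes are hypothetical, WQEs are assumed to
complete in posting order, and the index is a plain counter that never
wraps (a real index would need wraparound handling).

```python
from collections import deque


class SharedSQ:
    """Toy shared send queue exposing a completed-WQE index (hypothetical)."""

    def __init__(self):
        self.next_index = 0        # index the next posted WQE will get
        self.completed_index = -1  # last completed WQE, as reported by "HW"

    def post(self):
        idx = self.next_index
        self.next_index += 1
        return idx

    def hw_complete_up_to(self, idx):
        # Stands in for the hardware advancing the reported index.
        self.completed_index = idx


class Process:
    """Each process remembers the SQ index of every WQE it posted."""

    def __init__(self, sq):
        self.sq = sq
        self.pending = deque()  # (sq_index, buffer), oldest first

    def send(self, buffer):
        self.pending.append((self.sq.post(), buffer))

    def reap(self):
        """Return buffers whose WQEs are at or below the completed index."""
        done = []
        while self.pending and self.pending[0][0] <= self.sq.completed_index:
            done.append(self.pending.popleft()[1])
        return done


sq = SharedSQ()
a, b = Process(sq), Process(sq)
a.send("a0")  # SQ index 0
b.send("b0")  # SQ index 1
a.send("a1")  # SQ index 2
sq.hw_complete_up_to(1)
freed_a = a.reap()
freed_b = b.reap()
```

No process consumes anything from a shared CQ here; each one only
compares its own bookkeeping against the reported index.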
If we wrap one of these into the right API, once there is HW available
that can do the SSQ CQ demultiplexing, it can work without any API change.
That is something I don't see in the proposed API.
Looking at Dror's slides, on slide 6 "Scalable Reliable
Connection" I see that the wire protocol is extended to send the DST
SRQ as part of a header.
The receiver side then puts the completion on the appropriate CQ
according to this field. Does your proposal address this? How?
SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
unfortunately.
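
The receive-side demultiplexing discussed here (a destination SRQ field
in the header selecting the CQ) can be modeled roughly as below. This is
a toy sketch in Python, not the wire format or any real API: the
Receiver class, the dict-based header, and the "dst_srq" key are
hypothetical, and each SRQ is assumed to have exactly one CQ.

```python
from collections import deque


class Receiver:
    """Toy receiver that picks a CQ from the header's SRQ field (hypothetical)."""

    def __init__(self):
        self.cq_of_srq = {}  # SRQ number -> CQ owned by one process

    def create_srq(self, srq_num):
        cq = deque()
        self.cq_of_srq[srq_num] = cq
        return cq

    def on_packet(self, header, payload):
        # The destination SRQ carried in the header selects the CQ,
        # so no process ever sees another process's completion.
        self.cq_of_srq[header["dst_srq"]].append(payload)


rx = Receiver()
cq_a = rx.create_srq(7)
cq_b = rx.create_srq(9)
rx.on_packet({"dst_srq": 9}, "msg for b")
rx.on_packet({"dst_srq": 7}, "msg for a")
```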
Is it possible to add this with only a FW upgrade?
Unfortunately no.
But I think that with the right API we can abstract this, and later on
have better performance for it.
Who will put this additional data on the wire (HW, libibverbs, or
maybe the app)? Also, I don't see this in Dror's slides, but
the completion of a local operation should be demultiplexed to the
appropriate CQ too. A WQE may contain an additional field, for
instance, that tells where to put the completion. Once
again, who will do the demux in your proposal (HW, libibverbs,
or the app)? The right answer is most certainly HW in both cases,
so will Hermon support this?
Or maybe you want to demultiplex everything inside
libibverbs? In this case I want to see the design of this
(preferably with performance analysis).
One thing to mention. The way I see it follows the order of the
slides. First get SRC going and improve the scalability. Then SSQ can be
added to further improve scalability. In other words, I am suggesting
that maybe we can worry about the SSQ deficiencies a bit later :)
That is my point! Let's do it once, let's do it right, and let's do it
when HW is ready :)
SRC is ready in HW, it can be implemented in SW now and will
significantly help scalability.
We can resume SSQ discussion or other alternatives later on...
--
Gleb.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general