Re: [ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects

Dror Goldenberg Mon, 02 Jul 2007 03:58:19 -0700

Gleb Natapov wrote:

On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
SSQ is needed for scalability, no need to explain this (by the way, RD is needed for the same reason too. What's Mellanox's plan to support it?
RD is not supported in hardware today. Implementing RD is extremely
complicated. To solve the scalability issues on MPI-like applications
we believe that SRC and SSQ are the right solutions. It is much simpler
for implementation by both software and hardware. By MPI-like I refer
to applications that have some level of trust between two processes of
the same application. RD also has some performance issues as it only
supports one message in the air. Those performance issues are solved
by design in SRC/SSQ.
Didn't know about the RD limitation. Is this a shortcoming of the IB spec or
a general limitation of reliable datagram? RD looks much nicer to me than SRC/SSQ.


The RD limitation is part of the IB spec.

It is a part of the spec after all, so why invent new shiny stuff when it is still possible to achieve better scalability without it).
It's truly about complexity. And as I mentioned in the OFA meeting at
Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
We are discussing your implementation proposal and in my opinion it doesn't fit application needs. I may be wrong here, so if there is somebody who thinks that sending random completions to random processes is the best idea ever and the absence of this "feature" is the only thing that stops him from IB adoption, he may chime in here and voice his opinion.
Your input about how to demultiplex send completions on SSQ is
valuable. Unfortunately it is not supported in the current generation.
What I can suggest here is, not new on this thread, but:
1) all pollers see the same CQ; only the poller that sees a completion
      that belongs to it takes it out of the CQ
Progress of one process depends on all other processes on the same node. Not
good at all.

In MPI, it happens many times that all processes depend on each other
to make forward progress, this way or the other. I am not saying that
this is the ideal solution, but there is some price involved in sharing
resources. You can always upgrade resources for a process that utilizes
them, e.g. if the communication pattern is that each process talks with 4
neighbors, then let it have dedicated unshared QPs.

2) only one process polls the CQ; if a completion doesn't belong to the
poller, the poller will put it in a SW queue for the right process. The
other processes just poll on the SW queue

Not good for the same reason.

As a variant, each process can poll both the HW CQ and a SW CQ; if a completion
from the HW CQ belongs to another process, put it on the appropriate SW CQ.
I don't think that a reasonable API will require such effort from applications
(and I am not talking about all the locking overhead and cache bouncing that
will result from such an implementation, but latency will be bad, that's for sure).
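
The software demultiplexing described above (option 2 and its variant) can be illustrated with a toy, single-threaded simulation. This is only a sketch under loud assumptions: `SharedCQ`, `post` and `poll` are hypothetical names, completions are modelled as plain `(owner, wr_id)` tuples, and nothing here is the libibverbs API. A real multi-process version would need shared memory and locking, which is exactly the overhead raised above.

```python
from collections import deque


class SharedCQ:
    """Toy stand-in for one hardware CQ shared by several processes."""

    def __init__(self, num_procs):
        self.hw = deque()                              # completions as (owner, wr_id)
        self.sw = [deque() for _ in range(num_procs)]  # one software queue per process

    def post(self, owner, wr_id):
        """Simulate the hardware reporting a completion for `owner`."""
        self.hw.append((owner, wr_id))

    def poll(self, me):
        """Return the next wr_id owned by `me`, or None if none is available."""
        # Completions already forwarded by other pollers come first, so
        # per-process ordering is preserved.
        if self.sw[me]:
            return self.sw[me].popleft()
        while self.hw:
            owner, wr_id = self.hw.popleft()
            if owner == me:
                return wr_id
            self.sw[owner].append(wr_id)  # forward a foreign completion
        return None


cq = SharedCQ(num_procs=3)
for owner, wr_id in [(1, 10), (2, 20), (0, 30), (1, 11)]:
    cq.post(owner, wr_id)

first = cq.poll(0)  # process 0 must drain two foreign entries to reach its own
forwarded = [list(q) for q in cq.sw]
```

In this run process 0 has to move the completions of processes 1 and 2 before it reaches its own, which is the cross-process dependency the thread is arguing about.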

I don't think that polling on SQ completions is in the latency path.
You usually need it in order to free networking buffers. In any case I
understand your point.

3) the SQ will have a "completed WQE index" reported. Everybody can
     look at it and determine how many WQEs completed. This one has
     some cons because the CQ is not shared here... need to bake this
     one more.

And where will the application get the WC? Or should it maintain its own queue
of WQEs?

In this method, each app should have its own queue.
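
A minimal sketch of how option 3 might look from the application side, where each process keeps its own queue of outstanding WQEs and compares it against the reported index. The names `SharedSQ`, `Proc`, `post_send` and `reap` are hypothetical, and the sketch assumes that WQEs on the shared SQ complete in posting order and that the index never wraps; neither assumption is stated in the proposal itself.

```python
from collections import deque


class SharedSQ:
    """Toy shared send queue exposing a hypothetical completed-WQE index."""

    def __init__(self):
        self.next_index = 0   # index handed to the next posted WQE
        self.completed = -1   # highest completed WQE index; -1 means none yet


class Proc:
    """One process sharing the SQ; it tracks only its own WQEs."""

    def __init__(self, sq):
        self.sq = sq
        self.pending = deque()  # (wqe_index, buffer) in posting order

    def post_send(self, buf):
        idx = self.sq.next_index
        self.sq.next_index += 1
        self.pending.append((idx, buf))
        return idx

    def reap(self):
        """Return the buffers whose WQEs are at or below the completed index."""
        done = []
        while self.pending and self.pending[0][0] <= self.sq.completed:
            done.append(self.pending.popleft()[1])
        return done


sq = SharedSQ()
a, b = Proc(sq), Proc(sq)
a.post_send("a0")   # WQE index 0
b.post_send("b0")   # WQE index 1
a.post_send("a1")   # WQE index 2

sq.completed = 1    # pretend the hardware reports WQEs 0 and 1 as done
freed_a = a.reap()
freed_b = b.reap()
```

No completion is ever handed from one process to another here; each process only reads the shared index, at the cost of doing its own bookkeeping instead of receiving a WC.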

If we wrap one of these into the right API, once there is HW available
that can do the SSQ CQ demultiplexing, it can work without any API change.
That is something I don't see in the proposed API.
Looking at Dror's slides, on slide 6 "Scalable Reliable Connection" I see that the wire protocol is extended to send DSTSRQ as part of a header. The receiver side then puts the completion on the appropriate CQ according to this field. Does your proposal address this? How?
SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
unfortunately.
Is it possible to add this with only a FW upgrade?

Unfortunately no.

But I think that with the right API we can abstract this, and later on
have better performance for it.
Who will put this additional data on the wire (HW or libibverbs, or maybe the app)? Also I don't see this in Dror's slides, but completion of a local operation should be demultiplexed to the appropriate CQ too. A WQE may contain an additional field, for instance, that will tell where to put a completion. Once again, who will do the demux in your proposal (HW, libibverbs or app)? The right answer is most certainly HW in both cases, so will Hermon support this? Or maybe you want to demultiplex everything inside libibverbs? In this case I want to see the design of this (preferably with performance analysis).
One thing to mention. The way I see it is according to the order of the
slides. First get SRC going, improve the scalability. Then SSQ can be
added to further improve scalability. In other words I am suggesting
that maybe we can worry about the SSQ deficiencies a bit later :)
That is my point! Let's do it once, let's do it right, and let's do it when HW
is ready :)

SRC is ready in HW, it can be implemented in SW now and will
significantly help scalability.

We can resume SSQ discussion or other alternatives later on...

--
                        Gleb.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

