Eugene Loh wrote:
Possibly, you meant to ask how one does directed polling with a wildcard source MPI_ANY_SOURCE. If that was your question, the answer is we punt. We report failure to the ULP, which reverts to the standard code path.

Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE. So, is recvi() a one-time shot? I.e., do you poll the right queue only once and, if that fails, fall back on polling all queues? If yes, then it's unobtrusive, but I don't think it would help much. If you poll the right queue many times, then you have to decide when to fall back on polling all queues, and that's not trivial.

How do you ensure you check all incoming queues from time to time to prevent flow-control stalls (especially if the queues are small, for scaling)?
There are a variety of choices here. Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something).

In the vast majority of cases, users don't know how to turn the knobs. The problem is that as local np goes up, queue sizes go down fast (as the square root), and you have to poll all queues more often. Using more memory for the queues just pushes the scalability wall a little bit further.

congestion. What if the user code then posts a rather specific request (receive a message with a particular tag on a particular communicator from a particular source) and with high urgency (a blocking request... "I ain't going anywhere until you give me what I'm asking for")? A good servant would drop whatever else s/he is doing to oblige the boss.

If you poll only one queue, then stuff can pile up on another and a sender is now blocked. At best, you have a synchronization point. At worst, a deadlock.

So, let's say there's a standard MPI_Recv. Let's say there's also some congestion starting to build. What should the MPI implementation do?

The MPI implementation cannot trust the user/app to indicate where the messages will come from. So, if you have N incoming queues, you need to poll them all eventually. If you do, polling time increases linearly with N. If you try to limit the polling space with some heuristic (like the queue corresponding to the current blocking receive), then you take the risk of not consuming another queue fast enough. And the heuristics usually fall apart quickly (ANY_SOURCE, multiple asynchronous receives, etc.).

Really, only single-queue solves that.

Yes, and you could toss the receive-side optimizations as well. So, one could say, "Our np=2 latency remains 2x slower than Scali's, but at least we no longer have that hideous scaling with large np." Maybe that's where we want to end up.

I think all the optimizations except recvi() are fine and worth using. I am just saying that the recvi() optimization is dubious as it stands, and that the single queue is potentially the bigger low-hanging fruit on the receive side: it could still be fast (a spinlock or an atomic operation to manage the shared receive queue), keeping np=2 latency low, and it would scale well with large np. No tuning needed, no special cases, and a smaller memory footprint.

I will leave it at that; just some input.

Patrick
