OMPI OF Pow Wow Notes
26 Nov 2007
---------------------------------------------------------------------------
Discussion of current / upcoming work:
OCG (Chelsio):
- Did a bunch of udapl work, but abandoned it. Will commit it to a
tmp branch if anyone cares (likely not).
- They have been directed to move to the verbs API; will be starting
on that next week.
Cisco:
- likely to get more involved in PML/BTL issues since Galen and Brian
are now transferring out of these areas.
- race between Chelsio / Cisco as to who implements RDMA CM connect PC
first (more on this below). May involve some changes to the connect
PC interface.
- Re-working libevent and progress engine stuff with George
LLNL:
- Andrew Friedley leaving LLNL in 3 weeks
- UD code is more or less functional, but he is working on reliability
support down in the BTL. That part is not quite working yet.
- When he leaves LLNL, UD BTL may become unmaintained.
IBM:
- Has an interest in NUNAs
- May be interested in maintaining the UD BTL; worried about scale.
Mellanox:
- Just finished first implementation of XRC
- Now working on QA issues with XRC: testing with multiple subnets,
different numbers of HCAs/ports on different hosts, etc.
Sun:
- Currently working full steam ahead on UDAPL.
- Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
- Will *hopefully* get early access to Sun's verbs stack for testing /
compatibility issues before the stack becomes final.
ORNL:
- Mostly working on non-PML/BTL issues these days.
- Galen's advice: get progress thread working for best IB overlap and
real application performance.
Voltaire:
- Working on XRC improvements
- Working on message coalescing. Only sees benefit if you drastically
reduce the number of available buffers and credits -- i.e., behave
much more like the openib BTL before BSRQ (2 buffer sizes, large and
small, with very few small-buffer credits).
<lots of discussion about message coalescing; this will be worth at
least an FAQ item to explain all the trade-offs. There could be
non-artificial benefits for coalescing at scale because of limiting
the number of credits>
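To make the coalescing trade-off concrete, here is a toy send-side model (hypothetical names and structure; not the actual openib BTL code): small messages queue up while no flow-control credits are available, and when a credit frees up, as many queued messages as fit are packed into a single wire buffer. The fewer the credits, the more often messages pile up and coalesce.

```python
# Toy model of send-side message coalescing (illustrative only; not
# the openib BTL implementation). Assumes every message fits in one
# wire buffer.
from collections import deque

WIRE_BUFFER_SIZE = 256  # assumed small-buffer size for this sketch


class CoalescingSender:
    def __init__(self, credits):
        self.credits = credits   # flow-control credits available
        self.pending = deque()   # small messages awaiting a credit
        self.wire_sends = []     # each entry: one buffer's coalesced payloads

    def send(self, payload):
        self.pending.append(payload)
        self._try_flush()

    def credit_returned(self):
        self.credits += 1
        self._try_flush()

    def _try_flush(self):
        # While we hold a credit and have queued messages, pack as many
        # messages as fit into one wire buffer per credit consumed.
        while self.credits > 0 and self.pending:
            batch, used = [], 0
            while self.pending and used + len(self.pending[0]) <= WIRE_BUFFER_SIZE:
                msg = self.pending.popleft()
                batch.append(msg)
                used += len(msg)
            self.wire_sends.append(batch)
            self.credits -= 1
```

With one credit, the first small send goes out alone; the next three queue up and leave together in a single wire buffer once the credit returns -- which is why coalescing only shows a benefit when credits are scarce.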
- Moving the HCA initialization code to a common area so it can be
shared with collective components.
---------------------------------------------------------------------------
Discussion of various "moving forward" proposals:
- ORNL, Cisco, Mellanox discussing how to reduce cost of memory
registration. Currently running some benchmarks to figure out where
the bottlenecks are. Cheap registration would *help* (but not
completely solve) overlap issues by reducing the number of sync
points -- e.g., always just do one big RDMA operation (vs. the
pipeline protocol).
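A back-of-envelope count shows why fewer sync points matter (fragment and message sizes below are invented for illustration): the pipeline protocol registers and transfers the message in fragments, each with its own completion handshake, while cheap registration would let us register the whole buffer once and move it in a single RDMA operation.

```python
# Illustrative sync-point count: pipeline protocol vs. one big RDMA.
import math

def pipeline_sync_points(msg_len, frag_len):
    # one registration + one RDMA completion per fragment (assumed model)
    nfrags = math.ceil(msg_len / frag_len)
    return 2 * nfrags

def single_rdma_sync_points():
    # register once, one RDMA completion
    return 2

msg = 8 * 1024 * 1024    # 8 MiB message (made-up example)
frag = 1024 * 1024       # 1 MiB pipeline fragments (made-up example)
print(pipeline_sync_points(msg, frag))  # 16
print(single_rdma_sync_points())        # 2
```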
- Some discussion of a UD-based connect PC. Gleb mentions that what
was proposed is effectively the same as the IBTA CM (i.e., ibcm).
Jeff will go investigate.
- Gleb also mentions that the connect PC needs to be based on the
endpoint, not the entire module (for non-uniform hardware
networks). Jeff took a to-do item to fix. Probably needs to be
done for v1.3.
- To UD or not to UD? Lots of discussion on this.
- Some data has been presented by OSU showing that UD drops don't
happen often. Their tests were run in a large non-blocking
network. Some in the group feel that in a busy blocking network,
UD drops will be [much] more common (there is at least some
evidence that in a busy non-blocking network, drops *are* rare).
This issue affects how we design the recovery of UD drops: if
drops are rare, recovery can be arbitrarily expensive. If drops
are not rare, recovery needs to be at least somewhat efficient.
If drops are frequent, recovery needs to be cheap/fast/easy.
- Mellanox is investigating why ibv_rc_pingpong gives better
bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
- Discuss the possibility of doing connection caching: only allow so
many RC connections at a time. Maintain an LRU of RC connections.
When you need to close one, also recycle (or free) all of its
posted buffers.
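A minimal sketch of the caching idea (hypothetical API; no such code existed at the time of these notes): keep at most N RC connections open, track them in LRU order, and when a new connection forces an eviction, run a callback that recycles/frees the evicted connection's posted buffers.

```python
# Sketch of an LRU cache of RC connections (illustrative only).
from collections import OrderedDict

class RCConnectionCache:
    def __init__(self, max_conns, free_posted_buffers):
        self.max_conns = max_conns
        self.free_posted_buffers = free_posted_buffers  # eviction callback
        self.conns = OrderedDict()  # peer -> connection state, LRU order

    def get(self, peer):
        if peer in self.conns:
            self.conns.move_to_end(peer)  # mark most recently used
            return self.conns[peer]
        if len(self.conns) >= self.max_conns:
            # Evict the least recently used connection and recycle
            # (or free) all of its posted buffers.
            old_peer, old_conn = self.conns.popitem(last=False)
            self.free_posted_buffers(old_peer, old_conn)
        conn = {"peer": peer}  # stand-in for real QP/endpoint state
        self.conns[peer] = conn
        return conn
```

E.g., with a cache of 2, touching peers a, b, a, c evicts b (the least recently used), freeing its buffers before c's connection is opened.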
- Discussion of MVAPICH technique for large UD messages: "[receiver]
zero copy UD". Send a match header; receiver picks a UD QP from a
ready pool and sends it back to the sender. Fragments from the
user's buffer are posted to that QP on the receiver, so the sender
sends straight into the receiver's target buffer. This scheme
assumes no drops. For OMPI, this scheme would also require changing
our current multi-device striping method: we'd want to stripe
across large contiguous portions of the message (vs. round-robining
small fragments of the message).
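A toy walkthrough of that rendezvous (names and structure are illustrative, not MVAPICH's actual code): the receiver owns a pool of ready UD QPs; on a match header it picks one, pre-posts fragments of the user's receive buffer to it, and replies with the QP id so the sender's fragments land directly in the target buffer. As in the notes, the model assumes no drops.

```python
# Toy model of "receiver zero-copy UD" rendezvous (illustrative only).
FRAG_SIZE = 4  # tiny fragment size, just for the sketch

class Receiver:
    def __init__(self, qp_pool):
        self.qp_pool = qp_pool  # ids of ready UD QPs
        self.posted = {}        # qp -> (user buffer, next write offset)

    def handle_match_header(self, recv_buf):
        qp = self.qp_pool.pop()      # pick a QP from the ready pool
        self.posted[qp] = (recv_buf, 0)
        return qp                    # sent back to the sender

    def deliver(self, qp, frag):
        # Fragment lands directly in the user's buffer (zero copy).
        buf, off = self.posted[qp]
        buf[off:off + len(frag)] = frag
        self.posted[qp] = (buf, off + len(frag))

def send_large_message(receiver, payload):
    recv_buf = bytearray(len(payload))
    qp = receiver.handle_match_header(recv_buf)    # match header + QP reply
    for off in range(0, len(payload), FRAG_SIZE):  # sender fragments payload
        receiver.deliver(qp, payload[off:off + FRAG_SIZE])
    return recv_buf
```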
- One point specifically discussed: long message alltoall at scale
(i.e., large numbers of MPI processes). Andrew Friedley is going
to ask around LLNL if anyone does this, but our guess is no
because each host would need a *ton* of RAM to do this:
(num_procs_per_node * num_procs_total * length_of_buffer). Our
suspicion is that alltoall for short messages is much more common
(and still, by far, not the most common MPI collective).
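As a worked example of that memory estimate (all numbers invented for illustration):

```python
# Per-host memory for a long-message alltoall, per the formula above:
#   num_procs_per_node * num_procs_total * length_of_buffer
procs_per_node = 8               # made-up example values
total_procs = 4096
buffer_len = 1 * 1024 * 1024     # 1 MiB per-peer buffer

bytes_per_host = procs_per_node * total_procs * buffer_len
print(bytes_per_host / 2**30)    # 32.0 -- i.e., 32 GiB per host
```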
- One proposal:
- Use UD for short messages (except for peers that switch to eager
RDMA)
- Always use RC for long messages, potentially with connection
caching+fast IB connect (ibcm?)
- Another proposal: let OSU keep forging ahead with UD and see what
they come up with. I.e., let them figure out if UD is worth it or
not.
- End result: it's not 100% clear that UD is a "win" yet -- there
are many unanswered questions.
- Make a new PML that is essentially "CM+matching": send entire
messages down to the lower layer instead of having the PML do the
fragmenting:
- Rationale:
- pretty simple PML
- allow lower layer to do more optimizations based on full
knowledge of the specific network being used
- networks get CM-like benefits without having to "natively"
support shmem (because matching will still be done in the PML
and there will be a lower layer/component for shmem)
- [possibly] remove some stuff from current code in ob1 that is
not necessary in IB/OF (Gleb didn't think that this would be
useful; most of OB1 is there to support IB/OF)
- not force other networks into the same model as IB/OF (i.e.,
when we want new things in IB/OF, we have to go change all the
other BTLs)
--> ^^ I forgot to mention this point on the call today
- if we go towards a combined RC+UD OF protocol, the current OB1
is not good at this because the BTL flags will have to "lie"
about whether a given endpoint is capable of RDMA or not.
--> Gleb mentioned that it doesn't matter what the PML thinks;
even if the PML tells the BTL to RDMA PUT/GET, the BTL can
emulate it if it isn't supported (e.g., if an endpoint
switches between RC and UD)
- Jeff sees this as a code re-org, not so much as a re-write.
- Gleb is skeptical on the value of this; it may be more valuable if
we go towards a blended UD+RC protocol, though.
The phone bridge started kicking people off at this point (after we
went 30+ minutes beyond the scheduled end time). So no conclusions
were reached. This discussion probably needs to continue in e-mail,
etc.
--
Jeff Squyres
Cisco Systems