On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote:
>    Nathan,
>    Has this bug always been present in OpenIB or is this a recent addition?
>    If this is regression, I would also be inclined to say that this is a

The bug is as old as the message coalescing feature in the openib
btl. Once that feature was added, the openib btl no longer supported
calling btl_free on descriptors allocated by sendi (a serious bug).

>    blocker for 1.8.4. This is a SIGNIFICANT bug. Both Howard and I were quite
>    surprised that all the while this code has been in use at LANL
>    in production systems, this issue was never discovered. 

I don't know why it suddenly came up, but in 1.8.1 we added an inline send
optimization to the MPI_Isend path. The optimization uses the btl_sendi
function to get the fragment on the wire without allocating a send
request. If this fails the btl fragment returned by sendi is released
with btl_free, a send request is allocated, and the normal send path is
used. The optimization was tested with the openib btl so I don't know
why this wasn't caught earlier. My guess is some other change may have
triggered it.
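
To make the fallback concrete, here is a rough sketch of the control flow
described above. Every name in it (toy_btl_t, toy_sendi, toy_free,
toy_isend) is a made-up stand-in, not the real BTL interface or its
signatures; it only shows why the btl has to accept a free on a
descriptor that sendi handed back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: NOT the real Open MPI BTL types or signatures. */
typedef struct toy_descriptor { size_t len; } toy_descriptor_t;

typedef struct toy_btl {
    bool wire_busy;        /* when true, the inline send cannot complete */
    int  live_descriptors; /* descriptors handed out and not yet freed */
} toy_btl_t;

typedef enum { SENT_INLINE, SENT_VIA_REQUEST } toy_path_t;

/* Try to put the fragment on the wire immediately. On failure, hand the
 * caller a descriptor that the caller now owns and must release. */
static int toy_sendi(toy_btl_t *btl, size_t len, toy_descriptor_t **des)
{
    *des = NULL;
    if (!btl->wire_busy) {
        return 0;                      /* fragment went out inline */
    }
    *des = malloc(sizeof **des);
    if (NULL != *des) {
        (*des)->len = len;
        btl->live_descriptors++;
    }
    return -1;                         /* inline send failed */
}

/* Release a descriptor, including one that came back from toy_sendi. */
static void toy_free(toy_btl_t *btl, toy_descriptor_t *des)
{
    assert(NULL != des);
    free(des);
    btl->live_descriptors--;
}

/* Isend path: inline send first, then fall back to the normal path. */
static toy_path_t toy_isend(toy_btl_t *btl, size_t len)
{
    toy_descriptor_t *des = NULL;

    if (0 == toy_sendi(btl, len, &des)) {
        return SENT_INLINE;            /* no send request was needed */
    }
    if (NULL != des) {
        toy_free(btl, des);            /* the step that must be supported */
    }
    /* A real implementation would allocate a send request here and use
     * the normal send path; the sketch only reports which path ran. */
    return SENT_VIA_REQUEST;
}
```

In this toy version, calling toy_isend on a btl with wire_busy set takes
the fallback path and leaves live_descriptors at zero, which is the
invariant that breaks if the btl cannot free a sendi descriptor.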

We fixed the bug in 1.8.4 by totally disabling message coalescing. The
feature is meant to game the osu_mbw_mr test and does next to nothing
for real apps. Additionally, the inline send optimization makes the
feature less of a win with osu_mbw_mr anyway. We beat mvapich handily on
LANL systems without message coalescing.

For master I have a fix that allows the message coalescing feature to
remain on. This fix will come over with the btl changes. I may backport
this fix to 1.8.5 as it fixes another long-standing bug with message
coalescing.

-Nathan
