Hi,

> From: [email protected] [mailto:linux-rdma-
> [email protected]] On Behalf Of Christopher Mitchell
> Sent: Tuesday, July 07, 2015 5:21 AM
> 
> Roland,
> 
> My libmlx4 is named libmlx4-rdmav2.so and timestamped 2014-01-01 (from
> the package libmlx4-1, 1.0.5-1ubuntu1 for Ubuntu Trusty 14.04).

This is good to align us on the source code. Reading there, I see the
following code handling inline data copy:
                                while (len >= MLX4_INLINE_ALIGN - off) {
                                        to_copy = MLX4_INLINE_ALIGN - off;
                                        memcpy(wqe, addr, to_copy);
                                        len -= to_copy;
                                        wqe += to_copy;
                                        addr += to_copy;
                                        seg_len += to_copy;
                                        wmb(); /* see comment below */
                                        seg->byte_count = htonl(MLX4_INLINE_SEG | seg_len);
                                        seg_len = 0;
                                        seg = wqe;
                                        wqe += sizeof *seg;
                                        off = sizeof *seg;
                                        ++num_seg;
                                }

                                memcpy(wqe, addr, len);
                                wqe += len;
                                seg_len += len;
                                off += len;

Note the memcpy to the work queue element ("wqe"). This buffer is completely
independent of the original data buffer. Unless memcpy started behaving
differently recently, this does not seem like something that can produce
the symptoms you are describing.
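
To make the point concrete, here is a minimal sketch of the same idea
outside of libmlx4. Everything in it (struct demo_wqe, demo_post_inline,
DEMO_WQE_SIZE) is a made-up stand-in of mine, not the real WQE layout or
the real post-send path; it only shows that once the payload has been
memcpy'd into a separate buffer, a later write to the source buffer cannot
change what was staged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the WQE area; NOT the libmlx4 layout. */
#define DEMO_WQE_SIZE 256

struct demo_wqe {
    uint8_t bytes[DEMO_WQE_SIZE];
    size_t  used;
};

/* Copy the caller's payload into the WQE-like buffer, the way an
 * inline send copies data at post time. Returns 0 on success. */
static int demo_post_inline(struct demo_wqe *wqe, const void *payload,
                            size_t len)
{
    assert(wqe != NULL && payload != NULL);
    if (len > DEMO_WQE_SIZE)
        return -1;              /* payload too large to inline */
    memcpy(wqe->bytes, payload, len);
    wqe->used = len;
    return 0;
}

/* Returns 1 if the staged copy survives a later write to the source. */
static int demo_copy_is_independent(void)
{
    struct demo_wqe wqe = { .used = 0 };
    char buf[] = "payload";

    if (demo_post_inline(&wqe, buf, sizeof buf) != 0)
        return 0;
    buf[0] = '\0';              /* what the application does after posting */
    return wqe.used == sizeof buf &&
           memcmp(wqe.bytes, "payload", sizeof buf) == 0;
}
```

If the real inline path behaves like this, zeroing the first byte of the
send buffer after ibv_post_send() returns should not be visible on the wire.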

> I use
> a mutex-protected FIFO of send buffers to make sure only one thread is
> able to acquire a particular send buffer at a time, and my application
> does not modify a send buffer after it passes it to my message-sending
> method.

While not directly related to the issue at hand, you might want to consider
a design where each thread is using an independent work queue, to prevent
data dependencies and locking between threads.

> I've also replicated this with a single-threaded version of
> the application; something as simple as zeroing out the first byte of
> the send buffer on the line after ibv_post_send() is enough to
> occasionally trigger this behavior.
>

Can you provide a short reproducer for this?

Could it be that you are calling fork() (or something that forks
internally, such as system()) after registering the memory and before
calling the post send verb? That can produce similar behavior: the pages
become copy-on-write, so the physical pages that were registered for the
work queue may no longer be the ones libmlx4 is writing into.
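
If fork does turn out to be involved, the usual remedy is to call
ibv_fork_init() before any other verbs call, which makes libibverbs mark
registered memory with madvise(MADV_DONTFORK). The sketch below shows only
that underlying mechanism on a plain page-aligned buffer, assuming Linux.
The helper names (demo_alloc_dontfork, demo_free_dontfork) are hypothetical
and this is not what your application or libibverbs literally does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate page-aligned memory and exclude it from
 * fork(), so the parent's pages are never made copy-on-write.
 * On success returns the buffer and stores the rounded length. */
static void *demo_alloc_dontfork(size_t len, size_t *out_len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t rounded;

    assert(len > 0 && out_len != NULL);
    if (page <= 0)
        return NULL;
    /* madvise() works on whole pages, so round the length up. */
    rounded = (len + (size_t)page - 1) / (size_t)page * (size_t)page;
    if (posix_memalign(&buf, (size_t)page, rounded) != 0)
        return NULL;
    if (madvise(buf, rounded, MADV_DONTFORK) != 0) {
        free(buf);
        return NULL;
    }
    *out_len = rounded;
    return buf;
}

/* Restore normal fork behavior before handing the pages back. */
static void demo_free_dontfork(void *buf, size_t len)
{
    if (buf == NULL)
        return;
    (void)madvise(buf, len, MADV_DOFORK);
    free(buf);
}
```

Note that a child created by fork() cannot touch such a range at all, so
this only suits buffers that the parent alone uses.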

Thanks,
--Shachar

N.B. you might want to avoid top-posting in OSS mailing lists, as it makes it
hard to maintain context.
