Howdy Roland,

Here are more details on why avoiding memcpy() appears to be a good idea under
Valgrind.

Valgrind replaces the libc memcpy() call with a simple version that copies one
byte at a time (in order).  If libmlx4 is not built with --with-valgrind,
Valgrind considers each write an invalid write and spends a very long time
after each write updating its error database.  We experimented with replacing
the Valgrind error-database update with a configurable spin loop and found that
with a delay of around 100,000 cycles between writes in the 'byte memcpy' when
writing to the blueflame page, a sent message gets lost/misplaced in a simple
testcase with two MPI_Barrier calls back to back (resulting in a hang because
not all processes exit the first barrier).  Our theory is that the card sees
the byte writes to the blueflame page and, because of the long delay, acts on
the information before it has all been written out (and thus reads wrong data).

With the patched version, longs are written to the blueflame page, and it now 
happens to work under Valgrind.   Of course, it may be luck.   It could be 
that writing longs is 4-8 times more efficient, so the delay is no longer big 
enough to matter.   It could be that it simply fixes our testcase in that the 
card is still reading early but happens to get the correct data in this case. 
   Or it could be that writing longs actually fixes things and that writing 
bytes is a bad idea (since this is user code, a context switch could happen at 
any time and have the same effect as the delay).   In any case, it fixes our 
testcases, and having control over how data is written to the blueflame page 
seems like a good idea regardless.

When users use our Valgrind wrapper scripts (they don't always), we LD_PRELOAD 
a patched version of this library compiled with --with-valgrind, which avoids 
the delay to begin with (and runs much faster under Valgrind as a result).   

I hope this clarifies things a little.
-John G.

P.S. If a context switch happens during a write to the blueflame page or other 
memory-mapped NIC addresses, could bad things happen?   This is why I 
continued the detailed hunt after figuring out that compiling with 
--with-valgrind resolved our problems, since similar delays could occur during 
context switches.


At 03:51 PM 6/11/2009, Roland Dreier wrote:

> > Our MPI folks detected a hang while using Valgrind with our ConnectX cards.
> > After trying the current master branch in git we solved the problem by 
> > applying
> > this patch from the git tree to v1.0.
> > 
> >     Don't use memcpy() to write blueflame sends
>
>Didn't realize this had that implication (I thought it just made
>blueflame not give latency benefit).  Anyway yes it has been a while
>since a libmlx4 release.  I'll make one soon, probably next week.

---------------------------------------------------------------------
John C. Gyllenhaal                             Bldg:  453   Rm: 4151
Computation Department                 Email: [email protected]
Lawrence Livermore National Lab  Voice: (925) 424-5485
7000 East Ave, L-557                         Fax:   (925) 423-6961
Livermore, CA. 94551-0808              URL: http://www.llnl.gov/icc/lc/DEG
--------------------------------------------------------------------- 

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

