On Fri, Jun 12, 2009 at 09:52:15AM -0700, John Gyllenhaal wrote:

> Valgrind replaces the libc memcpy call with a simple version that
> copies a byte at a time (in order). If libmlx4 is not built with
> --with-valgrind, valgrind considers each write an invalid write and
> spends a very long time after each write updating its error
> database. We experimented with replacing the Valgrind error
> database update with a configurable spin loop and found that if we
> put a delay of around 100,000 cycles between writes in the 'byte
> memcpy' when writing to the blueflame page, that a sent message gets
> lost/misplaced in a simple testcase with two MPI_barriers back to
> back (resulting in a hang because not all processes exit the first
> barrier). Our theory is the card sees 'byte' writes to the
> blueflame page and due to the long delay, uses the information
> before it is all written out (and thus getting wrong info).
There are lots of ways adding a timing delay here can cause problems. x86 CPUs have write-combining buffers that can be enabled and will aggregate byte writes into larger transfers; they do flush based on a timer in some cases. Some devices that do this also have internal aggregation buffers that will flush in certain cases, often on non-sequential writes or, again, on timers.

I'm not sure what the chip's expectation is for the actual bus transfers in this area, but I think you are right to be concerned about atomicity, even when transferring based on longs. For instance, you do not want to rely upon write combining to create a single PCI-E transaction out of the message to ensure atomicity in a multi-process environment. This will not work reliably 100% of the time.

It is worth looking at using SSE instructions to burst transfer the entire message in one atomic go.

Jason
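
As an illustration of the SSE suggestion in the reply above, here is a minimal sketch of a copy routine that issues explicit 16-byte stores instead of calling libc memcpy on the destination. It is not libmlx4 code: the function name bf_wide_copy, the 16-byte alignment requirement, and the choice of non-temporal stores are assumptions made for the example. Whether the device actually sees a single bus transaction still depends on the CPU, the memory mapping, and the chipset, so this only shows the shape of such a copy. A plain 64-bit store loop is used as a fallback when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_sfence */
#endif

/*
 * Hypothetical helper (not part of libmlx4): copy `bytes` bytes from src
 * to dst using explicit wide stores.
 * Assumptions: dst is 16-byte aligned and bytes is a multiple of 16.
 */
void bf_wide_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(bytes % 16 == 0);

#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i++) {
        /* Unaligned load from ordinary memory, 16-byte non-temporal store. */
        __m128i v = _mm_loadu_si128(s + i);
        _mm_stream_si128(d + i, v);
    }
    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#else
    /* Fallback: one 64-bit store per iteration. The may_alias attribute is
     * a GCC/Clang extension that permits writing through this type. */
    typedef uint64_t __attribute__((may_alias)) u64_alias;
    volatile u64_alias *d = (volatile u64_alias *)dst;
    const unsigned char *s = (const unsigned char *)src;

    for (size_t i = 0; i < bytes / 8; i++) {
        uint64_t w;
        memcpy(&w, s + 8 * i, sizeof w);   /* read side is ordinary memory */
        d[i] = w;
    }
#endif
}
```

A caller would pass the mapped destination page and a fully built message, for example bf_wide_copy(page, msg, 64) for a 64-byte message. The message is assembled in ordinary memory first so that the destination is written only once, in order, with wide stores.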
