> Valgrind replaces the libc memcpy call with a simple version that
> copies a byte at a time (in order). If libmlx4 is not built with
> --with-valgrind, valgrind considers each write an invalid write and
> spends a very long time after each write updating its error database.
> We experimented with replacing the Valgrind error database update
> with a configurable spin loop and found that if we put a delay of
> around 100,000 cycles between writes in the 'byte memcpy' when
> writing to the blueflame page, that a sent message gets
> lost/misplaced in a simple testcase with two MPI_barriers back to
> back (resulting in a hang because not all processes exit the first
> barrier). Our theory is the card sees 'byte' writes to the blueflame
> page and due to the long delay, uses the information before it is all
> written out (and thus getting wrong info).

That makes sense. The HW documentation says that blueflame writes must
be done in aligned chunks of at least 4 bytes, so it's not surprising
that byte writes confuse the HW in some cases.

 - R.
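
To make the alignment constraint concrete, here is a minimal sketch of a copy loop that only ever issues aligned 4-byte stores, as opposed to the byte-at-a-time memcpy described above. The name copy_aligned_words is hypothetical and this is not the libmlx4 code; it also assumes the caller has already mapped the destination and handles any required write barriers separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libmlx4): copy `bytes` bytes from src
 * to dst using only aligned 32-bit stores.
 *
 * The destination is declared volatile so the compiler emits one store
 * per 32-bit word and does not replace the loop with a call to memcpy,
 * which a tool such as Valgrind could substitute with a byte-wise copy.
 */
void copy_aligned_words(volatile uint32_t *dst, const uint32_t *src,
                        size_t bytes)
{
	/* Preconditions: 4-byte aligned pointers, whole number of words. */
	assert(((uintptr_t)dst % sizeof(uint32_t)) == 0);
	assert(((uintptr_t)src % sizeof(uint32_t)) == 0);
	assert(bytes % sizeof(uint32_t) == 0);

	size_t words = bytes / sizeof(uint32_t);
	for (size_t i = 0; i < words; i++)
		dst[i] = src[i];	/* one aligned 4-byte store per iteration */
}
```

This only shows the shape of the stores at the C level. Whether the device sees them as combined or separate writes depends on how the page is mapped and on the platform, which the sketch does not address.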
