On Dec 19, 2012, at 11:26 AM, Handerson, Steven wrote:

> I fixed the problem we were experiencing by adding a barrier.
> The bug occurred in a piece of code that uses SEND (many times, over a
> loop, from the leader) and RECV (in the worker processes) to ship data
> from the head / leader to the processing nodes. I think what might have
> been happening is that this communication was getting mixed up with the
> following allreduce when there's no barrier.
>
> The bug shows up in Valgrind and dmalloc as a read from freed memory.
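[To make the reported pattern concrete, here is a minimal sketch. This is not Steven's code: the chunk count, message length, tag, and reduction are made up, but the overall shape matches the description above, with the MPI_Barrier workaround inserted between the point-to-point phase and the allreduce.]

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunks = 1000;   /* assumed: many sends per worker, in a loop */
    const int len    = 1024;   /* assumed message length */
    double *buf = calloc(len, sizeof(double));

    if (rank == 0) {
        /* Leader: ship data to every worker, chunk by chunk. */
        for (int c = 0; c < chunks; c++)
            for (int dst = 1; dst < size; dst++)
                MPI_Send(buf, len, MPI_DOUBLE, dst, c, MPI_COMM_WORLD);
    } else {
        /* Workers: receive each chunk from the leader. */
        for (int c = 0; c < chunks; c++)
            MPI_Recv(buf, len, MPI_DOUBLE, 0, c, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    /* The workaround described above: a barrier separating the
     * point-to-point phase from the following collective. */
    MPI_Barrier(MPI_COMM_WORLD);

    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}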
Hmm. This sounds sketchy (meaning: it *sounds* like this is a valid communication pattern, but it's impossible to tell without seeing the code).

> I might spend some time trying to make a small piece of code that
> reproduces this,

If you have the time, that would be great.

> but maybe this gives you some idea of what might be the issue,
> if it's something that should be fixed.
> Some more info: it happens even as far back as Open MPI 1.3.4, and even
> in the newest 1.6.3.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/