Jeff Squyres wrote:
On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
The thing I was wondering about was memory barriers. E.g., you
initialize stuff and then post the FIFO pointer. The other guy sees
the FIFO pointer before the initialized memory.
We do do memory barriers during that SM startup sequence. I haven't
checked in a while, but I thought we were doing the right kinds of
barriers in the right order...
There are certainly *some* barriers. The particular scenario I asked
about didn't seem protected against (IMHO), but I certainly don't
understand these issues and remain cautious about any guesses I make
until I can demonstrate the problem and a solution.
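
For illustration only, the scenario discussed above (initialize, then
post the pointer, with the peer possibly seeing the pointer first) can
be sketched with C11 atomics. This is a minimal sketch and not the
actual SM startup code: demo_fifo_t, demo_post_fifo, demo_read_fifo and
demo_run are hypothetical names, and two threads stand in for two
processes sharing a segment. The release store keeps the initializing
writes from becoming visible after the pointer, and the acquire load
keeps the reader from looking at the FIFO contents before it has seen
the pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared-memory FIFO; not Open MPI's type. */
typedef struct {
    int head;
    int tail;
    int slots[4];
} demo_fifo_t;

static demo_fifo_t demo_storage;

/* The "posted" pointer that the peer polls; NULL means "not ready yet". */
static _Atomic(demo_fifo_t *) demo_posted = NULL;

/* Writer: initialize everything first, then publish the pointer with
 * release semantics so the stores above cannot be observed after it. */
static void demo_post_fifo(void)
{
    demo_storage.head = 0;
    demo_storage.tail = 0;
    for (int i = 0; i < 4; ++i) {
        demo_storage.slots[i] = 100 + i;
    }
    atomic_store_explicit(&demo_posted, &demo_storage, memory_order_release);
}

/* Reader: poll with acquire semantics; once the pointer is non-NULL,
 * the writer's initialization is guaranteed to be visible. */
static int demo_read_fifo(void)
{
    demo_fifo_t *f;
    while ((f = atomic_load_explicit(&demo_posted,
                                     memory_order_acquire)) == NULL) {
        /* spin until the writer posts the pointer */
    }
    int sum = f->head + f->tail;
    for (int i = 0; i < 4; ++i) {
        sum += f->slots[i];
    }
    return sum;
}

static void *demo_writer(void *arg)
{
    (void)arg;
    demo_post_fifo();
    return NULL;
}

/* Runs writer and reader concurrently; returns the sum the reader saw. */
static int demo_run(void)
{
    pthread_t t;
    atomic_store(&demo_posted, NULL);
    int rc = pthread_create(&t, NULL, demo_writer, NULL);
    assert(rc == 0);
    (void)rc;
    int sum = demo_read_fifo();
    pthread_join(t, NULL);
    return sum;
}
```

Calling demo_run() from main() exercises the pair. If both memory
orders were weakened to memory_order_relaxed, a weakly ordered machine
would be allowed to let the reader see the pointer before the
initialized fields, which is the failure mode asked about.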
Regarding "demonstrating the problem", I see the Sun MTT logs show some
number of Init errors without mca_coll_hierarch involved. I'll try
rerunning on the same machines and see if I can trigger the problem.
But George mentioned on the call today that they may have found the
issue, but they're testing it. He didn't explain what the issue was
in case he was wrong. ;-)
'kay.