Jeff Squyres wrote:

On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:

The thing I was wondering about was memory barriers.  E.g., you
initialize stuff and then post the FIFO pointer. The other guy sees the
FIFO pointer before the initialized memory.

We do use memory barriers during that SM startup sequence. I haven't checked in a while, but I thought we were doing the right kinds of barriers in the right order...

There are certainly *some* barriers. The particular scenario I asked about didn't seem protected against (IMHO), but I certainly don't understand these issues and remain cautious about any guesses I make until I can demonstrate the problem and a solution.
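
To make the scenario concrete, here is a minimal sketch of the ordering being asked about, written with C11 atomics and pthreads. It is not the actual SM startup code: fifo_t, posted_fifo, producer, consumer, and run_handoff are all hypothetical names, and two threads stand in for the two processes. The only point is that the pointer is published with a release store and picked up with an acquire load, which is what keeps the reader from seeing the pointer before the initialized fields.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared-memory FIFO (not Open MPI's real struct). */
typedef struct {
    int head;
    int tail;
    int payload;
} fifo_t;

static fifo_t fifo_storage;

/* The "posted" pointer the peer polls; NULL means "not published yet". */
static _Atomic(fifo_t *) posted_fifo = NULL;

static void *producer(void *arg) {
    (void)arg;
    /* Step 1: initialize stuff. */
    fifo_storage.head = 0;
    fifo_storage.tail = 0;
    fifo_storage.payload = 42;
    /* Step 2: post the pointer. The release store orders the writes above
     * before the pointer becomes visible to an acquiring reader. */
    atomic_store_explicit(&posted_fifo, &fifo_storage, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *out = arg;
    fifo_t *f;
    /* Spin until the pointer shows up. The acquire load pairs with the
     * release store, so the initialized fields are visible afterwards. */
    while ((f = atomic_load_explicit(&posted_fifo, memory_order_acquire)) == NULL) {
        /* busy-wait */
    }
    assert(f->head == 0 && f->tail == 0);
    *out = f->payload;
    return NULL;
}

/* Runs one producer/consumer handoff and returns the payload the consumer saw. */
int run_handoff(void) {
    int seen = -1;
    pthread_t p, c;
    atomic_store(&posted_fifo, (fifo_t *)NULL);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

If both operations were relaxed, or the pointer were a plain non-atomic variable, nothing would stop a weakly ordered machine from exposing the pointer before the initialized memory, which is the case I was asking about. Whether the real startup sequence has that gap is still something I'd need to demonstrate.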

Regarding "demonstrating the problem", I see the Sun MTT logs show some number of Init errors without mca_coll_hierarch involved. I'll try rerunning on the same machines and see if I can trigger the problem.

But George mentioned on the call today that they may have found the issue and are still testing it. He didn't explain what the issue was in case he was wrong. ;-)

'kay.
