On Oct 25, 2010, at 9:29 PM, George Bosilca wrote:

> 1. Not all processes deadlock in btl_sm_add_procs. The process that set up the 
> shared memory area goes forward and blocks later in a barrier.

Yes, I'm seeing the same thing (I didn't include all details like this in my 
post, sorry). I was running with -np 2 on a local machine and saw vpid=0 get 
stuck in opal_progress (because the first time through, seg_inited < 
n_local_procs).  vpid=1 increments seg_inited and therefore doesn't enter the 
loop that calls opal_progress(), and therefore continues on.
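
To make the asymmetry concrete, here is a minimal single-process sketch of that
check-in-and-wait pattern. It is not the Open MPI source: sm_segment,
fake_progress() and sm_checkin_and_wait() are hypothetical names, and
fake_progress() simply pretends that one peer checks in per call, where the real
opal_progress() would drive the event loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared segment header; not the real layout. */
struct sm_segment {
    int seg_inited;   /* number of local procs that have checked in so far */
};

/* Stand-in for opal_progress(). In this sketch each call simulates one
 * peer checking in; the real function would poll for events instead. */
static void fake_progress(struct sm_segment *seg)
{
    seg->seg_inited++;
}

/* Check in, then spin until every local proc has checked in.
 * Returns how many times the progress function was called. */
static int sm_checkin_and_wait(struct sm_segment *seg, int n_local_procs)
{
    int spins = 0;

    assert(seg != NULL && n_local_procs > 0);

    seg->seg_inited++;                          /* announce ourselves */
    while (seg->seg_inited < n_local_procs) {   /* peers still missing */
        fake_progress(seg);
        spins++;
    }
    return spins;
}
```

With n_local_procs = 2, a caller that finds seg_inited == 0 (vpid=0 here) has to
enter the loop once, while a caller that finds seg_inited == 1 (vpid=1) skips the
loop entirely. If the progress call never returns, only the first caller hangs.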

> 2. All other processes loop around opal_progress until they get a 
> message from all other processes. The variable used for counting is somehow 
> updated correctly, but we still call opal_progress. I couldn't figure out if 
> we loop more than we should, or if opal_progress doesn't return. However, 
> both of these possibilities look very unlikely to me: the loop in 
> sm_add_procs is pretty straightforward, and I couldn't find any loops in 
> opal_progress. I wonder if some of the messages get lost in the exchange.

I had this problem, too, until I tried to use padb to get stack traces.  I 
noticed that when I ran padb, my blocked process un-blocked itself and 
continued.  After more digging, I determined that my blocked process was, in 
fact, blocked in poll() with an infinite timeout.  padb (or any signal at all) 
caused it to unblock and therefore continue.
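
The unblocking effect is easy to reproduce outside Open MPI. The sketch below is
a standalone illustration, not code from the tree: poll_until_signal() and
on_alarm() are hypothetical names. It blocks in poll() with a timeout of -1 and
no descriptors, and uses an interval timer to deliver SIGALRM, which makes
poll() return -1 with errno set to EINTR.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t got_signal = 0;

/* Signal handler: only records that the signal arrived. */
static void on_alarm(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Block in poll() with an infinite timeout and let SIGALRM interrupt it.
 * Returns the errno reported by poll(), 0 if poll() returned without
 * error, or -1 if the signal or timer setup failed. */
static int poll_until_signal(int delay_ms)
{
    struct sigaction sa;
    struct itimerval timer;
    int rc, saved_errno;

    assert(delay_ms > 0 && delay_ms < 1000);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    /* Repeating timer, so a signal that fires before poll() is entered
     * cannot leave us blocked forever. */
    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = delay_ms * 1000;
    timer.it_interval.tv_usec = delay_ms * 1000;
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return -1;

    rc = poll(NULL, 0, -1);          /* no fds, infinite timeout */
    saved_errno = errno;

    memset(&timer, 0, sizeof(timer));
    setitimer(ITIMER_REAL, &timer, NULL);   /* cancel the timer */

    return (rc == -1) ? saved_errno : 0;
}
```

Calling poll_until_signal(50) should come back with EINTR after roughly 50 ms.
Without the timer, the same poll() call would never return, which matches what I
saw until padb sent its signal.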

> 3. If I unblock the situation by hand, everything goes back to normal. 
> NetPIPE runs to completion but the performance is __really__ bad. On my 
> test machine I get around 2000 Mbps, when the expected value is at least 10 
> times more. The latency side is similar: we're now at 1.65 microseconds, 
> up from the usual 0.35 we had before.

It's a feature!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/