I like the idea of putting the old libevent back as a separate component, just 
for performance/correctness comparisons. I think it would be good for the 
trunk, but for the release branches just choose one version to ship (so we 
don't confuse users).

-- Josh

On Oct 26, 2010, at 6:27 AM, Jeff Squyres (jsquyres) wrote:

> Btw it strikes me that we could put the old libevent back as a separate 
> component for comparisons. 
> 
> Sent from my PDA. No type good. 
> 
> On Oct 26, 2010, at 6:20 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> 
>> On Oct 25, 2010, at 9:29 PM, George Bosilca wrote:
>> 
>>> 1. Not all processes deadlock in btl_sm_add_procs. The process that set up 
>>> the shared memory area goes forward and blocks later in a barrier.
>> 
>> Yes, I'm seeing the same thing (I didn't include details like this in my 
>> post, sorry). I was running with -np 2 on a local machine and saw vpid=0 get 
>> stuck in opal_progress (because the first time through, seg_inited < 
>> n_local_procs).  vpid=1 increments seg_inited, so it never enters the loop 
>> that calls opal_progress() and simply continues on.
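>> 
>> The loop in question is roughly the following (paraphrasing from memory, not 
>> the exact trunk source; only seg_inited, n_local_procs, and opal_progress() 
>> are the real names, the rest is illustrative):
>> 
>>     /* sm add_procs path: spin, driving progress, until every local
>>        process has attached to the shared memory segment */
>>     while (seg_inited < n_local_procs) {
>>         opal_progress();
>>     }
>> 
>> If opal_progress() ends up blocking in poll() with an infinite timeout (see 
>> below), this wait hangs until a signal arrives.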
>> 
>>> 2. All other processes loop around opal_progress until they get a message 
>>> from every other process. The variable used for counting is somehow updated 
>>> correctly, but we still call opal_progress. I couldn't figure out whether we 
>>> loop more than we should, or whether opal_progress doesn't return. However, 
>>> both of these possibilities look very unlikely to me: the loop in 
>>> sm_add_procs is pretty straightforward, and I couldn't find any loops in 
>>> opal_progress. I wonder if some of the messages get lost during the exchange.
>> 
>> I had this problem, too, until I tried to use padb to get stack traces.  I 
>> noticed that when I ran padb, my blocked process unblocked itself and 
>> continued.  After more digging, I determined that my blocked process was, in 
>> fact, blocked in poll() with an infinite timeout.  padb (or any signal at 
>> all) caused it to unblock and continue.
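>> 
>> For anyone who wants to see that behavior in isolation, here's a tiny 
>> standalone program (not Open MPI code, just an illustration): poll() with a 
>> -1 timeout only returns when an fd becomes ready or a signal is delivered, 
>> in which case it fails with EINTR.
>> 
>>     #include <errno.h>
>>     #include <poll.h>
>>     #include <signal.h>
>>     #include <stdio.h>
>>     #include <string.h>
>>     #include <unistd.h>
>> 
>>     static void on_alarm(int sig) { (void) sig; /* just interrupt poll() */ }
>> 
>>     int main(void)
>>     {
>>         struct sigaction sa;
>>         struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
>>         int rc;
>> 
>>         memset(&sa, 0, sizeof(sa));
>>         sa.sa_handler = on_alarm;   /* no SA_RESTART: poll() returns EINTR */
>>         sigaction(SIGALRM, &sa, NULL);
>> 
>>         alarm(2);                   /* any signal works; padb's does too */
>>         rc = poll(&pfd, 1, -1);     /* -1 == block forever */
>>         if (rc < 0 && errno == EINTR) {
>>             printf("poll() blocked until the signal woke it up\n");
>>         }
>>         return 0;
>>     }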
>> 
>>> 3. If I unblock the situation by hand, everything goes back to normal. 
>>> NetPIPE runs to completion, but the performance is __really__ bad. On my 
>>> test machine I get around 2000 Mbps, when the expected value is at least 10 
>>> times higher. Similar findings on the latency side: we're now at 1.65 
>>> microseconds, up from the usual 0.35 we had before.
>> 
>> It's a feature!
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

