I like the idea of putting the old libevent back as a separate component, just for performance/correctness comparisons. I think it would be good for the trunk, but on the release branches we should ship just one version (so we don't confuse users).
-- Josh

On Oct 26, 2010, at 6:27 AM, Jeff Squyres (jsquyres) wrote:

> Btw, it strikes me that we could put the old libevent back as a separate
> component for comparisons.
>
> Sent from my PDA. No type good.
>
> On Oct 26, 2010, at 6:20 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>
>> On Oct 25, 2010, at 9:29 PM, George Bosilca wrote:
>>
>>> 1. Not all processes deadlock in btl_sm_add_procs. The process that set up
>>> the shared memory area goes forward and blocks later in a barrier.
>>
>> Yes, I'm seeing the same thing (I didn't include all details like this in my
>> post, sorry). I was running with -np 2 on a local machine and saw vpid=0 get
>> stuck in opal_progress (because the first time through, seg_inited <
>> n_local_procs). vpid=1 increments seg_inited, therefore doesn't enter
>> the loop that calls opal_progress(), and therefore continues on.
>>
>>> 2. All other processes loop around opal_progress until they get a
>>> message from all other processes. The variable used for counting is somehow
>>> updated correctly, but we still call opal_progress. I couldn't figure out
>>> if we loop more than we should, or if opal_progress doesn't return.
>>> However, both of these possibilities look very unlikely to me: the loop in
>>> sm_add_procs is pretty straightforward, and I couldn't find any loops
>>> in opal_progress. I wonder if some of the messages get lost in the exchange.
>>
>> I had this problem, too, until I tried to use padb to get stack traces. I
>> noticed that when I ran padb, my blocked process un-blocked itself and
>> continued. After more digging, I determined that my blocked process was, in
>> fact, blocked in poll() with an infinite timeout. padb (or any signal at
>> all) caused it to unblock and therefore continue.
>>
>>> 3. If I unblock the situation by hand, everything goes back to normal.
>>> NetPIPE runs to completion, but the performance is __really__ bad. On my
>>> test machine I get around 2000 Mbps, when the expected value is at least
>>> 10 times more. Similar findings on the latency side: we're now at 1.65
>>> microseconds, up from the usual 0.35 we had before.
>>
>> It's a feature!
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
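P.S. To make the failure mode above concrete, here is a minimal, self-contained sketch of the rendezvous pattern George and Jeff describe. The names seg_inited and n_local_procs follow the thread, and fake_progress() is invented for illustration; this is a simplification, not the actual btl_sm_add_procs code. Each process bumps a counter in a shared segment and then spins in a progress call until all local peers have checked in. If that progress call blocks forever (as opal_progress apparently does here, down in poll()), the counter is never re-tested and the process hangs exactly as described.

    /* Hedged sketch -- NOT the real Open MPI code.  Two processes share an
     * anonymous mmap()ed counter (playing the role of the sm segment's
     * seg_inited field), bump it, and spin until everyone has arrived. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N_LOCAL_PROCS 2

    static void fake_progress(void)
    {
        /* Stand-in for opal_progress().  The bug in the thread amounts to
         * this call blocking indefinitely in poll(), so the loop below
         * never gets to re-test the counter. */
        usleep(1000);
    }

    int main(void)
    {
        volatile int *seg_inited = mmap(NULL, sizeof(int),
                                        PROT_READ | PROT_WRITE,
                                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *seg_inited = 0;

        pid_t pid = fork();                  /* two "local procs" */
        __sync_fetch_and_add(seg_inited, 1); /* check in */

        while (*seg_inited < N_LOCAL_PROCS)
            fake_progress();                 /* spin until all have arrived */

        printf("pid %d: all %d local procs arrived\n",
               (int) getpid(), N_LOCAL_PROCS);
        if (pid > 0)
            wait(NULL);
        return 0;
    }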
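The padb behavior Jeff mentions is standard poll() semantics: with a negative timeout, poll() blocks indefinitely, and a signal whose handler is installed without SA_RESTART interrupts it, making it return -1 with errno set to EINTR, after which the caller falls back into its loop and can re-check the counter. A minimal demonstration (SIGALRM is just a stand-in for whatever padb delivers):

    /* Hedged sketch of why a signal wakes a process blocked in poll()
     * with an infinite timeout: the signal makes poll() return -1 with
     * errno == EINTR instead of blocking forever. */
    #include <errno.h>
    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void wakeup(int sig) { (void) sig; /* nothing to do */ }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = wakeup;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;               /* deliberately no SA_RESTART */
        sigaction(SIGALRM, &sa, NULL);

        alarm(2);                      /* the "padb" signal, 2s from now */

        struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
        int rc = poll(&pfd, 1, -1);    /* timeout -1: block forever */

        if (rc < 0 && errno == EINTR)
            printf("poll() interrupted by a signal -- caller can now "
                   "re-test the counter and continue\n");
        return 0;
    }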