The video you provided was most helpful -- thank you! I can confirm that the hwloc message you sent (and your posts to the hwloc-users list) indicate that hwloc is getting confused by a buggy BIOS, but the confusion is limited to the L3 cache, and that shouldn't affect the binding that OMPI is doing.
Can I ask you to do two more tests?

1. Run with "--report-bindings" and send the output (example invocations
below). It'll pretty-print where OMPI thinks it is binding each process.
Ralph already asked you to run a few tests, and this output may simply
confirm what you sent previously, but he's more of an ORTE expert than I
am -- the --report-bindings output is easy for the rest of us to parse.
:-)

2. Run with "--bind-to none" and see if that helps. I.e., if, per #1,
OMPI thinks it is binding correctly (i.e., each of the 48 processes is
being bound to a unique core), then perhaps hwloc is doing something
wrong in the actual binding (i.e., binding the 48 processes only among
the lower 32 cores).
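For example, the invocations might look something like this -- "your_app"
and the -np count are just placeholders for however you actually launch
the job:

    mpirun -np 48 --report-bindings ./your_app
    mpirun -np 48 --bind-to none ./your_app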
On Aug 22, 2014, at 6:49 AM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi again,
>
> I generated a video that demonstrates the problem; for brevity I did
> not run a full process, but I'm providing the timing below. If you'd
> like me to record a full process, just let me know -- but as I said in
> my previous email, 32 procs drop to 1 after about a minute and the
> computation then rests on a single processor to complete the job.
>
> With openmpi-1.6.5:
>
> real 1m13.186s
> user 0m0.044s
> sys  0m0.059s
>
> With openmpi-1.8.2rc4:
>
> real 13m42.998s
> user 0m0.070s
> sys  0m0.066s
>
> Exact same invocation both times, exact same job submitted. Here's a
> link to the video:
>
> http://clusty.ast.villanova.edu/aprsa/files/test.ogv
>
> Please let me know if I can provide you with anything further.
>
> Thanks,
> Andrej
>
>> Ah, that sheds some light. There is indeed a significant change
>> between earlier releases and the 1.8.1 and above that might explain
>> what he is seeing. Specifically, we no longer hammer the cpu while in
>> MPI_Finalize. So if 16 of the procs are finishing early (which the
>> output would suggest), then they will go into a "lazy" finalize state
>> while they wait for the rest of the procs to complete their work.
>>
>> In contrast, prior releases would continue at 100% cpu while they
>> polled to see if the other procs were done.
>>
>> We did this to help save power/energy, and because users had asked
>> why the cpu utilization remained at 100% even though procs were
>> waiting in finalize.
>>
>> HTH
>> Ralph
>>
>> On Aug 21, 2014, at 5:55 PM, Christopher Samuel
>> <sam...@unimelb.edu.au> wrote:
>>
>>> On 22/08/14 10:43, Ralph Castain wrote:
>>>
>>>> From your earlier concerns, I would have expected only to find 32
>>>> of them running. Was that not the case in this run?
>>>
>>> As I understand it, in his original email he mentioned that with
>>> 1.6.5 all 48 processes were running at 100% CPU, and was wondering
>>> if the buggy BIOS that caused hwloc the issues he reported on the
>>> hwloc-users list might be the cause for this regression in
>>> performance.
>>>
>>> All the best,
>>> Chris
>>>
>>> --
>>> Christopher Samuel        Senior Systems Administrator
>>> VLSCI - Victorian Life Sciences Computation Initiative
>>> Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
>>> http://www.vlsci.org.au/      http://twitter.com/vlsci

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
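For anyone following the thread, here's a minimal sketch of the pattern
Ralph describes above: most ranks finish early and sit in MPI_Finalize
waiting for the last one. The workload split below is purely
hypothetical (it is not Andrej's application), and the file/binary names
are just placeholders:

/* uneven.c -- illustrative only.  Rank 0 "works" much longer than the
 * other ranks, so everyone else reaches MPI_Finalize early and waits
 * there.  Whether the waiting ranks spin at 100% CPU or idle is the
 * behavior that changed between the 1.6 and 1.8 series.
 *
 * Build/run (names are placeholders):
 *   mpicc uneven.c -o uneven
 *   mpirun -np 48 --report-bindings ./uneven
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Deliberately uneven "work": rank 0 takes much longer. */
    if (rank == 0)
        sleep(60);   /* stand-in for the long-running part of the job */
    else
        sleep(5);    /* stand-in for the short part */

    printf("rank %d of %d done, entering MPI_Finalize\n", rank, size);

    /* Ranks 1..N-1 block here until rank 0 arrives. */
    MPI_Finalize();
    return 0;
}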