The video you provided was most helpful -- thank you!

I can confirm that the hwloc message you sent (and your posts to the hwloc-users 
list) indicates that hwloc is getting confused by a buggy BIOS, but the confusion 
is limited to the L3 cache, and that shouldn't affect the binding that OMPI is 
doing.
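
(Separately, if you want to double-check hwloc's view of the node yourself, the 
hwloc command-line tools can dump it -- assuming they're installed on the compute 
nodes.  For example:

      hwloc-ls      # textual topology tree: sockets, caches, cores, PUs
      hwloc-info    # per-level object counts -- handy to confirm all 48 cores are seen

The buggy-BIOS warning about the L3 cache will likely still show up there, but the 
core/PU counts should be correct.)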

Can I ask you to do two more tests:

1. Run with "--report-bindings" and send the output.  It'll pretty-print where 
OMPI thinks it is binding each process.  Ralph already asked you to run a few 
tests, and this output may simply confirm what you sent previously, but he's 
more of an ORTE expert than I am -- the --report-bindings output is easily 
parseable for the rest of us (see the example command lines after item 2).  
:-)

2. Run with "--bind-to none" and see if that helps.  That is, if, per #1, OMPI 
thinks it is binding correctly (i.e., each of the 48 processes is being bound 
to a unique core), then perhaps hwloc is doing something wrong in the actual 
binding (e.g., binding the 48 processes only among the lower 32 cores).
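
To be concrete, here are the kinds of command lines I mean -- substitute your 
real application and its arguments for "./my_app" (just a placeholder name), 
plus whatever hostfile / -np options you normally use:

      # test 1: show where OMPI binds each rank
      mpirun -np 48 --report-bindings ./my_app

      # test 2: turn binding off entirely and compare the timings
      mpirun -np 48 --bind-to none ./my_app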



On Aug 22, 2014, at 6:49 AM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi again,
> 
> I generated a video that demonstrates the problem; for brevity I did
> not run the full job, but I'm providing the timing below. If you'd
> like me to record a full run, just let me know -- but as I said in
> my previous email, the 32 procs drop to 1 after about a minute and the
> computation then finishes on a single processor.
> 
> With openmpi-1.6.5:
> 
>       real    1m13.186s
>       user    0m0.044s
>       sys     0m0.059s
> 
> With openmpi-1.8.2rc4:
> 
>       real    13m42.998s
>       user    0m0.070s
>       sys     0m0.066s
> 
> Exact same invocation both times, exact same job submitted. Here's a link
> to the video:
> 
>       http://clusty.ast.villanova.edu/aprsa/files/test.ogv
> 
> Please let me know if I can provide you with anything further.
> 
> Thanks,
> Andrej
> 
>> Ah, that sheds some light. There is indeed a significant change
>> between earlier releases and the 1.8.1 and above that might explain
>> what he is seeing. Specifically, we no longer hammer the cpu while in
>> MPI_Finalize. So if 16 of the procs are finishing early (which the
>> output would suggest), then they will go into a "lazy" finalize state
>> while they wait for the rest of the procs to complete their work.
>> 
>> In contrast, prior releases would continue at 100% cpu while they
>> polled to see if the other procs were done.
>> 
>> We did this to help save power/energy, and because users had asked
>> why the cpu utilization remained at 100% even though procs were
>> waiting in finalize.
>> 
>> HTH
>> Ralph
>> 
>> On Aug 21, 2014, at 5:55 PM, Christopher Samuel
>> <sam...@unimelb.edu.au> wrote:
>> 
>>> On 22/08/14 10:43, Ralph Castain wrote:
>>> 
>>>> From your earlier concerns, I would have expected only to find 32
>>>> of them running. Was that not the case in this run?
>>> 
>>> As I understand it, in his original email he mentioned that with
>>> 1.6.5 all 48 processes were running at 100% CPU, and he was wondering
>>> whether the buggy BIOS that caused the hwloc issues he reported on the
>>> hwloc-users list might be the cause of this regression in
>>> performance.
>>> 
>>> All the best,
>>> Chris
>>> -- 
>>> Christopher Samuel        Senior Systems Administrator
>>> VLSCI - Victorian Life Sciences Computation Initiative
>>> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>>> http://www.vlsci.org.au/      http://twitter.com/vlsci
>>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
