How odd - can you run it with --display-devel-map and send that along? It will 
give us a detailed statement of where it thinks everything should run.
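
For reference, appended to the command line from your last run it would look
something like this (same hostfile and np, just one more option):

  /usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
      --display-allocation --display-devel-map ./test.py > test.std 2> test.ste

The devel map should show, node by node, where each process was mapped and
how it was bound.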


On Aug 21, 2014, at 2:49 PM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi Ralph,
> 
> Thanks for your reply!
> 
>> One thing you might want to try: add this to your mpirun cmd line:
>> 
>> --display-allocation
>> 
>> This will tell you how many slots we think we've been given on your
>> cluster.
> 
> I tried that with 1.8.2rc4; this is what I get:
> 
> ======================   ALLOCATED NODES   ======================
>        node2: slots=48 max_slots=48 slots_inuse=0 state=UNKNOWN
> =================================================================
> 
> I forgot to mention previously that mpirun uses all cores when run on
> localhost; the 32-proc cap only appears when running on another host
> (--hostfile hosts). I'm attaching an htop snapshot of the most recent
> run. The job was invoked with:
> 
> /usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts
>  --display-allocation ./test.py > test.std 2> test.ste
> 
> test.ste contains the hwloc error I mentioned in my previous post:
> 
> ****************************************************************************
> * hwloc has encountered what looks like an error from the operating system.
> *
> * object (L3 cpuset 0x000003f0) intersection without inclusion!
> * Error occurred in topology.c line 760
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
> 
> Hope this helps,
> Andrej
> 
> 
>> On Aug 21, 2014, at 12:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> Starting early in the 1.7 series, we began to bind procs by default:
>>> to core when np <= 2, and to socket when np > 2. Is it possible
>>> this is what you are seeing?
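>>> 
>>> (A quick way to check, reusing your hostfile, would be something along
>>> these lines:
>>> 
>>>   mpirun -np 48 --hostfile hosts --report-bindings ./test.py
>>> 
>>> --report-bindings prints where each rank ends up bound; adding
>>> --bind-to none would disable the default binding altogether if that
>>> turns out to be the limiting factor.)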
>>> 
>>> 
>>> On Aug 21, 2014, at 12:45 PM, Andrej Prsa <aprs...@gmail.com> wrote:
>>> 
>>>> Dear devels,
>>>> 
>>>> I have been trying out the 1.8.2 release candidates recently and found
>>>> a show-stopping problem on our cluster: any job started with more than
>>>> 32 processes only ever uses 32 cores per node (our nodes have 48
>>>> cores). We see identical behavior with 1.8.2rc4, 1.8.2rc2, and 1.8.1.
>>>> The same programs show no such issue with 1.6.5, where all 48 cores
>>>> per node are used. Our system runs torque/maui, but the problem shows
>>>> up even when running mpirun directly.
>>>> 
>>>> I am attaching the hwloc topology in case it helps -- I am aware of
>>>> buggy BIOS code that trips up hwloc, but I don't know whether that
>>>> is related. I am happy to help debug if you can provide guidance.
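>>>> 
>>>> (For reference, the attached cluster.output / cluster.tar.bz2 come from
>>>> the hwloc gather script, i.e. something along the lines of:
>>>> 
>>>>   hwloc-gather-topology.sh cluster
>>>> 
>>>> which writes out the matching .output and .tar.bz2 pair.)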
>>>> 
>>>> Thanks,
>>>> Andrej
>>>> <cluster.output><cluster.tar.bz2>
>>> 
>> 
> <htop.jpg>
