How odd - can you run it with --display-devel-map and send that along? It will give us a detailed statement of where it thinks everything should run.
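
For reference, reusing the paths and hostfile from your earlier run, the invocation would look something like:

  /usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
      --display-devel-map ./test.py > test.std 2> test.ste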
On Aug 21, 2014, at 2:49 PM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi Ralph,
>
> Thanks for your reply!
>
>> One thing you might want to try: add this to your mpirun cmd line:
>>
>> --display-allocation
>>
>> This will tell you how many slots we think we've been given on your
>> cluster.
>
> I tried that using 1.8.2rc4; this is what I get:
>
> ======================   ALLOCATED NODES   ======================
>        node2: slots=48 max_slots=48 slots_inuse=0 state=UNKNOWN
> =================================================================
>
> I forgot to mention previously that mpirun uses all cores on
> localhost; it is only when running on another host (--hostfile hosts)
> that the 32-proc cap is observed. I'm attaching a snapshot of the most
> recent run. The job was invoked by:
>
> /usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
>     --display-allocation ./test.py > test.std 2> test.ste
>
> test.ste contains the hwloc error I mentioned in my previous post:
>
> ****************************************************************************
> * hwloc has encountered what looks like an error from the operating system.
> *
> * object (L3 cpuset 0x000003f0) intersection without inclusion!
> * Error occurred in topology.c line 760
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
>
> Hope this helps,
> Andrej
>
>
>> On Aug 21, 2014, at 12:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Starting early in the 1.7 series, we began to bind procs by default
>>> to cores when -np <= 2, and to sockets when -np > 2. Is it possible
>>> this is what you are seeing?
>>>
>>>
>>> On Aug 21, 2014, at 12:45 PM, Andrej Prsa <aprs...@gmail.com> wrote:
>>>
>>>> Dear devels,
>>>>
>>>> I have been trying out the 1.8.2 rcs recently and found a
>>>> show-stopping problem on our cluster. Running any job with more
>>>> than 32 processes will always employ only 32 cores per node (our
>>>> nodes have 48 cores). We are seeing identical behavior with
>>>> 1.8.2rc4, 1.8.2rc2, and 1.8.1. Running identical programs shows no
>>>> such issue with version 1.6.5, where all 48 cores per node are
>>>> working. While our system is running torque/maui, the problem is
>>>> evident even when running mpirun directly.
>>>>
>>>> I am attaching the hwloc topology in case that helps -- I am aware
>>>> of buggy BIOS code that trips up hwloc, but I don't know whether
>>>> that might be an issue here or not. I am happy to help debug if
>>>> you can provide me with guidance.
>>>>
>>>> Thanks,
>>>> Andrej
>>>>
>>>> <cluster.output><cluster.tar.bz2>
>
> <htop.jpg>
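
P.S. If the default binding policy is what's capping you at 32 procs, one
quick sanity check (just a suggestion, not a fix) would be to rerun with
binding disabled and the bindings reported:

  /usr/local/openmpi-1.8.2rc4/bin/mpirun -np 48 --hostfile hosts \
      --bind-to none --report-bindings ./test.py > test.std 2> test.ste

--bind-to none lets the procs float across all 48 cores, and
--report-bindings prints each rank's binding to stderr so we can see
exactly what the launcher did.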