Thank you very much, Ralph.

Heck, it had to be something stupid like this.
Sorry for taking your time.
Yes, switching from "slots" to "slot" fixes the rankfile problem,
and both cases work.

I must have been carried along by the hostfile syntax,
where the "slots" reign, but when it comes to binding,
obviously for each process rank one wants a single "slot"
(unless the process is multi-threaded, which is what I need to set up).

I will write 100 times on the blackboard:
"Slots in the hostfile, slot in the rankfile,
slot is singular, to err is plural."
... at least until Ralph's new plural-forgiving parsing rule
makes it to the code.

Regards,
Gus Correa




Ralph Castain wrote:
I normally hide my eyes when rankfiles appear,
but since you provide so much help on this list yourself... :-)

I believe the problem is that you have the keyword "slots" wrong -
it is supposed to be "slot":

    rank 1=host1 slot=1:0,1
    rank 0=host2 slot=0:*
    rank 2=host4 slot=1-2
    rank 3=host3 slot=0:1,1:0-2

Hence the flex parser gets confused...

I didn't write this code, but it seems to me that a little more leeway (e.g., allowing "slots" as well as "slot") would be more appropriate. If you try the revision and it works, I'll submit a change to accept both syntax options.
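Until a change like that lands, a rankfile can be normalized before it is handed to mpiexec. The sketch below is only an illustration of the "accept both spellings" idea: it is not Open MPI's flex parser, and the helper names (normalize_rankfile_line, normalize_rankfile) are hypothetical. It assumes each entry has the form "rank N=host slot=spec" as in the example above.

```python
import re

# Matches the plural keyword only where a rankfile entry expects "slot=",
# i.e. "rank <N>=<host> slots=<spec>". Hypothetical helper, not Open MPI code.
_PLURAL = re.compile(r"^(\s*rank\s+\d+\s*=\s*\S+\s+)slots(\s*=)", re.IGNORECASE)


def normalize_rankfile_line(line: str) -> str:
    """Rewrite 'slots=' to 'slot=' in a single rankfile entry."""
    return _PLURAL.sub(r"\1slot\2", line)


def normalize_rankfile(text: str) -> str:
    """Apply the rewrite to every line; lines already using 'slot=' are untouched."""
    return "\n".join(normalize_rankfile_line(line) for line in text.splitlines())
```

Anchoring the pattern to the whole "rank N=host" prefix keeps the rewrite from touching a host name that happens to contain the word "slots".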

On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:

Dear Open MPI pros

I am having trouble getting the mpiexec rankfile option right.
I would appreciate any help in solving the problem.

Also, is there a way to tell Open MPI to print out its own numbering
of the "slots", and perhaps how they're mapped to socket:core pairs?

I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
on Linux CentOS 5.2 x86_64.
This cluster has nodes with dual AMD Opteron quad-core processors,
a total of 8 cores per node.
I enclose a snippet of /proc/cpuinfo below.

I build the rankfile on the fly from the $PBS_NODEFILE.
The mpiexec command line is:

mpiexec \
        -v \
        -np ${NP} \
        -mca btl openib,sm,self \
        -tag-output \
        -report-bindings \
        -rf $my_rf \
        -mca paffinity_base_verbose 1 \
        connectivity_c -v


I tried two different ways to specify the slots on the rankfile:

*First way (sequential "slots" on each node):

rank   0=node34 slots=0
rank   1=node34 slots=1
rank   2=node34 slots=2
rank   3=node34 slots=3
rank   4=node34 slots=4
rank   5=node34 slots=5
rank   6=node34 slots=6
rank   7=node34 slots=7
rank   8=node33 slots=0
rank   9=node33 slots=1
rank  10=node33 slots=2
rank  11=node33 slots=3
rank  12=node33 slots=4
rank  13=node33 slots=5
rank  14=node33 slots=6
rank  15=node33 slots=7


*Second way (slots in socket:core style):

rank   0=node34 slots=0:0
rank   1=node34 slots=0:1
rank   2=node34 slots=0:2
rank   3=node34 slots=0:3
rank   4=node34 slots=1:0
rank   5=node34 slots=1:1
rank   6=node34 slots=1:2
rank   7=node34 slots=1:3
rank   8=node33 slots=0:0
rank   9=node33 slots=0:1
rank  10=node33 slots=0:2
rank  11=node33 slots=0:3
rank  12=node33 slots=1:0
rank  13=node33 slots=1:1
rank  14=node33 slots=1:2
rank  15=node33 slots=1:3
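For reference, here is a minimal sketch of how both layouts could be generated from a node list such as the contents of $PBS_NODEFILE, using the singular "slot" keyword that Ralph's reply identifies as the correct one. The function name build_rankfile and its cores_per_socket parameter are hypothetical, and the socket:core numbering assumes that consecutive slots fill one socket before moving to the next, as in the /proc/cpuinfo snippet at the end of this message.

```python
def build_rankfile(hosts, cores_per_socket=None):
    """Build rankfile text from a list with one host name per MPI rank.

    With cores_per_socket=None the slots are numbered sequentially per host
    (first layout); otherwise they are written as socket:core (second layout).
    """
    next_slot = {}  # per-host counter of slots handed out so far
    lines = []
    for rank, host in enumerate(hosts):
        n = next_slot.get(host, 0)
        next_slot[host] = n + 1
        if cores_per_socket is None:
            spec = str(n)
        else:
            # Assumption: slot n sits on socket n // cores_per_socket.
            socket, core = divmod(n, cores_per_socket)
            spec = f"{socket}:{core}"
        lines.append(f"rank {rank}={host} slot={spec}")
    return "\n".join(lines) + "\n"


# Example: two 8-core nodes, listed the way Torque lists them in $PBS_NODEFILE.
hosts = ["node34"] * 8 + ["node33"] * 8
print(build_rankfile(hosts, cores_per_socket=4))
```

In a real job script the hosts list would be read from the file named by $PBS_NODEFILE, one line per rank.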

***

I get the error messages below.
I am scratching my head to full baldness to try to understand them.

They seem to suggest that my rankfile syntax
(which I copied from the FAQ and man mpiexec) is wrong,
or that Open MPI is not parsing it as I expected.
Or is it perhaps that it doesn't like the numbers I am using for the
various slots in the rankfile?
The error messages also complain about
node allocation or oversubscribed slots,
but the nodes were allocated by Torque, and the rankfiles were
written with no intent to oversubscribe.

*First rankfile error:

--------------------------------------------------------------------------
Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
Please review your rank-slot assignments and your host allocation to ensure
a proper match.

--------------------------------------------------------------------------

... etc, etc ...

*Second rankfile error:

--------------------------------------------------------------------------
Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
Please review your rank-slot assignments and your host allocation to ensure
a proper match.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

... etc, etc ...

**********

I am stuck.
Any help is much appreciated.
Thank you.

Gus Correa



*****************************
Snippet of /proc/cpuinfo
*****************************

processor       : 0
physical id     : 0
core id         : 0
siblings        : 4
cpu cores       : 4

processor       : 1
physical id     : 0
core id         : 1
siblings        : 4
cpu cores       : 4

processor       : 2
physical id     : 0
core id         : 2
siblings        : 4
cpu cores       : 4

processor       : 3
physical id     : 0
core id         : 3
siblings        : 4
cpu cores       : 4

processor       : 4
physical id     : 1
core id         : 0
siblings        : 4
cpu cores       : 4

processor       : 5
physical id     : 1
core id         : 1
siblings        : 4
cpu cores       : 4

processor       : 6
physical id     : 1
core id         : 2
siblings        : 4
cpu cores       : 4

processor       : 7
physical id     : 1
core id         : 3
siblings        : 4
cpu cores       : 4
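
The snippet above can be turned into a processor-to-socket:core table with a few lines of Python. This is only a sketch of the Linux kernel's view: it assumes the "processor", "physical id" and "core id" fields are present, and it does not show Open MPI's own slot numbering, which may or may not follow the same order. The function name parse_cpuinfo is hypothetical.

```python
def parse_cpuinfo(text):
    """Return {processor: (physical_id, core_id)} from /proc/cpuinfo text."""
    wanted = ("processor", "physical id", "core id")
    mapping = {}
    current = {}
    # The extra "" guarantees the last stanza is flushed even without a
    # trailing blank line.
    for line in text.splitlines() + [""]:
        if not line.strip():
            # End of a stanza: record it only if all three fields were seen.
            if all(key in current for key in wanted):
                mapping[int(current["processor"])] = (
                    int(current["physical id"]),
                    int(current["core id"]),
                )
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return mapping


def format_map(mapping):
    """Render one 'processor N -> socket:core' line per logical processor."""
    return "\n".join(
        f"processor {cpu} -> {sock}:{core}"
        for cpu, (sock, core) in sorted(mapping.items())
    )
```

On a node, calling format_map(parse_cpuinfo(open("/proc/cpuinfo").read())) would list the pairs to compare against the rankfile.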

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


