Hi,

today I tried different rankfiles and ran into a problem again. :-((

> > thank you very much for your patch. I have applied the patch to
> > openmpi-1.6.4rc4.
> > 
> > Open MPI: 1.6.4rc4r28022
> > : [B .][. .] (slot list 0:0)
> > : [. B][. .] (slot list 0:1)
> > : [B B][. .] (slot list 0:0-1)
> > : [. .][B .] (slot list 1:0)
> > : [. .][. B] (slot list 1:1)
> > : [. .][B B] (slot list 1:0-1)
> > : [B B][B B] (slot list 0:0-1,1:0-1)
> 
> That looks great.  I'll file a CMR to get this patch into 1.6.
> Unless you indicate otherwise, I'll assume this issue is understood 
> for 1.6.
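
For reference, the relation between a slot list and the binding map in the quoted output can be reproduced with a small sketch. It assumes two sockets with two cores each, as the maps above suggest, and it handles only the "socket:core" and "socket:first-last" forms that appear in this mail. The function name render_binding is made up for illustration and is not part of Open MPI.

```python
def render_binding(slot_list, sockets=2, cores_per_socket=2):
    """Render a slot list such as "0:0-1,1:0-1" as a map like "[B B][B B]".

    Assumption: the machine has `sockets` sockets with `cores_per_socket`
    cores each (2 x 2 here, matching the quoted output).
    """
    bound = set()
    for part in slot_list.split(","):
        socket, cores = part.split(":")
        first, _, last = cores.partition("-")
        last = last or first  # a single core such as "0:1" has no range
        for core in range(int(first), int(last) + 1):
            bound.add((int(socket), core))
    # One bracketed group per socket, "B" for a bound core, "." otherwise.
    return "".join(
        "[" + " ".join("B" if (s, c) in bound else "."
                       for c in range(cores_per_socket)) + "]"
        for s in range(sockets)
    )


for slots in ("0:0", "0:1", "0:0-1", "1:0", "1:1", "1:0-1", "0:0-1,1:0-1"):
    print(": %s (slot list %s)" % (render_binding(slots), slots))
```

The loop prints one line per slot list in the same layout as the quoted output, so the two can be compared directly.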

Rankfile rf_6 is the same as last time. I added one more line
in rf_7, and I swapped the order of the hosts in rf_8. Everything
is still fine with rf_6. With rf_7 I don't get any output for
rank 1, and with rf_8 I get an error. Both machines use the same
hardware.


sunpc1 rankfiles 106 cat rf_6
# mpiexec -report-bindings -rf rf_6 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1

sunpc1 rankfiles 107 cat rf_7
# mpiexec -report-bindings -rf rf_7 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1
rank 1=sunpc0 slot=0:0-1

sunpc1 rankfiles 108 cat rf_8
# mpiexec -report-bindings -rf rf_8 hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
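
To make it easier to compare the hosts a rankfile names with the hosts that are actually allocated, the entries can be listed with a short script. This is only a sketch: it assumes the "rank N=host slot=..." layout shown in the three files above and treats lines starting with "#" as comments, as those files suggest. The helper name parse_rankfile is hypothetical and has nothing to do with Open MPI's own parser.

```python
import re

# Matches lines such as "rank 1=sunpc1 slot=0:0-1" (layout taken from rf_6..rf_8).
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot=(\S+)$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot_list) tuples from rankfile text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError("unrecognized rankfile line: %r" % line)
        rank, host, slots = match.groups()
        entries.append((int(rank), host, slots))
    return entries


# Contents of rf_8 as shown above.
RF_8 = """\
# mpiexec -report-bindings -rf rf_8 hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
"""

entries = parse_rankfile(RF_8)
hosts = sorted({host for _, host, _ in entries})
for rank, host, slots in entries:
    print("rank %d -> host %s, slots %s" % (rank, host, slots))
print("hosts named in rankfile:", ", ".join(hosts))
```

The list of hosts can then be checked by hand against the hosts given to mpiexec or found in the default hostfile.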


sunpc1 rankfiles 109 mpiexec -report-bindings -rf rf_6 hostname
[sunpc1:09779] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 110 mpiexec -report-bindings -rf rf_7 hostname
[sunpc1:09782] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 111 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc0
--------------------------------------------------------------------------



I get the following output if I use sunpc0 as the local host.

sunpc0 rankfiles 102 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

sunpc0 rankfiles 103 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1
--------------------------------------------------------------------------

sunpc0 rankfiles 104 mpiexec -report-bindings -rf rf_8 hostname
[sunpc0:19027] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)


I get the following output if I use tyr as the local host.

tyr rankfiles 218 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 219 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 220 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------



Do you have any idea why this happens? Thank you very much in
advance for any help.


Kind regards

Siegmar
