Re: [OMPI devel] Process placement

2016-05-07 Thread Ralph Castain
I believe this has been fixed. Note that the allocation display occurs prior to 
mapping, and thus the slots_inuse will be zero at that point. You’ll see those 
numbers change if you do a comm_spawn, but otherwise they will always be zero.
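
For anyone wanting to see that change, here is a minimal sketch (file and child
names are hypothetical) of a parent that calls MPI_Comm_spawn; run it under
mpirun with --display-allocation and, assuming the allocation is displayed again
when the spawned job is mapped, the second printout should then show non-zero
slots_inuse:

    /* spawn_once.c - hypothetical sketch: trigger a second mapping pass
     * via MPI_Comm_spawn so the allocation display reflects slots that
     * are already in use by the parent job. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Spawn two copies of a child executable ("./child" is a placeholder;
         * the child only needs to MPI_Init/MPI_Finalize). */
        MPI_Comm intercomm;
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        if (rank == 0)
            printf("child job spawned\n");

        MPI_Finalize();
        return 0;
    }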


> On May 5, 2016, at 8:37 PM, Ralph Castain  wrote:
> 
> Okay, I see it - will fix on Fri. This is unique to master.
> 
>> On May 5, 2016, at 1:54 PM, Aurélien Bouteiller wrote:
>> 
>> Ralph, 
>> 
>> I still observe these issues in the current master. (npernode is not 
>> respected either).
>> 
>> Also note that the display_allocation seems to be wrong (slots_inuse=0 when 
>> the slot is obviously in use). 
>> 
>> $ git show 
>> 4899c89 (HEAD -> master, origin/master, origin/HEAD) Fix a race condition 
>> when multiple threads try to create a bml en… (Bouteiller, 6 hours ago)
>> 
>> $ bin/mpirun -np 12 -hostfile /opt/etc/ib10g.machinefile.ompi 
>> -display-allocation -map-by node hostname 
>> 
>> ==   ALLOCATED NODES   ==
>>  dancer00: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer01: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer02: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer03: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer04: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer05: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer06: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer07: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer08: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer09: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer10: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer11: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer12: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer13: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer14: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>>  dancer15: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =
>> dancer01
>> dancer00
>> dancer01
>> dancer01
>> dancer01
>> dancer00
>> dancer00
>> dancer00
>> dancer00
>> dancer00
>> dancer00
>> dancer00
>> 
>> 
>> --
>> Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 
>> 
>>> On Apr 13, 2016, at 1:38 PM, Ralph Castain wrote:
>>> 
>>> The —map-by node option should now be fixed on master, and PRs waiting for 
>>> 1.10 and 2.0
>>> 
>>> Thx!
>>> 
 On Apr 12, 2016, at 6:45 PM, Ralph Castain wrote:
 
 FWIW: speaking just to the —map-by node issue, Josh Ladd reported the 
 problem on master as well yesterday. I’ll be looking into it on Wed.
 
> On Apr 12, 2016, at 5:53 PM, George Bosilca wrote:
> 
> 
> 
> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:
> George,
> 
> about the process binding part
> 
> On 4/13/2016 7:32 AM, George Bosilca wrote:
> Also my processes, despite the fact that I asked for 1 per node, are not 
> bound to the first core. Shouldn’t we release the process binding when we 
> know there is a single process per node (as in the above case)?
> Did you expect the tasks to be bound to the first *core* on each node?
> 
> I would expect the tasks to be bound to the first *socket* on each node.
> 
> In this particular instance, where it has been explicitly requested to 
> have a single process per node, I would have expected the process to be 
> unbound (we know there is only one per node). It is the responsibility of 
> the application to bind itself or its threads if necessary. Why are we 
> enforcing a particular binding policy?
> 
> Since we do not know how many (OpenMP or other) threads will be used by 
> the application, --bind-to socket is a good policy imho. In this case (one 
> task per node), no binding at all would mean the task can migrate from one 
> socket to the other, and/or OpenMP threads are bound across sockets. That 
> would trigger some NUMA effects (better bandwidth if memory is accessed 
> locally, but worse performance if memory is allocated on only one socket). 
> So imho, --bind-to socket is still my preferred policy, even if there is 
> only one MPI task per node.
> 
> Open MPI is about MPI ranks/processes. I don't think it is our job to try 
> to figure out what the user does with its own threads.
> 
> Your justification makes sense if the application only uses a single 
> socket. It also makes sense if one starts multiple ranks per node, and the 
> internal threads of each MPI process inherit the MPI process binding. 
> However, in the case where there is a single process per node, because 
> there is a mismatch between the number of resources available (hardware 
> threads) and the binding of the parent process, all the threads of the MPI 
> application are [by default] bound to a single socket.

Re: [OMPI devel] Process placement

2016-05-05 Thread Ralph Castain
Okay, I see it - will fix on Fri. This is unique to master.

> On May 5, 2016, at 1:54 PM, Aurélien Bouteiller  wrote:
> 
> Ralph, 
> 
> I still observe these issues in the current master. (npernode is not 
> respected either).
> 
> Also note that the display_allocation seems to be wrong (slots_inuse=0 when 
> the slot is obviously in use). 
> 
> $ git show 
> 4899c89 (HEAD -> master, origin/master, origin/HEAD) Fix a race condition 
> when multiple threads try to create a bml en… (Bouteiller, 6 hours ago)
> 
> $ bin/mpirun -np 12 -hostfile /opt/etc/ib10g.machinefile.ompi 
> -display-allocation -map-by node hostname 
> 
> ==   ALLOCATED NODES   ==
>   dancer00: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer01: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer02: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer03: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer04: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer05: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer06: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer07: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer08: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer09: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer10: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer11: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer12: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer13: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer14: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
>   dancer15: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =
> dancer01
> dancer00
> dancer01
> dancer01
> dancer01
> dancer00
> dancer00
> dancer00
> dancer00
> dancer00
> dancer00
> dancer00
> 
> 
> --
> Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 
> 
>> On Apr 13, 2016, at 1:38 PM, Ralph Castain wrote:
>> 
>> The —map-by node option should now be fixed on master, and PRs waiting for 
>> 1.10 and 2.0
>> 
>> Thx!
>> 
>>> On Apr 12, 2016, at 6:45 PM, Ralph Castain wrote:
>>> 
>>> FWIW: speaking just to the —map-by node issue, Josh Ladd reported the 
>>> problem on master as well yesterday. I’ll be looking into it on Wed.
>>> 
 On Apr 12, 2016, at 5:53 PM, George Bosilca wrote:
 
 
 
 On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:
 George,
 
 about the process binding part
 
 On 4/13/2016 7:32 AM, George Bosilca wrote:
 Also my processes, despite the fact that I asked for 1 per node, are not 
 bound to the first core. Shouldn’t we release the process binding when we 
 know there is a single process per node (as in the above case)?
 Did you expect the tasks to be bound to the first *core* on each node?
 
 I would expect the tasks to be bound to the first *socket* on each node.
 
 In this particular instance, where it has been explicitly requested to 
 have a single process per node, I would have expected the process to be 
 unbound (we know there is only one per node). It is the responsibility of 
 the application to bind itself or its threads if necessary. Why are we 
 enforcing a particular binding policy?
 
 Since we do not know how many (OpenMP or other) threads will be used by 
 the application, --bind-to socket is a good policy imho. In this case (one 
 task per node), no binding at all would mean the task can migrate from one 
 socket to the other, and/or OpenMP threads are bound across sockets. That 
 would trigger some NUMA effects (better bandwidth if memory is accessed 
 locally, but worse performance if memory is allocated on only one socket). 
 So imho, --bind-to socket is still my preferred policy, even if there is 
 only one MPI task per node.
 
 Open MPI is about MPI ranks/processes. I don't think it is our job to try 
 to figure out what the user does with its own threads.
 
 Your justification makes sense if the application only uses a single 
 socket. It also makes sense if one starts multiple ranks per node, and the 
 internal threads of each MPI process inherit the MPI process binding. 
 However, in the case where there is a single process per node, because 
 there is a mismatch between the number of resources available (hardware 
 threads) and the binding of the parent process, all the threads of the MPI 
 application are [by default] bound to a single socket.

Re: [OMPI devel] Process placement

2016-05-05 Thread Aurélien Bouteiller
Ralph, 

I still observe these issues in the current master. (npernode is not respected 
either).

Also note that the display_allocation seems to be wrong (slots_inuse=0 when the 
slot is obviously in use). 

$ git show 
4899c89 (HEAD -> master, origin/master, origin/HEAD) Fix a race condition when 
multiple threads try to create a bml en… (Bouteiller, 6 hours ago)

$ bin/mpirun -np 12 -hostfile /opt/etc/ib10g.machinefile.ompi 
-display-allocation -map-by node hostname 

==   ALLOCATED NODES   ==
dancer00: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=
dancer01
dancer00
dancer01
dancer01
dancer01
dancer00
dancer00
dancer00
dancer00
dancer00
dancer00
dancer00
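
For what it is worth, with --map-by node I would expect the 12 ranks to be laid
out round-robin, one per node on dancer00 through dancer11, rather than packed
onto dancer00/dancer01 as above. A minimal verification program (hypothetical
name, assuming Linux and sched_getaffinity) that can be run in place of hostname
to show both the node each rank landed on and the CPUs it is bound to:

    /* placement_check.c - hypothetical sketch: print rank, host, and the
     * CPU affinity mask of each process, to check mapping and binding. */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Collect the CPUs this process is allowed to run on. */
        cpu_set_t mask;
        char cpus[256] = "";
        int n = 0;
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
            for (int c = 0; c < CPU_SETSIZE && n < 240; c++)
                if (CPU_ISSET(c, &mask))
                    n += snprintf(cpus + n, sizeof(cpus) - (size_t)n, "%d,", c);
        }

        printf("rank %2d on %s cpus=[%s]\n", rank, host, cpus);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with the same mpirun line (optionally adding
--report-bindings), this makes the mapping and the binding visible in one shot.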


--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 

> On Apr 13, 2016, at 1:38 PM, Ralph Castain wrote:
> 
> The —map-by node option should now be fixed on master, and PRs waiting for 
> 1.10 and 2.0
> 
> Thx!
> 
>> On Apr 12, 2016, at 6:45 PM, Ralph Castain wrote:
>> 
>> FWIW: speaking just to the —map-by node issue, Josh Ladd reported the 
>> problem on master as well yesterday. I’ll be looking into it on Wed.
>> 
>>> On Apr 12, 2016, at 5:53 PM, George Bosilca wrote:
>>> 
>>> 
>>> 
>>> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:
>>> George,
>>> 
>>> about the process binding part
>>> 
>>> On 4/13/2016 7:32 AM, George Bosilca wrote:
>>> Also my processes, despite the fact that I asked for 1 per node, are not 
>>> bound to the first core. Shouldn’t we release the process binding when we 
>>> know there is a single process per node (as in the above case)?
>>> Did you expect the tasks to be bound to the first *core* on each node?
>>> 
>>> I would expect the tasks to be bound to the first *socket* on each node.
>>> 
>>> In this particular instance, where it has been explicitly requested to have 
>>> a single process per node, I would have expected the process to be unbound 
>>> (we know there is only one per node). It is the responsibility of the 
>>> application to bind itself or its threads if necessary. Why are we 
>>> enforcing a particular binding policy?
>>> 
>>> Since we do not know how many (OpenMP or other) threads will be used by 
>>> the application, --bind-to socket is a good policy imho. In this case (one 
>>> task per node), no binding at all would mean the task can migrate from one 
>>> socket to the other, and/or OpenMP threads are bound across sockets. That 
>>> would trigger some NUMA effects (better bandwidth if memory is accessed 
>>> locally, but worse performance if memory is allocated on only one socket). 
>>> So imho, --bind-to socket is still my preferred policy, even if there is 
>>> only one MPI task per node.
>>> 
>>> Open MPI is about MPI ranks/processes. I don't think it is our job to try 
>>> to figure out what the user does with its own threads.
>>> 
>>> Your justification makes sense if the application only uses a single socket. 
>>> It also makes sense if one starts multiple ranks per node, and the internal 
>>> threads of each MPI process inherit the MPI process binding. However, in 
>>> the case where there is a single process per node, because there is a 
>>> mismatch between the number of resources available (hardware threads) and 
>>> the binding of the parent process, all the threads of the MPI application 
>>> are [by default] bound to a single socket.
>>> 
>>>  George.
>>> 
>>> PS: That being said I think I'll need to implement the binding code anyway 
>>> in order to deal with the wide variety of behaviors in the different MPI 
>>> implementations.
>>> 
>>>  
>>> 
>>> Cheers,
>>

Re: [OMPI devel] Process placement

2016-04-13 Thread Ralph Castain
The —map-by node option should now be fixed on master, and PRs waiting for 1.10 
and 2.0

Thx!

> On Apr 12, 2016, at 6:45 PM, Ralph Castain  wrote:
> 
> FWIW: speaking just to the —map-by node issue, Josh Ladd reported the problem 
> on master as well yesterday. I’ll be looking into it on Wed.
> 
>> On Apr 12, 2016, at 5:53 PM, George Bosilca wrote:
>> 
>> 
>> 
>> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:
>> George,
>> 
>> about the process binding part
>> 
>> On 4/13/2016 7:32 AM, George Bosilca wrote:
>> Also my processes, despite the fact that I asked for 1 per node, are not 
>> bound to the first core. Shouldn’t we release the process binding when we 
>> know there is a single process per node (as in the above case)?
>> Did you expect the tasks to be bound to the first *core* on each node?
>> 
>> I would expect the tasks to be bound to the first *socket* on each node.
>> 
>> In this particular instance, where it has been explicitly requested to have 
>> a single process per node, I would have expected the process to be unbound 
>> (we know there is only one per node). It is the responsibility of the 
>> application to bind itself or its threads if necessary. Why are we enforcing 
>> a particular binding policy?
>> 
>> Since we do not know how many (OpenMP or other) threads will be used by the 
>> application, --bind-to socket is a good policy imho. In this case (one task 
>> per node), no binding at all would mean the task can migrate from one socket 
>> to the other, and/or OpenMP threads are bound across sockets. That would 
>> trigger some NUMA effects (better bandwidth if memory is accessed locally, 
>> but worse performance if memory is allocated on only one socket). So imho, 
>> --bind-to socket is still my preferred policy, even if there is only one MPI 
>> task per node.
>> 
>> Open MPI is about MPI ranks/processes. I don't think it is our job to try to 
>> figure out what the user does with its own threads.
>> 
>> Your justification makes sense if the application only uses a single socket. 
>> It also makes sense if one starts multiple ranks per node, and the internal 
>> threads of each MPI process inherit the MPI process binding. However, in the 
>> case where there is a single process per node, because there is a mismatch 
>> between the number of resources available (hardware threads) and the binding 
>> of the parent process, all the threads of the MPI application are [by 
>> default] bound to a single socket.
>> 
>>  George.
>> 
>> PS: That being said I think I'll need to implement the binding code anyway 
>> in order to deal with the wide variety of behaviors in the different MPI 
>> implementations.
>> 
>>  
>> 
>> Cheers,
>> 
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2016/04/18758.php 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2016/04/18759.php
> 



Re: [OMPI devel] Process placement

2016-04-12 Thread Ralph Castain
FWIW: speaking just to the —map-by node issue, Josh Ladd reported the problem 
on master as well yesterday. I’ll be looking into it on Wed.

> On Apr 12, 2016, at 5:53 PM, George Bosilca  wrote:
> 
> 
> 
> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:
> George,
> 
> about the process binding part
> 
> On 4/13/2016 7:32 AM, George Bosilca wrote:
> Also my processes, despite the fact that I asked for 1 per node, are not 
> bound to the first core. Shouldn’t we release the process binding when we 
> know there is a single process per node (as in the above case)?
> Did you expect the tasks to be bound to the first *core* on each node?
> 
> I would expect the tasks to be bound to the first *socket* on each node.
> 
> In this particular instance, where it has been explicitly requested to have a 
> single process per node, I would have expected the process to be unbound (we 
> know there is only one per node). It is the responsibility of the application 
> to bind itself or its threads if necessary. Why are we enforcing a particular 
> binding policy?
> 
> Since we do not know how many (OpenMP or other) threads will be used by the 
> application, --bind-to socket is a good policy imho. In this case (one task 
> per node), no binding at all would mean the task can migrate from one socket 
> to the other, and/or OpenMP threads are bound across sockets. That would 
> trigger some NUMA effects (better bandwidth if memory is accessed locally, 
> but worse performance if memory is allocated on only one socket). So imho, 
> --bind-to socket is still my preferred policy, even if there is only one MPI 
> task per node.
> 
> Open MPI is about MPI ranks/processes. I don't think it is our job to try to 
> figure out what the user does with its own threads.
> 
> Your justification makes sense if the application only uses a single socket. 
> It also makes sense if one starts multiple ranks per node, and the internal 
> threads of each MPI process inherit the MPI process binding. However, in the 
> case where there is a single process per node, because there is a mismatch 
> between the number of resources available (hardware threads) and the binding 
> of the parent process, all the threads of the MPI application are [by 
> default] bound to a single socket.
> 
>  George.
> 
> PS: That being said I think I'll need to implement the binding code anyway in 
> order to deal with the wide variety of behaviors in the different MPI 
> implementations.
> 
>  
> 
> Cheers,
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18758.php 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18759.php



Re: [OMPI devel] Process placement

2016-04-12 Thread George Bosilca
On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet wrote:

> George,
>
> about the process binding part
>
> On 4/13/2016 7:32 AM, George Bosilca wrote:
>
>> Also my processes, despite the fact that I asked for 1 per node, are not
>> bound to the first core. Shouldn’t we release the process binding when we
>> know there is a single process per node (as in the above case)?
>>
> Did you expect the tasks to be bound to the first *core* on each node?
>
> I would expect the tasks to be bound to the first *socket* on each node.
>

In this particular instance, where it has been explicitly requested to have
a single process per node, I would have expected the process to be unbound
(we know there is only one per node). It is the responsibility of the
application to bind itself or its threads if necessary. Why are we
enforcing a particular binding policy?

> Since we do not know how many (OpenMP or other) threads will be used by
> the application, --bind-to socket is a good policy imho. In this case (one
> task per node), no binding at all would mean the task can migrate from one
> socket to the other, and/or OpenMP threads are bound across sockets.
> That would trigger some NUMA effects (better bandwidth if memory is
> accessed locally, but worse performance if memory is allocated on only one
> socket).
> So imho, --bind-to socket is still my preferred policy, even if there is
> only one MPI task per node.
>

Open MPI is about MPI ranks/processes. I don't think it is our job to try
to figure out what the user does with its own threads.

Your justification makes sense if the application only uses a single socket.
It also makes sense if one starts multiple ranks per node, and the internal
threads of each MPI process inherit the MPI process binding. However, in
the case where there is a single process per node, because there is a
mismatch between the number of resources available (hardware threads) and
the binding of the parent process, all the threads of the MPI application
are [by default] bound to a single socket.

 George.

PS: That being said I think I'll need to implement the binding code anyway
in order to deal with the wide variety of behaviors in the different MPI
implementations.
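
As a rough illustration of what that could look like, a minimal sketch
(hypothetical, assuming Linux and the GNU pthread affinity extensions) of an
application pinning its own worker threads to consecutive cores instead of
relying on the launcher's binding policy:

    /* self_bind.c - hypothetical sketch: the application binds its own
     * threads, one per core, regardless of the launcher's policy. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        printf("thread %ld running on cpu %d\n", (long)arg, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        for (long i = 0; i < NTHREADS; i++) {
            /* Build an affinity mask holding a single core for thread i. */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)(i % ncpus), &set);

            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&tid[i], &attr, worker, (void *)i);
            pthread_attr_destroy(&attr);
        }
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

(Whether a process can escape the mask it inherited from mpirun depends on how
the binding was applied, e.g. cgroup/cpuset limits are hard while a plain
affinity mask is not, so this is only a sketch of the mechanism.)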



>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18758.php


Re: [OMPI devel] Process placement

2016-04-12 Thread Gilles Gouaillardet

George,

about the process binding part

On 4/13/2016 7:32 AM, George Bosilca wrote:

Also my processes, despite the fact that I asked for 1 per node, are not bound 
to the first core. Shouldn’t we release the process binding when we know there 
is a single process per node (as in the above case)?

Did you expect the tasks to be bound to the first *core* on each node?

I would expect the tasks to be bound to the first *socket* on each node.

Since we do not know how many (OpenMP or other) threads will be used by 
the application, --bind-to socket is a good policy imho. In this case (one 
task per node), no binding at all would mean the task can migrate from one 
socket to the other, and/or OpenMP threads are bound across sockets. That 
would trigger some NUMA effects (better bandwidth if memory is accessed 
locally, but worse performance if memory is allocated on only one socket). 
So imho, --bind-to socket is still my preferred policy, even if there is 
only one MPI task per node.
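
To make that effect visible, a minimal sketch (hypothetical file name, assuming
Linux and an OpenMP-capable compiler) that reports where each OpenMP thread ends
up; running it once under mpirun with --bind-to socket --report-bindings and
once with --bind-to none should show the threads confined to one socket in the
first case and free to spread (or migrate) in the second:

    /* omp_binding_check.c - hypothetical sketch: report the CPU each
     * OpenMP thread is currently running on, under the binding
     * inherited from the launcher. */
    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            printf("OpenMP thread %d of %d on cpu %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   sched_getcpu());
        }
        return 0;
    }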


Cheers,

Gilles