Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Nadia Derbey
On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> Ralph Castain wrote: 
> > Okay, just wanted to ensure everyone was working from the same base
> > code. 
> > 
> > 
> > Terry, Brad: you might want to look this proposed change over.
> > Something doesn't quite look right to me, but I haven't really
> > walked through the code to check it.
> > 
> > 
> At first blush I don't really get the usage of orte_odls_globals.bound
> in your patch.  It would seem to me that the insertion of that
> conditional would prevent the check it surrounds from being done when the
> process has not been bound prior to startup, which is a common case.

Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
(odls_default_fork_local_proc() in odls_default_module.c), around line
715 it does roughly:

   OPAL_PAFFINITY_CPU_ZERO(mask);
   for (n=0; n < orte_default_num_cores_per_socket; n++) {
       [compute phys_cpu]
       if (orte_odls_globals.bound) {
           if ([phys_cpu is not in our external binding]) {
               continue;
           }
       }
       OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
   }

...


What I'm saying is that the only way to have nothing set in the affinity
mask (which would justify the last test) is to have never called the
OPAL_PAFFINITY_CPU_SET() instruction. This means:
  . the test on orte_odls_globals.bound is true
  . we hit the continue path for all the cores in the socket.

In the other path, what we are doing is checking if we have set one or
more bits in a mask after having actually set them: don't you think it's
useless?

That's why I'm suggesting to call the last check only if
orte_odls_globals.bound is true.

Regards,
Nadia
> 
> --td
> 
> 
> > 
> > On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> > 
> > > Nadia Derbey wrote: 
> > > > On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> > > >   
> > > > > Just to check: is this with the latest trunk? Brad and Terry have 
> > > > > been making changes to this section of code, including modifying the 
> > > > > PROCESS_IS_BOUND test...
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > Well, it was on the v1.5. But I just checked: looks like
> > > >   1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
> > > >  odls_default_fork_local_proc()
> > > >   2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
> > > > 
> > > > But, I'll give it a try with the latest trunk.
> > > > 
> > > > Regards,
> > > > Nadia
> > > > 
> > > >   
> > > The changes I've done do not touch
> > > OPAL_PAFFINITY_PROCESS_IS_BOUND at all.  Also, I am only touching
> > > code related to the "bind-to-core" option, so I really doubt that my
> > > changes are causing issues here.
> > > 
> > > --td
> > > > > On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> > > > > 
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I am facing a problem with a test that runs fine on some nodes, and
> > > > > > fails on others.
> > > > > > 
> > > > > > I have a heterogeneous cluster, with 3 types of nodes:
> > > > > > 1) single socket, 4 cores
> > > > > > 2) 2 sockets, 4 cores per socket
> > > > > > 3) 2 sockets, 6 cores per socket
> > > > > > 
> > > > > > I am using:
> > > > > > . salloc to allocate the nodes,
> > > > > > . mpirun binding/mapping options "-bind-to-socket -bysocket"
> > > > > > 
> > > > > > # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> > > > > > 
> > > > > > This command fails if the allocated node is of type #1 (single
> > > > > > socket, 4 cpus), while it succeeds on nodes of type #2 or #3.
> > > > > > BTW, in that case orte_show_help is referencing a tag
> > > > > > ("could-not-bind-to-socket") that does not exist in
> > > > > > help-odls-default.txt.
> > > > > > 
> > > > > > I think a "bind to socket" should not return an error on a
> > > > > > single-socket machine, but rather be a noop.
> > > > > > 
> > > > > > The problem comes from the test
> > > > > > OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > > > > > called in odls_default_fork_local_proc() after the binding to the
> > > > > > processor's socket has been done:
> > > > > > 
> > > > > >    OPAL_PAFFINITY_CPU_ZERO(mask);
> > > > > >    for (n=0; n < orte_default_num_cores_per_socket; n++) {
> > > > > >        ...
> > > > > >        OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> > > > > >    }
> > > > > >    /* if we did not bind it anywhere, then that is an error */
> > > > > >    OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > > > > >    if (!bound) {
> > > > > >        orte_show_help("help-odls-default.txt",
> > > > > >                       "odls-default:could-not-bind-to-socket", true);
> > > > > >        ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> > > > > >    }
> > > > > > 
> > > > > > OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are
> > > > > > bits set in the mask *AND* the number of bits set is less than
> > > > > > the number of cpus on the machine. Thus on a single-socket,
> > > > > > 4-core machine the test will fail, while on the other kinds of
> > > > > > machines it will succeed.
> > > > > > 
> > > > > > Again, I think the problem could be solved by changing the
> > > > > > algorithm, and assuming that O

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Ralph Castain
Guess I'll jump in here as I finally had a few minutes to look at the code and 
think about your original note. In fact, I believe your original statement is 
the source of contention.

If someone tells us -bind-to-socket, but there is only one socket, then we 
really cannot bind them to anything. Any check by their code would reveal that 
they had not, in fact, been bound - raising questions as to whether or not OMPI 
is performing the request. Our operating standard has been to error out if the 
user specifies something we cannot do, to avoid that kind of confusion. This is 
what generated the code in the system today.

Now I can see an argument that -bind-to-socket with one socket maybe shouldn't 
generate an error, but that decision then has to get reflected in other code 
areas as well.

As for the test you cite -  it actually performs a valuable function and was 
added to catch specific scenarios. In particular, if you follow the code flow 
up just a little, you will see that it is possible to complete the loop without 
ever actually setting a bit in the mask. This happens when none of the cpus in 
that socket have been assigned to us via an external bind. People actually use 
that as a means of suballocating nodes, so the test needs to be there. Again, 
if the user said "bind to socket", but none of that socket's cores are assigned 
for our use, that is an error.

I haven't looked at your specific fix, but I agree with Terry's question. It 
seems to me that whether or not we were externally bound is irrelevant. Even if 
the overall result is what you want, I think a more logically understandable 
test would help others reading the code.

But first we need to resolve the question: should this scenario return an error 
or not?



Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Terry Dontje
Ralph, I guess I am curious why it is that if there is only one socket 
we cannot bind to it?  Does PLPA actually error on this, or is this a 
condition we decided was an error at the odls level?


I am somewhat torn on whether this makes sense.  On the one hand, 
allowing it is definitely useless as far as the result goes.  On the 
other hand, if you don't allow it and you have a script running tests 
on multiple systems, it would be nice to have this run, because you are 
not really running into a resource-starvation issue.


At a minimum I think the error condition/message needs to be spelled out 
(defined).  As to whether we allow binding when only one socket exists, 
I could go either way, slightly leaning towards allowing such a 
specification to work.


--td



Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Eugene Loh

Ralph Castain wrote:

> If someone tells us -bind-to-socket, but there is only one socket, then we
> really cannot bind them to anything. Any check by their code would reveal that
> they had not, in fact, been bound - raising questions as to whether or not OMPI
> is performing the request. Our operating standard has been to error out if the
> user specifies something we cannot do to avoid that kind of confusion. This is
> what generated the code in the system today.
>
> Now I can see an argument that -bind-to-socket with one socket maybe shouldn't
> generate an error, but that decision then has to get reflected in other code
> areas as well.
>
> But first we need to resolve the question: should this scenario return an error
> or not?

From the onset of the -bind-to-X functionality, -bind-to-board -byboard 
for a single-board system would result in binding to everything|nothing 
-- is the glass completely full or completely empty?  In any case, no error.


Consider a single-board, two-socket, quad-core node and these command 
lines with 1.3.4r22104:


% mpirun -H mynode -n 1 -bycore   -bind-to-core   -report-bindings ./a.out
% mpirun -H mynode -n 1 -bysocket -bind-to-socket -report-bindings ./a.out
% mpirun -H mynode -n 1 -byboard  -bind-to-board  -report-bindings ./a.out

The first binds to "cpus 0001", the second to "socket 0 cpus 000f", and
the third reports no bindings ("bind to everything") and no errors.


Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Nadia Derbey
On Mon, 2010-04-12 at 07:50 -0600, Ralph Castain wrote:
> Guess I'll jump in here as I finally had a few minutes to look at the code 
> and think about your original note. In fact, I believe your original 
> statement is the source of contention.
> 
> If someone tells us -bind-to-socket, but there is only one socket, then we 
> really cannot bind them to anything. Any check by their code would reveal 
> that they had not, in fact, been bound - raising questions as to whether or 
> not OMPI is performing the request. Our operating standard has been to error 
> out if the user specifies something we cannot do to avoid that kind of 
> confusion. This is what generated the code in the system today.
> 
> Now I can see an argument that -bind-to-socket with one socket maybe 
> shouldn't generate an error, but that decision then has to get reflected in 
> other code areas as well.

Actually, that was my original point: -bind-to-socket on a single socket
should IMHO be a noop, not an error.

> 
> As for the test you cite -  it actually performs a valuable function and was 
> added to catch specific scenarios. In particular, if you follow the code flow 
> up just a little, you will see that it is possible to complete the loop 
> without ever actually setting a bit in the mask. This happens when none of 
> the cpus in that socket have been assigned to us via an external bind.

This is exactly what I said, but maybe I didn't express it right:
  . the test on orte_odls_globals.bound is true
---> means we have an external bind
  . we hit the continue path for all the cores in the socket
---> none of the cpus on the socket belongs to our binding

Regards,


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Oliver Geisler


Quoting Ashley Pittman :

> On 10 Apr 2010, at 04:51, Eugene Loh wrote:
>
>> Why is shared-memory performance about four orders of magnitude
>> slower than it should be?  The processes are communicating via
>> memory that's shared by having the processes all mmap the same file
>> into their address spaces.  Is it possible that with the newer
>> kernels, operations to that shared file are going all the way out
>> to disk?  Maybe you don't know the answer, but hopefully someone on
>> this mail list can provide some insight.
>
> Is the /tmp filesystem on NFS by any chance?

Yes, /tmp is on NFS .. those are diskless nodes, all without disks and
no swap space mounted.

Maybe I should set up one of the nodes with a disk, so I could try the
difference.

(Sorry, but I may return results next week, since I am out of office
right now)

Thanks
oli

> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


This message was sent using IMP, the Internet Messaging Program.



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Ralph Castain
In that scenario, you need to set the session directories to point somewhere 
other than /tmp. I believe you will find this in our FAQs, as it has been a 
recurring problem. The shared memory backing file resides in the session 
directory tree, so if that is NFS-mounted, your performance will stink.

People with that setup generally point the session dir at a ramdisk area, but 
anywhere in ram will do.





Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Ralph Castain
By definition, if you bind to all available cpus in the OS, you are bound to 
nothing (i.e., "unbound") as your process runs on any available cpu.

PLPA doesn't care, and I personally don't care. I was just explaining why it 
generates an error in the odls.

A user app would detect its binding by (a) getting the affinity mask from the 
OS, and then (b) seeing if the bits are set to '1' for all available 
processors. If they are, then you are not bound - there is no mechanism available 
for checking "are the bits set only for the processors I asked to be bound to". 
The OS doesn't track what you asked for, it only tracks where you are bound - 
and a mask with all '1's is defined as "unbound".

So the reason for my question was simple: a user asked us to "bind" their 
process. If their process checks to see if it is bound, it will return "no". 
The user would therefore be led to believe that OMPI had failed to execute 
their request, when in fact we did execute it - but the result was (as Nadia 
says) a "no-op".

After talking with Jeff, I think he has the right answer. It is a method we 
have used elsewhere, so it isn't unexpected behavior. Basically, he proposed 
that we use an mca param to control this behavior:

* default: generate an error message as the "bind" results in a no-op, and this 
is our current behavior

* warn: generate a warning that the binding wound up being a "no-op", but 
continue working

* quiet: just ignore it and keep going

Fairly trivial to implement, and Bull could set the param to "quiet" in the 
default mca param file to get what they want. I'm not sure if that's what the 
community wants or not - like I said, it makes no diff to me so long as the 
code logic is understandable.


On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:

> Ralph, I guess I am curious why is it that if there is only one socket we 
> cannot bind to it?  Does plpa actually error on this or is this a condition 
> we decided was an error at odls?
> 
> I am somewhat torn on whether this makes sense.  On the one hand it is 
> definitely useless as to the result if you allow it.  However if you don't 
> allow it and you have a script or running tests on multiple systems it would 
> be nice to have this run because you are not really running into a resource 
> starvation issue.
> 
> At a minimum I think the error condition/message needs to be spelled out 
> (defined).As to whether we allow binding when only one socket exist I 
> could go either way slightly leaning towards allowing such a specification to 
> work.
> 
> --td
> 
> 
> Ralph Castain wrote:
>> 
>> Guess I'll jump in here as I finally had a few minutes to look at the code 
>> and think about your original note. In fact, I believe your original 
>> statement is the source of contention.
>> 
>> If someone tells us -bind-to-socket, but there is only one socket, then we 
>> really cannot bind them to anything. Any check by their code would reveal 
>> that they had not, in fact, been bound - raising questions as to whether or 
>> not OMPI is performing the request. Our operating standard has been to error 
>> out if the user specifies something we cannot do to avoid that kind of 
>> confusion. This is what generated the code in the system today.
>> 
>> Now I can see an argument that -bind-to-socket with one socket maybe 
>> shouldn't generate an error, but that decision then has to get reflected in 
>> other code areas as well.
>> 
>> As for the test you cite -  it actually performs a valuable function and was 
>> added to catch specific scenarios. In particular, if you follow the code 
>> flow up just a little, you will see that it is possible to complete the loop 
>> without ever actually setting a bit in the mask. This happens when none of 
>> the cpus in that socket have been assigned to us via an external bind. 
>> People actually use that as a means of suballocating nodes, so the test 
>> needs to be there. Again, if the user said "bind to socket", but none of 
>> that socket's cores are assigned for our use, that is an error.
>> 
>> I haven't looked at your specific fix, but I agree with Terry's question. It 
>> seems to me that whether or not we were externally bound is irrelevant. Even 
>> if the overall result is what you want, I think a more logically 
>> understandable test would help others reading the code.
>> 
>> But first we need to resolve the question: should this scenario return an 
>> error or not?
>> 
>> 
>> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
>> 
>>   
>>> On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
>>> 
 Ralph Castain wrote: 
   
> Okay, just wanted to ensure everyone was working from the same base
> code. 
> 
> 
> Terry, Brad: you might want to look this proposed change over.
> Something doesn't quite look right to me, but I haven't really
> walked through the code to check it.
> 
> 
> 
 At first blush I don't really get the usage of orte_odls_globals.b

Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Ralph Castain

On Apr 12, 2010, at 8:42 AM, Eugene Loh wrote:

> Ralph Castain wrote:
> 
>> If someone tells us -bind-to-socket, but there is only one socket, then we 
>> really cannot bind them to anything. Any check by their code would reveal 
>> that they had not, in fact, been bound - raising questions as to whether or 
>> not OMPI is performing the request. Our operating standard has been to error 
>> out if the user specifies something we cannot do to avoid that kind of 
>> confusion. This is what generated the code in the system today.
>> 
>> Now I can see an argument that -bind-to-socket with one socket maybe 
>> shouldn't generate an error, but that decision then has to get reflected in 
>> other code areas as well.
>> 
>> But first we need to resolve the question: should this scenario return an 
>> error or not?
>> 
> From the onset of the -bind-to-X functionality, -bind-to-board -byboard for a 
> single-board system would result in binding to everything|nothing -- is the 
> glass completely full or completely empty?  In any case, no error.

Only because we haven't really implemented bind-to-board yet - once we do 
(should that happen), then it would indeed generate an error.





[OMPI devel] bind-to-board [was: problem when binding to socket on a single socket node]

2010-04-12 Thread Eugene Loh

Ralph Castain wrote:

> Only because we haven't really implemented bind-to-board yet

Well, we have implemented it.  It's accepted by "mpirun" and listed by 
"mpirun --help".  So, there's a bug.  Shall I file a trac ticket?



- once we do (should that happen), then it would indeed generate an error.
 


Consider a single-board, two-socket, quad-core node and these command lines 
with 1.3.4r22104:

% mpirun -H mynode -n 1 -bycore   -bind-to-core   -report-bindings ./a.out
% mpirun -H mynode -n 1 -bysocket -bind-to-socket -report-bindings ./a.out
% mpirun -H mynode -n 1 -byboard  -bind-to-board  -report-bindings ./a.out

The first binds to "cpus 0001", the second to "socket 0 cpus 000f", and the third reports 
no bindings ("bind to everything") and no errors.





Re: [OMPI devel] bind-to-board [was: problem when binding to socket on a single socket node]

2010-04-12 Thread Ralph Castain

Get a life :-)

On Apr 12, 2010, at 11:56 AM, Eugene Loh wrote:

> Ralph Castain wrote:
> 
>> On Apr 12, 2010, at 8:42 AM, Eugene Loh wrote:
>> 
>>> Ralph Castain wrote:
>>> 
>>>> If someone tells us -bind-to-socket, but there is only one socket, then we 
>>>> really cannot bind them to anything. Any check by their code would reveal 
>>>> that they had not, in fact, been bound - raising questions as to whether 
>>>> or not OMPI is performing the request. Our operating standard has been to 
>>>> error out if the user specifies something we cannot do to avoid that kind 
>>>> of confusion. This is what generated the code in the system today.
>>>> 
>>>> Now I can see an argument that -bind-to-socket with one socket maybe 
>>>> shouldn't generate an error, but that decision then has to get reflected 
>>>> in other code areas as well.
>>>> 
>>>> But first we need to resolve the question: should this scenario return an 
>>>> error or not?
>>> 
>>> From the onset of the -bind-to-X functionality, -bind-to-board -byboard for 
>>> a single-board system would result in binding to everything|nothing -- is 
>>> the glass completely full or completely empty?  In any case, no error.
>>>   
>> Only because we haven't really implemented bind-to-board yet
>> 
> Well, we have implemented it.  It's accepted by "mpirun" and listed by 
> "mpirun --help".  So, there's a bug.  Shall I file a trac ticket?
> 
>> - once we do (should that happen), then it would indeed generate an error.
>> 
>>> Consider a single-board, two-socket, quad-core node and these command lines 
>>> with 1.3.4r22104:
>>> 
>>> % mpirun -H mynode -n 1 -bycore   -bind-to-core   -report-bindings ./a.out
>>> % mpirun -H mynode -n 1 -bysocket -bind-to-socket -report-bindings ./a.out
>>> % mpirun -H mynode -n 1 -byboard  -bind-to-board  -report-bindings ./a.out
>>> 
>>> The first binds to "cpus 0001", the second to "socket 0 cpus 000f", and the 
>>> third reports no bindings ("bind to everything") and no errors.
>>> 
> 




Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Eugene Loh

Ralph Castain wrote:


If someone tells us -bind-to-socket, but there is only one socket, then we 
really cannot bind them to anything. Any check by their code would reveal that 
they had not, in fact, been bound - raising questions as to whether or not OMPI 
is performing the request. Our operating standard has been to error out if the 
user specifies something we cannot do to avoid that kind of confusion. This is 
what generated the code in the system today.

Now I can see an argument that -bind-to-socket with one socket maybe shouldn't 
generate an error, but that decision then has to get reflected in other code 
areas as well.

But first we need to resolve the question: should this scenario return an error 
or not?
 

Okay, so my bind-to-board example didn't pass muster.  How about this 
one?  This is a node with 8 cores: 0-7:


% mpirun -H mynode -n 1 -slot-list 0-7 -report-bindings hostname
[mynode:27978] [[17644,0],0] odls:default:fork binding child [[17644,1],0] to 
slot_list 0-7
mynode

I bind to all cores.  mpirun does not complain.  Indeed, it reports that 
I'm bound to all cores.
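[Editor's note: the recurring point that "any check by their code would reveal that they had not been bound" can be illustrated with a minimal sketch, independent of OMPI. This is Linux-only — os.sched_getaffinity is not available on macOS or Windows:

```python
import os

# A process can inspect its own CPU affinity mask to see whether the
# launcher actually bound it to a subset of the machine's cpus.
def binding_report():
    allowed = sorted(os.sched_getaffinity(0))  # cpus this process may run on
    total = os.cpu_count()
    return allowed, len(allowed) < total       # strict subset => "bound"

mask, is_bound = binding_report()
print("allowed cpus:", mask, "| bound to a subset:", is_bound)
```

Being "bound" to all cores, as in the slot-list 0-7 example above, is indistinguishable from not being bound at all by this test.]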


Re: [OMPI devel] problem when binding to socket on a single socket node

2010-04-12 Thread Ralph Castain
Let me put this succinctly - I DO NOT CARE!

I wrote this stuff, warning you folks from Sun in particular that you were 
opening a can of worms. As I said then, I'll do it once, but the vast range of 
corner cases will make this a nightmare that I will NOT continue to chase.

Welcome to YOUR nightmare. :-)


On Apr 12, 2010, at 12:11 PM, Eugene Loh wrote:

> Ralph Castain wrote:
> 
>> If someone tells us -bind-to-socket, but there is only one socket, then we 
>> really cannot bind them to anything. Any check by their code would reveal 
>> that they had not, in fact, been bound - raising questions as to whether or 
>> not OMPI is performing the request. Our operating standard has been to error 
>> out if the user specifies something we cannot do to avoid that kind of 
>> confusion. This is what generated the code in the system today.
>> 
>> Now I can see an argument that -bind-to-socket with one socket maybe 
>> shouldn't generate an error, but that decision then has to get reflected in 
>> other code areas as well.
>> 
>> But first we need to resolve the question: should this scenario return an 
>> error or not?
>> 
> Okay, so my bind-to-board example didn't pass muster.  How about this one?  
> This is a node with 8 cores: 0-7:
> 
> % mpirun -H mynode -n 1 -slot-list 0-7 -report-bindings hostname
> [mynode:27978] [[17644,0],0] odls:default:fork binding child [[17644,1],0] to 
> slot_list 0-7
> mynode
> 
> I bind to all cores.  mpirun does not complain.  Indeed, it reports that I'm 
> bound to all cores.




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Jeff Squyres
On Apr 12, 2010, at 11:10 AM, Oliver Geisler wrote:

> > Is the /tmp filesystem on NFS by any chance?
> 
> Yes, /tmp is on NFS -- those are diskless nodes, with no local disks and 
> no swap space mounted.

Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So if 
/tmp is NFS, you could get extremely high latencies because of dirty page 
writes out through NFS.

You don't necessarily have to make /tmp disk-backed -- if you just make OMPI's 
session directories go into a ramdisk instead of onto NFS, that should also be 
sufficient.
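[Editor's note: a sketch of how the session directory can be redirected. The MCA parameter orte_tmpdir_base was the knob for this in the OMPI 1.x series; verify the exact name for your version with ompi_info before relying on it:

```shell
# Point Open MPI's session directory (which holds the shared-memory
# backing files) at a tmpfs ramdisk instead of the NFS-mounted /tmp:
mpirun --mca orte_tmpdir_base /dev/shm -n 8 ./a.out

# Equivalently, set it once in the environment for all subsequent runs:
export OMPI_MCA_orte_tmpdir_base=/dev/shm
mpirun -n 8 ./a.out
```
]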

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/