Ralph, I guess I am curious why is it that if there is only one socket we cannot bind to it? Does plpa actually error on this or is this a condition we decided was an error at odls?

I am somewhat torn on whether this makes sense. On the one hand it is definitely useless as to the result if you allow it. However if you don't allow it and you have a script or running tests on multiple systems it would be nice to have this run because you are not really running into a resource starvation issue.

At a minimum I think the error condition/message needs to be spelled out (defined). As to whether we allow binding when only one socket exist I could go either way slightly leaning towards allowing such a specification to work.

--td


Ralph Castain wrote:
Guess I'll jump in here as I finally had a few minutes to look at the code and 
think about your original note. In fact, I believe your original statement is 
the source of contention.

If someone tells us -bind-to-socket, but there is only one socket, then we 
really cannot bind them to anything. Any check by their code would reveal that 
they had not, in fact, been bound - raising questions as to whether or not OMPI 
is performing the request. Our operating standard has been to error out if the 
user specifies something we cannot do to avoid that kind of confusion. This is 
what generated the code in the system today.

Now I can see an argument that -bind-to-socket with one socket maybe shouldn't 
generate an error, but that decision then has to get reflected in other code 
areas as well.

As for the test you cite -  it actually performs a valuable function and was added to 
catch specific scenarios. In particular, if you follow the code flow up just a little, 
you will see that it is possible to complete the loop without ever actually setting a bit 
in the mask. This happens when none of the cpus in that socket have been assigned to us 
via an external bind. People actually use that as a means of suballocating nodes, so the 
test needs to be there. Again, if the user said "bind to socket", but none of 
that socket's cores are assigned for our use, that is an error.

I haven't looked at your specific fix, but I agree with Terry's question. It 
seems to me that whether or not we were externally bound is irrelevant. Even if 
the overall result is what you want, I think a more logically understandable 
test would help others reading the code.

But first we need to resolve the question: should this scenario return an error 
or not?


On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:

On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
Ralph Castain wrote:
Okay, just wanted to ensure everyone was working from the same base
code.

Terry, Brad: you might want to look this proposed change over.
Something doesn't quite look right to me, but I haven't really
walked through the code to check it.


At first blush I don't really get the usage of orte_odls_globals.bound
in you patch.  It would seem to me that the insertion of that
conditional would prevent the check it surrounds being done when the
process has not been bounded prior to startup which is a common case.
Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
(odls_default_fork_local_proc() in odls_default_module.c):

<set target_socket depending on the desired mapping>
<set my paffinity mask to 0>       (line 715)
<for each core in the socket> {
   <get the associated phys_core>
   <get the associated phys_cpu>
   <if we are bound (orte_odls_globals.bound)> {
       <if phys_cpu does not belong to the cpus I'm bound to>
           continue
   }
   <set phys-cpu bit in my affinity mask>
}
<check if something is set in my affinity mask>
...


What I'm saying is that the only way to have nothing set in the affinity
mask (which would justify the last test) is to have never called the
<set phys_cpu in my affinity mask> instruction. This means:
 . the test on orte_odls_globals.bound is true
 . call <continue> for all the cores in the socket.

In the other path, what we are doing is checking if we have set one or
more bits in a mask after having actually set them: don't you think it's
useless?

That's why I'm suggesting to call the last check only if
orte_odls_globals.bound is true.

Regards,
Nadia
--td


On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:

Nadia Derbey wrote:
On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:

Just to check: is this with the latest trunk? Brad and Terry have been making 
changes to this section of code, including modifying the PROCESS_IS_BOUND 
test...



Well, it was on the v1.5. But I just checked: looks like
 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
    odls_default_fork_local_proc()
 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But, I'll give it a try with the latest trunk.

Regards,
Nadia


The changes, I've done do not touch
OPAL_PAFFINITY_PROCESS_IS_BOUND at all.  Also, I am only touching
code related to the "bind-to-core" option so I really doubt if my
changes are causing issues here.

--td
On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:


Hi,

I am facing a problem with a test that runs fine on some nodes, and
fails on others.

I have a heterogenous cluster, with 3 types of nodes:
1) Single socket , 4 cores
2) 2 sockets, 4cores per socket
3) 2 sockets, 6 cores/socket

I am using:
. salloc to allocate the nodes,
. mpirun binding/mapping options "-bind-to-socket -bysocket"

# salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900

This command fails if the allocated node is of type #1 (single socket/4
cpus).
BTW, in that case orte_show_help is referencing a tag
("could-not-bind-to-socket") that does not exist in
help-odls-default.txt.

While it succeeds when run on nodes of type #2 or 3.
I think a "bind to socket" should not return an error on a single socket
machine, but rather be a noop.

The problem comes from the test
OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
called in odls_default_fork_local_proc() after the binding to the
processors socket has been done:
========
  <snip>
  OPAL_PAFFINITY_CPU_ZERO(mask);
  for (n=0; n < orte_default_num_cores_per_socket; n++) {
      <snip>
      OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
  }
  /* if we did not bind it anywhere, then that is an error */
  OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
  if (!bound) {
      orte_show_help("help-odls-default.txt",
                     "odls-default:could-not-bind-to-socket", true);
      ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
  }
========
OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there bits set in
the mask *AND* the number of bits set is lesser than the number of cpus
on the machine. Thus on a single socket, 4 cores machine the test will
fail. While on other the kinds of machines it will succeed.

Again, I think the problem could be solved by changing the alogrithm,
and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
noop.

Another solution could be to call the test
OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
bound (orte_odls_globals.bound). Actually that is the only case where I
see a justification to this test (see attached patch).

And may be both solutions could be mixed.

Regards,
Nadia


--
Nadia Derbey <nadia.der...@bull.net>
<001_fix_process_binding_test.patch>_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
<Mail Attachment.gif>
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

____________________________________________________________________

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Nadia Derbey <nadia.der...@bull.net>

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle * - Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>

Reply via email to