On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> Just to check: is this with the latest trunk? Brad and Terry have been making 
> changes to this section of code, including modifying the PROCESS_IS_BOUND 
> test...
> 
> 

Well, it was on the v1.5. But I just checked: looks like
  1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
     odls_default_fork_local_proc()
  2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But, I'll give it a try with the latest trunk.
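
For reference, here is a minimal standalone sketch of the test as I
describe it in my original mail below. It is a model under stated
assumptions, not the actual macro: cpu_mask_t, count_bits(),
process_is_bound(), socket_mask() and socket_binding_ok() are all
hypothetical names, the mask is assumed to fit in 64 bits, and the
cores of the socket are assumed to be numbered from 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the paffinity mask: one bit per cpu, max 64. */
typedef unsigned long long cpu_mask_t;

/* Count the bits set in the mask. */
static int count_bits(cpu_mask_t mask)
{
    int n = 0;
    while (mask) {
        n += (int)(mask & 1ULL);
        mask >>= 1;
    }
    return n;
}

/* Model of the test as described: "bound" means at least one bit is set
 * AND fewer bits are set than there are cpus on the machine. */
static bool process_is_bound(cpu_mask_t mask, int num_cpus)
{
    int n = count_bits(mask);
    return n > 0 && n < num_cpus;
}

/* Mask obtained when binding to one socket (assumed: cores 0..n-1). */
static cpu_mask_t socket_mask(int cores_per_socket)
{
    cpu_mask_t mask = 0;
    assert(cores_per_socket > 0 && cores_per_socket <= 64);
    for (int n = 0; n < cores_per_socket; n++) {
        mask |= 1ULL << n;
    }
    return mask;
}

/* Sketch of the first proposed fix: on a single-socket machine, binding
 * to the socket is a noop, so only require a non-empty mask there. */
static bool socket_binding_ok(cpu_mask_t mask, int num_sockets, int num_cpus)
{
    if (num_sockets == 1) {
        return count_bits(mask) > 0;
    }
    return process_is_bound(mask, num_cpus);
}
```

With this model, process_is_bound(socket_mask(4), 4) is false for the
type #1 node (4 bits set, 4 cpus), while it is true for type #2
(4 bits, 8 cpus) and type #3 (6 bits, 12 cpus). socket_binding_ok()
accepts all three.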

Regards,
Nadia

> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> 
> > Hi,
> > 
> > I am facing a problem with a test that runs fine on some nodes, and
> > fails on others.
> > 
> > I have a heterogeneous cluster, with 3 types of nodes:
> > 1) 1 socket, 4 cores
> > 2) 2 sockets, 4 cores per socket
> > 3) 2 sockets, 6 cores per socket
> > 
> > I am using:
> > . salloc to allocate the nodes,
> > . mpirun binding/mapping options "-bind-to-socket -bysocket"
> > 
> > # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> > 
> > This command fails if the allocated node is of type #1 (single socket/4
> > cpus).
> > BTW, in that case orte_show_help is referencing a tag
> > ("could-not-bind-to-socket") that does not exist in
> > help-odls-default.txt.
> > 
> > It succeeds when run on nodes of type #2 or #3.
> > I think a "bind to socket" should not return an error on a
> > single-socket machine, but rather be a noop.
> > 
> > The problem comes from the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > called in odls_default_fork_local_proc() after the binding to the
> > processor's socket has been done:
> > ========
> >    <snip>
> >    OPAL_PAFFINITY_CPU_ZERO(mask);
> >    for (n=0; n < orte_default_num_cores_per_socket; n++) {
> >        <snip>
> >        OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> >    }
> >    /* if we did not bind it anywhere, then that is an error */
> >    OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >    if (!bound) {
> >        orte_show_help("help-odls-default.txt",
> >                       "odls-default:could-not-bind-to-socket", true);
> >        ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> >    }
> > ========
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits
> > set in the mask *AND* the number of bits set is less than the number
> > of cpus on the machine. Thus on a single-socket, 4-core machine the
> > test will fail: all 4 bits are set, which equals the number of cpus.
> > On the other kinds of machines it will succeed.
> > 
> > Again, I think the problem could be solved by changing the algorithm,
> > and assuming that ORTE_BIND_TO_SOCKET, on a single-socket machine =
> > noop.
> > 
> > Another solution could be to call the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> > bound (orte_odls_globals.bound). Actually that is the only case where I
> > see a justification for this test (see attached patch).
> > 
> > And maybe both solutions could be combined.
> > 
> > Regards,
> > Nadia
> > 
> > 
> > -- 
> > Nadia Derbey <nadia.der...@bull.net>
> > <001_fix_process_binding_test.patch>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
-- 
Nadia Derbey <nadia.der...@bull.net>
