On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> Just to check: is this with the latest trunk? Brad and Terry have been making
> changes to this section of code, including modifying the PROCESS_IS_BOUND
> test...
>
Well, it was on the v1.5. But I just checked, and it looks like:
1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
   odls_default_fork_local_proc()
2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way

But I'll give it a try with the latest trunk.

Regards,
Nadia

> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>
> > Hi,
> >
> > I am facing a problem with a test that runs fine on some nodes, and
> > fails on others.
> >
> > I have a heterogeneous cluster, with 3 types of nodes:
> > 1) Single socket, 4 cores
> > 2) 2 sockets, 4 cores per socket
> > 3) 2 sockets, 6 cores per socket
> >
> > I am using:
> > . salloc to allocate the nodes,
> > . mpirun binding/mapping options "-bind-to-socket -bysocket"
> >
> > # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> >
> > This command fails if the allocated node is of type #1 (single socket/4
> > cpus), while it succeeds when run on nodes of type #2 or #3.
> > BTW, in the failing case orte_show_help is referencing a tag
> > ("could-not-bind-to-socket") that does not exist in
> > help-odls-default.txt.
> >
> > I think a "bind to socket" should not return an error on a single socket
> > machine, but rather be a noop.
> >
> > The problem comes from the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> > called in odls_default_fork_local_proc() after the binding to the
> > processor's socket has been done:
> > ========
> > <snip>
> >     OPAL_PAFFINITY_CPU_ZERO(mask);
> >     for (n=0; n < orte_default_num_cores_per_socket; n++) {
> >         <snip>
> >         OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> >     }
> >     /* if we did not bind it anywhere, then that is an error */
> >     OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >     if (!bound) {
> >         orte_show_help("help-odls-default.txt",
> >                        "odls-default:could-not-bind-to-socket", true);
> >         ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> >     }
> > ========
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set
> > in the mask *AND* the number of bits set is less than the number of cpus
> > on the machine. Thus on a single-socket, 4-core machine the test will
> > fail, while on the other kinds of machines it will succeed.
> >
> > Again, I think the problem could be solved by changing the algorithm,
> > and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
> > noop.
> >
> > Another solution could be to call the test
> > OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> > bound (orte_odls_globals.bound). Actually that is the only case where I
> > see a justification for this test (see attached patch).
> >
> > And maybe both solutions could be combined.
> >
> > Regards,
> > Nadia
> >
> > --
> > Nadia Derbey <nadia.der...@bull.net>
> > <001_fix_process_binding_test.patch>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
--
Nadia Derbey <nadia.der...@bull.net>
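
To make the failure mode described in the quoted message concrete, here is a minimal C sketch. It is not the Open MPI code: it only models the behaviour the message attributes to OPAL_PAFFINITY_PROCESS_IS_BOUND() (bound means at least one bit set and fewer bits set than there are cpus). The names cpu_mask_t, count_set, is_bound_strict, socket_mask and socket_binding_ok are hypothetical, and the sketch assumes cores are numbered contiguously per socket, which real hardware does not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a paffinity mask: one bit per cpu (max 64). */
typedef uint64_t cpu_mask_t;

/* Count the bits set in the mask. */
int count_set(cpu_mask_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Model of the test as described in the message: "bound" only if some
 * bits are set AND fewer bits are set than there are cpus on the node. */
bool is_bound_strict(cpu_mask_t mask, int num_cpus)
{
    assert(num_cpus > 0 && num_cpus <= 64);
    int set = count_set(mask);
    return set > 0 && set < num_cpus;
}

/* Build the mask for every core of one socket.
 * Assumption: cores are numbered contiguously, socket by socket. */
cpu_mask_t socket_mask(int socket, int cores_per_socket)
{
    assert(socket >= 0 && cores_per_socket > 0);
    assert((socket + 1) * cores_per_socket <= 64);
    cpu_mask_t mask = 0;
    for (int n = 0; n < cores_per_socket; n++) {
        mask |= (cpu_mask_t)1 << (socket * cores_per_socket + n);
    }
    return mask;
}

/* One possible variant of the first suggestion: on a single-socket node,
 * binding to the socket is treated as a successful no-op. */
bool socket_binding_ok(cpu_mask_t mask, int num_cpus, int num_sockets)
{
    if (count_set(mask) == 0) {
        return false;       /* nothing was set: still an error */
    }
    if (num_sockets == 1) {
        return true;        /* the whole node is the socket */
    }
    return is_bound_strict(mask, num_cpus);
}
```

With these definitions, is_bound_strict(socket_mask(0, 4), 4) is false for the single-socket, 4-core node, because all 4 of 4 cpus are set. The same call with 8 cpus (2 sockets of 4 cores) is true, which matches the behaviour reported for the three node types.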