Just to check: is this with the latest trunk? Brad and Terry have been making changes to this section of code, including modifying the PROCESS_IS_BOUND test...
On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:

> Hi,
>
> I am facing a problem with a test that runs fine on some nodes, and
> fails on others.
>
> I have a heterogeneous cluster, with 3 types of nodes:
> 1) Single socket, 4 cores
> 2) 2 sockets, 4 cores per socket
> 3) 2 sockets, 6 cores per socket
>
> I am using:
> . salloc to allocate the nodes,
> . mpirun binding/mapping options "-bind-to-socket -bysocket"
>
> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>
> This command fails if the allocated node is of type #1 (single socket/4
> cpus).
> BTW, in that case orte_show_help is referencing a tag
> ("could-not-bind-to-socket") that does not exist in
> help-odls-default.txt.
>
> It succeeds when run on nodes of type #2 or #3.
> I think a "bind to socket" should not return an error on a single-socket
> machine, but rather be a noop.
>
> The problem comes from the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> called in odls_default_fork_local_proc() after the binding to the
> processor's socket has been done:
> ========
> <snip>
> OPAL_PAFFINITY_CPU_ZERO(mask);
> for (n=0; n < orte_default_num_cores_per_socket; n++) {
>     <snip>
>     OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> }
> /* if we did not bind it anywhere, then that is an error */
> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> if (!bound) {
>     orte_show_help("help-odls-default.txt",
>                    "odls-default:could-not-bind-to-socket", true);
>     ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> }
> ========
> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set
> in the mask *AND* the number of bits set is less than the number of cpus
> on the machine. Thus on a single-socket, 4-core machine the test will
> fail, while on the other kinds of machines it will succeed.
>
> Again, I think the problem could be solved by changing the algorithm,
> and assuming that ORTE_BIND_TO_SOCKET, on a single-socket machine =
> noop.
>
> Another solution could be to call the test
> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we are
> already bound (orte_odls_globals.bound). Actually, that is the only case
> where I see a justification for this test (see attached patch).
>
> And maybe both solutions could be combined.
>
> Regards,
> Nadia
>
> --
> Nadia Derbey <nadia.der...@bull.net>
> <001_fix_process_binding_test.patch>