Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread tmishima
Thanks Ralph. I'll check it next Monday. Tetsuya > Should be fixed with r32058 > > > On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > Hi Ralph, > > > > By the way, something is wrong with your latest rmaps_rank_file.c. > > I've got the error below. I'm trying to fi

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Ralph Castain
Should be fixed with r32058 On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > By the way, something is wrong with your latest rmaps_rank_file.c. > I've got the error below. I'm trying to find the problem. But, you > could find it more quickly... > > [mishima@m

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread tmishima
Hi Ralph,

By the way, something is wrong with your latest rmaps_rank_file.c. I've got the error below. I'm trying to find the problem. But, you could find it more quickly...

[mishima@manage trial]$ cat rankfile
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7
[mishima@manage

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Ralph Castain
Hmmm...this is a tough one. It basically comes down to what we mean by relative locality. Initially, we meant "at what level do these procs share cpus" - however, coll/ml is using it as "at what level are these procs commonly bound". Subtle difference, but significant. Your proposed version imp
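The two readings of "relative locality" described above can be illustrated with a toy model. This is only a sketch under stated assumptions: cpusets are plain bitmasks, the node has 2 sockets of 4 contiguously numbered cores, and every name below (cpuset_t, core_range, share_socket, commonly_bound_to_socket) is hypothetical and is not part of hwloc or Open MPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed toy topology: 2 sockets, 4 cores each, cores numbered 0..7. */
enum { NUM_SOCKETS = 2, CORES_PER_SOCKET = 4 };

/* Hypothetical cpuset type: bit i set means core i is in the set. */
typedef uint32_t cpuset_t;

/* Build the cpuset covering the inclusive core range [first, last]. */
static cpuset_t core_range(int first, int last)
{
    cpuset_t set = 0;
    for (int core = first; core <= last; ++core) {
        set |= (cpuset_t)1 << core;
    }
    return set;
}

/* Cpuset of all cores that belong to the given socket. */
static cpuset_t socket_cpuset(int socket)
{
    int first = socket * CORES_PER_SOCKET;
    return core_range(first, first + CORES_PER_SOCKET - 1);
}

/* Reading 1, "share cpus": some socket holds at least one core of each set. */
static bool share_socket(cpuset_t a, cpuset_t b)
{
    for (int s = 0; s < NUM_SOCKETS; ++s) {
        cpuset_t sock = socket_cpuset(s);
        if ((a & sock) != 0 && (b & sock) != 0) {
            return true;
        }
    }
    return false;
}

/* Reading 2, "commonly bound": some socket fully contains both sets.
 * Assumes both sets are non-empty. */
static bool commonly_bound_to_socket(cpuset_t a, cpuset_t b)
{
    for (int s = 0; s < NUM_SOCKETS; ++s) {
        cpuset_t sock = socket_cpuset(s);
        if ((a & ~sock) == 0 && (b & ~sock) == 0) {
            return true;
        }
    }
    return false;
}
```

With the bindings from the rankfile in this thread, core_range(0, 1) and core_range(3, 4) satisfy share_socket, because both touch the first socket, but not commonly_bound_to_socket, because the second set also reaches into the other socket. That is the kind of case where the two readings diverge.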

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph, Attached is a patch that fixes/works around my issue. This is more of a proof of concept, so I did not commit it to the trunk. Basically: opal_hwloc_base_get_relative_locality (topo, set1, set2) sets the locality based on the deepest element that is part of both set1 and set2. in m

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph,

my test VM is single socket four cores. here is something odd i just found when running mpirun -np 2 intercomm_create.

tasks [0,1] are bound on cpus [0,1] => OK
tasks[2-3] (first spawn) are bound on cpus [2,3] => OK
tasks[4-5] (second spawn) are not bound (and cpuset is [0-3]) => OK

in omp

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread tmishima
I'm not sure, but I guess it's related to Gilles's ticket. It's quite a bad binding pattern, as Ralph pointed out, so checking for that condition and disqualifying coll/ml could be a practical solution as well. Tetsuya > It is related, but it means that coll/ml has a higher degree of sensitivity

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-19 Thread Ralph Castain
It is related, but it means that coll/ml has a higher degree of sensitivity to the binding pattern than what you reported (which was that coll/ml doesn't work with unbound processes). What we are now seeing is that coll/ml also doesn't work when processes are bound across sockets. Which means t

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-19 Thread Gilles Gouaillardet
Ralph and Tetsuya, is this related to the hang i reported at http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ? Nathan already replied he is working on a fix. Cheers, Gilles On 2014/06/20 11:54, Ralph Castain wrote: > My guess is that the coll/ml component may have problems wit

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-19 Thread Ralph Castain
My guess is that the coll/ml component may have problems with binding a single process across multiple cores like that - it might be that we'll have to have it check for that condition and disqualify itself. It is a particularly bad binding pattern, though, as shared memory gets completely messe

[OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-19 Thread tmishima
Hi folks, Recently I have been seeing a hang with trunk when I specify a particular binding using a rankfile or "-map-by slot". This can be reproduced with a rankfile that allocates a process beyond a socket boundary. For example, on node05, which has 2 sockets with 4 cores, the rank 1 is allo
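The reproduction condition, a slot range that goes beyond a socket boundary, can be checked with a few lines of C. This is a minimal sketch, assuming that "2 sockets with 4 cores" means 4 cores per socket and that cores are numbered contiguously socket by socket (0-3, then 4-7). The helper names socket_of and crosses_socket are hypothetical and do not come from Open MPI.

```c
#include <stdbool.h>

/* Assumed layout: 4 cores per socket, numbered contiguously per socket. */
enum { CORES_PER_SOCKET = 4 };

/* Socket index that a core belongs to under the assumed numbering. */
static int socket_of(int core)
{
    return core / CORES_PER_SOCKET;
}

/* True when the inclusive slot range [first, last] spans more than one
 * socket. Because numbering is contiguous, comparing the sockets of the
 * two endpoints is enough. */
static bool crosses_socket(int first, int last)
{
    return socket_of(first) != socket_of(last);
}
```

Applied to the slot ranges in the rankfile posted in the 2014-06-20 follow-up (0-1, 3-4, 6-7), only the range 3-4 crosses a socket boundary under these assumptions.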