Thanks, Ralph. I'll check it next Monday.
Tetsuya
Should be fixed with r32058
On Jun 20, 2014, at 4:13 AM, tmish...@jcity.maeda.co.jp wrote:
Hi Ralph,
By the way, something is wrong with your latest rmaps_rank_file.c.
I've got the error below. I'm trying to find the problem, but you
could find it more quickly...
[mishima@manage trial]$ cat rankfile
rank 0=node05 slot=0-1
rank 1=node05 slot=3-4
rank 2=node05 slot=6-7
[mishima@manage ...
Hmmm... this is a tough one. It basically comes down to what we mean by relative
locality. Initially, we meant "at what level do these procs share cpus" -
however, coll/ml is using it as "at what level are these procs commonly bound".
Subtle difference, but significant. For example, two procs bound to disjoint
cores of the same socket share no cpus, yet they are commonly bound at the
socket level.
Your proposed version imp...
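[To make the distinction concrete, here is a minimal sketch of the two notions.
This is my own illustration against the hwloc 1.x API; the helper names are
hypothetical, not OMPI code.]

#include <hwloc.h>

/* "Share cpus": deepest object covering the intersection of the two
 * bindings; NULL when the bindings are disjoint. */
static hwloc_obj_t shared_cpu_level(hwloc_topology_t topo,
                                    hwloc_const_cpuset_t s1,
                                    hwloc_const_cpuset_t s2)
{
    hwloc_bitmap_t ix = hwloc_bitmap_alloc();
    hwloc_obj_t obj = NULL;
    hwloc_bitmap_and(ix, s1, s2);
    if (!hwloc_bitmap_iszero(ix))
        obj = hwloc_get_obj_covering_cpuset(topo, ix);
    hwloc_bitmap_free(ix);
    return obj;
}

/* "Commonly bound": deepest object covering the union of the two
 * bindings - e.g. the socket, for two procs bound to disjoint cores
 * of one socket, even though shared_cpu_level() returns NULL. */
static hwloc_obj_t commonly_bound_level(hwloc_topology_t topo,
                                        hwloc_const_cpuset_t s1,
                                        hwloc_const_cpuset_t s2)
{
    hwloc_bitmap_t un = hwloc_bitmap_alloc();
    hwloc_obj_t obj;
    hwloc_bitmap_or(un, s1, s2);
    obj = hwloc_get_obj_covering_cpuset(topo, un);
    hwloc_bitmap_free(un);
    return obj;
}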
Ralph,
Attached is a patch that fixes/works around my issue.
This is more of a proof of concept, so I did not commit it to the trunk.
Basically:
opal_hwloc_base_get_relative_locality(topo, set1, set2)
sets the locality based on the deepest element that is part of both set1
and set2.
in m...
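[In case it helps the discussion, here is a guess at the shape of that logic;
the real patch is in Gilles's attachment and may well differ. The idea is to
walk down from the root and keep the deepest object whose cpuset intersects
both bindings.]

#include <hwloc.h>

/* Sketch only - assumes both bindings are non-empty subsets of the
 * machine cpuset, so the root object always qualifies. */
static hwloc_obj_t deepest_common(hwloc_topology_t topo,
                                  hwloc_const_cpuset_t set1,
                                  hwloc_const_cpuset_t set2)
{
    hwloc_obj_t obj = hwloc_get_root_obj(topo), deepest = NULL;

    while (obj != NULL) {
        hwloc_obj_t child, next = NULL;
        deepest = obj;                 /* obj intersects both bindings */
        for (child = obj->first_child; child; child = child->next_sibling) {
            if (child->cpuset != NULL &&
                hwloc_bitmap_intersects(child->cpuset, set1) &&
                hwloc_bitmap_intersects(child->cpuset, set2)) {
                next = child;          /* descend along this branch */
                break;
            }
        }
        obj = next;
    }
    return deepest;
}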
Ralph,
My test VM is a single socket with four cores.
Here is something odd I just found when running mpirun -np 2
intercomm_create:
tasks [0,1] are bound on cpus [0,1] => OK
tasks [2,3] (first spawn) are bound on cpus [2,3] => OK
tasks [4,5] (second spawn) are not bound (and the cpuset is [0-3]) => OK
in omp...
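[For reference, a stripped-down spawn-twice program in the spirit of that
test; the actual intercomm_create test does more (e.g. MPI_Intercomm_merge),
this sketch only reproduces the process layout described above.]

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, child;
    int i;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent job (tasks 0-1, launched with mpirun -np 2):
         * spawn 2 children twice -> tasks 2-3, then tasks 4-5.
         * On a 4-core VM the second spawn oversubscribes the node,
         * which is why those tasks end up unbound. */
        for (i = 0; i < 2; i++) {
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &child,
                           MPI_ERRCODES_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}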
I'm not sure, but I guess it's related to Gilles's ticket.
It's quite a bad binding pattern, as Ralph pointed out, so
checking for that condition and disqualifying coll/ml could
be a practical solution as well.
Tetsuya
It is related, but it means that coll/ml has a higher degree of sensitivity to
the binding pattern than what you reported (which was that coll/ml doesn't work
with unbound processes). What we are now seeing is that coll/ml also doesn't
work when processes are bound across sockets.
Which means t...
Ralph and Tetsuya,
Is this related to the hang I reported at
http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
Nathan already replied he is working on a fix.
Cheers,
Gilles
On 2014/06/20 11:54, Ralph Castain wrote:
My guess is that the coll/ml component may have problems with binding a single
process across multiple cores like that - it might be that we'll have to have
it check for that condition and disqualify itself. It is a particularly bad
binding pattern, though, as shared memory gets completely messed up.
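[If coll/ml does end up disqualifying itself, the check could look something
like the sketch below. This is purely illustrative - the function name and
the surrounding component-query logic are hypothetical, not actual coll/ml
code. It uses the hwloc 1.x HWLOC_OBJ_SOCKET type.]

#include <hwloc.h>

/* Return 1 if a binding spans more than one socket - the pattern
 * described above. */
static int binding_spans_sockets(hwloc_topology_t topo,
                                 hwloc_const_cpuset_t binding)
{
    int n = 0;
    hwloc_obj_t socket = NULL;

    while ((socket = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_SOCKET,
                                                socket)) != NULL) {
        if (hwloc_bitmap_intersects(socket->cpuset, binding))
            n++;
    }
    return n > 1;   /* e.g. slot=0-1 -> 1 socket; slot=3-4 -> 2 sockets */
}

[A component could then return a low priority, or decline outright, from its
query function whenever any local peer reports such a binding.]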
Hi folks,
Recently I have been seeing a hang with the trunk when I specify a
particular binding by use of a rankfile or "-map-by slot".
This can be reproduced by a rankfile which allocates a process beyond
a socket boundary. For example, on node05, which has 2 sockets with
4 cores each, rank 1 is allocated beyond the socket boundary
(slot=3-4 in the rankfile above spans cores on both sockets).