Let me know if Nadia can help here, Ralph.
Josh
On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain wrote:
>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
>
>> The design is supposed
On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
> The design is supposed to be that each node knows precisely how many daemons
> are involved in each collective, and who is going to talk to them.
>
> ok, but in the
Ralph,
On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain wrote:
> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.
ok, but the design does not ensure that things will happen in the right
order:
-
The design is supposed to be that each node knows precisely how many daemons
are involved in each collective, and who is going to talk to them. The
signature contains the info required to ensure the receiver knows which
collective this message relates to, and just happens to also allow them to
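A minimal sketch of the signature idea described above; the struct and function names are hypothetical illustrations, not the actual ORTE grpcomm code:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* A collective "signature": the sorted list of participating process ids.
 * Illustrative only; the real signature carries ORTE process names. */
typedef struct {
    uint32_t *procs;
    size_t    nprocs;
} coll_signature_t;

static int signature_equal(const coll_signature_t *a, const coll_signature_t *b)
{
    return a->nprocs == b->nprocs &&
           0 == memcmp(a->procs, b->procs, a->nprocs * sizeof(uint32_t));
}

/* Pending collectives tracked by the daemon. */
typedef struct pending_coll {
    coll_signature_t     sig;
    struct pending_coll *next;
} pending_coll_t;

/* Match an incoming fragment to a collective by its signature, regardless
 * of which daemon sent it. */
static pending_coll_t *match_collective(pending_coll_t *list,
                                        const coll_signature_t *incoming)
{
    for (pending_coll_t *p = list; NULL != p; p = p->next) {
        if (signature_equal(&p->sig, incoming)) {
            return p;
        }
    }
    return NULL;   /* no collective with this signature has started yet */
}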
Ralph,
you are right, this was definitely not the right fix (at least with 4
nodes or more).
I finally understood what is going wrong here:
to make it simple, the allgather recursive doubling algo is not
implemented with MPI_Recv(..., peer, ...)-like functions but with
MPI_Recv(..., MPI_ANY_SOURCE, ..
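For reference, a textbook recursive-doubling allgather receives from a specific peer (rank XOR 2^k) in every round; the sketch below uses MPI_Sendrecv with an explicit peer, which is the contrast to the MPI_ANY_SOURCE-style receive described above. It assumes a power-of-two number of ranks and one int per rank, and it is not the grpcomm/rcd code:

#include <mpi.h>

/* Recursive-doubling allgather sketch: in round k each rank exchanges the
 * contiguous block it has gathered so far with peer = rank ^ (1 << k).
 * Receiving from the explicit peer keeps the rounds ordered; a receive on
 * MPI_ANY_SOURCE could instead match a message from a later round. */
static void rd_allgather(int myval, int *all, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);          /* assumed to be a power of two */

    all[rank] = myval;
    int have = 1;                        /* size of the block owned so far */
    int base = rank;                     /* start index of that block */

    for (int mask = 1; mask < size; mask <<= 1) {
        int peer      = rank ^ mask;
        int peer_base = base ^ mask;     /* block the peer sends to us */
        MPI_Sendrecv(&all[base],      have, MPI_INT, peer, mask,
                     &all[peer_base], have, MPI_INT, peer, mask,
                     comm, MPI_STATUS_IGNORE);
        base &= ~mask;                   /* merged block starts lower */
        have <<= 1;
    }
}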
Yeah, that's not the right fix, I'm afraid. I've made the direct component the
default again until I have time to dig into this deeper.
On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
> it does not inv
Ralph,
the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
it does not invoke pmix_server_release,
because allgather_stub was not previously invoked since the fence
was not yet entered.
/* in rcd_finalize_coll, coll->cbfunc is NULL */
the attached patch is likely not the rig
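A schematic of the ordering problem described here, with hypothetical names rather than the actual orted code: if the daemon-level collective completes before the local fence was entered, no completion callback is registered yet, so the release has to be parked instead of dropped:

#include <stddef.h>
#include <stdbool.h>

typedef void (*release_fn_t)(void *cbdata);

typedef struct {
    release_fn_t cbfunc;    /* registered only once the local fence is entered */
    void        *cbdata;
    bool         complete;  /* daemon-level collective already finished */
} coll_t;

/* Runs when the daemon-level collective completes (cf. rcd_finalize_coll). */
static void finalize_coll(coll_t *coll)
{
    coll->complete = true;
    if (NULL != coll->cbfunc) {
        coll->cbfunc(coll->cbdata);   /* normal path: release the waiter */
    }
    /* else: nobody has entered the fence yet; keep the result for later
     * instead of silently dropping the release (which hangs the fence). */
}

/* Runs when the local process finally enters the fence. */
static void fence_entered(coll_t *coll, release_fn_t fn, void *cbdata)
{
    coll->cbfunc = fn;
    coll->cbdata = cbdata;
    if (coll->complete) {
        coll->cbfunc(coll->cbdata);   /* collective already done: release now */
    }
}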
Ralph,
things got worse indeed :-(
now a simple hello world involving two hosts hangs in mpi_init.
there is still a race condition: if task a calls fence long after task b,
then task b will never leave the fence.
I'll try to debug this ...
Cheers,
Gilles
On 2014/09/11 2:36, Ralph Castain wro
I think I now have this fixed - let me know what you see.
On Sep 9, 2014, at 6:15 AM, Ralph Castain wrote:
> Yeah, that's not the correct fix. The right way to fix it is for all three
> components to have their own RML tag, and for each of them to establish a
> persistent receive. They then c
Yeah, that's not the correct fix. The right way to fix it is for all three
components to have their own RML tag, and for each of them to establish a
persistent receive. They then can use the signature to tell which collective
the incoming message belongs to.
I'll fix it, but it won't be until t
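A sketch of that approach with hypothetical names (the real code would go through the ORTE RML, which is not reproduced here): the component posts one persistent receive on its own tag, and the signature carried in each message selects the collective it belongs to:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MY_COMPONENT_TAG 42           /* one dedicated tag per component */
#define MAX_COLLS 16                  /* array bound check omitted for brevity */

typedef struct {
    uint64_t sig;                     /* stands in for the participant list */
    int      frags;
} coll_t;

static coll_t pending[MAX_COLLS];
static int    npending;

/* The signature, not the sender, decides which collective a message feeds. */
static coll_t *lookup_or_create(uint64_t sig)
{
    for (int i = 0; i < npending; i++) {
        if (pending[i].sig == sig) {
            return &pending[i];
        }
    }
    pending[npending].sig = sig;
    return &pending[npending++];
}

/* One callback serves every collective that uses this component. */
static void component_recv_cb(uint64_t sig, const void *msg, size_t len)
{
    coll_t *coll = lookup_or_create(sig);
    coll->frags++;
    (void)msg; (void)len;
    printf("tag %d: fragment %d for collective %llu\n",
           MY_COMPONENT_TAG, coll->frags, (unsigned long long)sig);
}

/* Stand-in for posting a persistent non-blocking receive on a tag. */
static void post_persistent_recv(int tag,
                                 void (*cb)(uint64_t, const void *, size_t))
{
    (void)tag; (void)cb;              /* registration only in this sketch */
}

static void component_init(void)
{
    /* Posted once at component start; stays armed across collectives. */
    post_persistent_recv(MY_COMPONENT_TAG, component_recv_cb);
}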
Folks,
Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case from the IBM test suite.
when invoked on two nodes:
- the program hangs with -np 2
- the program can crash with -np > 2
error message is
[n
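The attached spawn.c and the full error output are not part of this excerpt; for context, a generic MPI_Comm_spawn test of the same shape (the parent spawns copies of the same binary, then both sides disconnect) looks roughly like the sketch below. It could be run across two nodes with something like "mpirun -np 2 -host node0,node1 ./spawn" (host names hypothetical).

/* Generic MPI_Comm_spawn smoke test; illustrative only, not the attached
 * spawn.c from the IBM test suite. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, inter;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent side: spawn two copies of this same binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        printf("parent %d: spawn done\n", rank);
        MPI_Comm_disconnect(&inter);
    } else {
        /* child side: report and disconnect from the parent */
        printf("child %d: spawned\n", rank);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}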