> On Aug 19, 2016, at 4:24 PM, r...@open-mpi.org wrote:
> 
> Hi folks
> 
> I had a question arise regarding a problem being seen by an OMPI user - has 
> to do with the old bugaboo I originally dealt with back in my LANL days. The 
> problem is with an app that repeatedly hammers on a collective, and gets 
> overwhelmed by unexpected messages when one of the procs falls behind.

I did some investigation on Roadrunner several years ago and determined that 
the user code issue coll/sync was attempting to fix was caused by a bug in 
ob1/cksum (I really can’t remember). coll/sync was simply masking a livelock 
problem. I committed a workaround for the bug in r26575 
(https://github.com/open-mpi/ompi/commit/59e529cf1dfe986e40d14ec4d2a2e5ef0cea5e35)
and tested it with the user code. After this change the user code ran fine 
without coll/sync. Since LANL no longer had any users of coll/sync, we stopped 
supporting it.

> I solved this back then by introducing the “sync” component in ompi/mca/coll, 
> which injected a barrier operation every N collectives. You could even “tune” 
> it by doing the injection for only specific collectives.
> 
> However, I can no longer find that component in the code base - I find it in 
> the 1.6 series, but someone removed it during the 1.7 series.
> 
> Can someone tell me why this was done??? Is there any reason not to bring it 
> back? It solves a very real, not uncommon, problem.
> Ralph

This was discussed during one (or several) telecons years ago. We agreed to 
kill it and bring it back if 1) there is a use case, and 2) someone is willing 
to support it. See 
https://github.com/open-mpi/ompi/commit/5451ee46bd6fcdec002b333474dec919475d2d62

Can you link the user email?

-Nathan
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
