In a multi-job environment, can't we just start binding processes on the
first available and unused socket?
I mean the first job/user will start binding itself from socket 0, and
the next job/user will start binding itself from socket 2, for instance.
Lenny.
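
A minimal sketch of the suggestion above, assuming some shared per-node record of which sockets are already in use. The thread does not say how jobs would learn about each other's bindings, so the `node` list and the `claim_sockets` helper are hypothetical and are not part of Open MPI:

```python
def claim_sockets(socket_in_use, sockets_needed):
    """Claim the first `sockets_needed` unused sockets and return their indices.

    `socket_in_use` is a hypothetical shared list of booleans, one per socket.
    Returns an empty list (and claims nothing) if too few sockets are free.
    """
    free = [i for i, in_use in enumerate(socket_in_use) if not in_use]
    if len(free) < sockets_needed:
        return []
    chosen = free[:sockets_needed]
    for i in chosen:
        socket_in_use[i] = True  # mark the socket as taken for later jobs
    return chosen


# Hypothetical node with four sockets, all initially unused.
node = [False, False, False, False]
job_a = claim_sockets(node, 2)  # first job starts binding at socket 0
job_b = claim_sockets(node, 2)  # next job starts binding at socket 2
print(job_a, job_b)
```

Here the first job ends up on sockets 0 and 1 and the second on sockets 2 and 3. A third request finds no free socket and gets an empty list back, so a real policy would still need a fallback for a fully claimed node.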

On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

>
> On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:
>
>  Chris Samuel wrote:
>
> ----- "Eugene Loh" <eugene....@sun.com> wrote:
>
>
>  This is an important discussion.
>
>
>  Indeed! My big fear is that people won't pick up the significance
> of the change and will complain about performance regressions
> in the middle of an OMPI stable release cycle.
>
>  2) The proposed OMPI bind-to-socket default is less severe. In the
> general case, it would allow multiple jobs to bind in the same way
> without oversubscribing any core or socket. (This comment added to
> the trac ticket.)
>
>
>  That's a nice clarification, thanks. I suspect though that the
> same issue we have with MVAPICH would occur if two 4 core jobs
> both bound themselves to the first socket.
>
>
>  Okay, so let me point out a second distinction from MVAPICH:  the default
> policy would be to spread out over sockets.
>
> Let's say you have two sockets, with four cores each.  Let's say you submit
> two four-core jobs.  The first job would put two processes on the first
> socket and two processes on the second.  The second job would do the same.
> The loading would be even.
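
The two-socket example above can be checked with a small simulation. This is only a sketch: the round-robin rule (rank i goes to socket i modulo the socket count) is an assumed way of "spreading out over sockets", and the fill rule models the earlier concern about two four-core jobs both binding to the first socket. Neither function is Open MPI or MVAPICH code.

```python
from collections import Counter

NUM_SOCKETS = 2
CORES_PER_SOCKET = 4


def spread_over_sockets(num_procs, num_sockets):
    """Assumed spread policy: rank i is placed on socket i % num_sockets."""
    return [rank % num_sockets for rank in range(num_procs)]


def fill_sockets(num_procs, cores_per_socket):
    """Assumed fill policy: fill one socket completely before using the next."""
    return [rank // cores_per_socket for rank in range(num_procs)]


def socket_load(placements):
    """Count processes per socket across all jobs on the node."""
    load = Counter()
    for placement in placements:
        load.update(placement)
    return [load[s] for s in range(NUM_SOCKETS)]


# Two four-process jobs, each choosing its placement without seeing the other.
spread = socket_load([spread_over_sockets(4, NUM_SOCKETS)] * 2)
filled = socket_load([fill_sockets(4, CORES_PER_SOCKET)] * 2)
print("spread:", spread)
print("filled:", filled)
```

Under these assumptions the spread policy leaves four processes on each four-core socket, while the fill policy puts all eight processes on the first socket and none on the second.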
>
> I'm not saying there couldn't be problems.  It's just that MVAPICH2 (at
> least what I looked at) has multiple shortfalls.  The binding is to fill up
> one socket after another (which decreases memory bandwidth per process and
> increases chances of collisions with other jobs) and binding is to core
> (increasing chances of oversubscribing cores).  The proposed OMPI behavior
> distributes over sockets (improving memory bandwidth per process and
> reducing collisions with other jobs) and binding is to sockets (reducing
> chances of oversubscribing cores, whether due to other MPI jobs or due to
> multithreaded processes).  So, the proposed OMPI behavior mitigates the
> problems.
>
> It would be even better to have binding selections adapt to other bindings
> on the system.
>
> In any case, regardless of what the best behavior is, I appreciate the
> point about changing behavior in the middle of a stable release.  Arguably,
> leaving significant performance on the table in typical situations is a bug
> that warrants fixing even in the middle of a release, but I won't try to
> settle that debate here.
>
>
> I think the problem here, Eugene, is that performance benchmarks are far
> from the typical application. We have repeatedly seen this - optimizing for
> benchmarks frequently makes applications run less efficiently. So I concur
> with Chris on this one - let's not go -too- benchmark happy and hurt the
> regular users.
>
> Here at LANL, binding to-socket instead of to-core hurts performance by
> ~5-10%, depending on the specific application. Of course, either binding
> method is superior to no binding at all...
>
> UNLESS you have a threaded application, in which case -any- binding can be
> highly detrimental to performance.
>
> So going slow on this makes sense. If we provide the capability, but leave
> it off by default, then people can test it against real applications and see
> the impact. Then we can better assess the right default settings.
>
> Ralph
>
>
>  3) Defaults (if I understand correctly) can be set differently
> on each cluster.
>
>
>  Yes, but the defaults should be sensible for the majority of
> clusters.  If the majority do indeed share nodes between jobs
> then I would suggest that the default should be off and the
> minority who don't share nodes should have to enable it.
>
>
>  In debates on this subject, I've heard people argue that:
>
> *) Though nodes are getting fatter, most are still thin.
>
> *) Resource managers tend to space share the cluster.
>  _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
