In a multi-job environment, can't we just start binding processes on the first available, unused socket? I mean, the first job/user would start binding itself from socket 0, and the next job/user would start binding itself from socket 2, for instance.

Lenny.
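P.S. A rough sketch of what I mean, in case it helps. This is illustrative only: the fixed 4-socket/4-core layout, the start-socket argument, and the wraparound handling are all assumptions, and the hard part (the launcher or resource manager actually knowing which sockets earlier jobs took) is left out. Linux-only, since it uses sched_setaffinity; the local rank comes from Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fixed layout: 4 sockets x 4 contiguous core IDs each.
 * Real code would discover this from the hardware topology. */
#define NUM_SOCKETS      4
#define CORES_PER_SOCKET 4

/* Bind the calling process to every core of one socket. */
static int bind_to_socket(int sock)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < CORES_PER_SOCKET; c++)
        CPU_SET(sock * CORES_PER_SOCKET + c, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = self */
}

int main(int argc, char **argv)
{
    /* start_socket would have to come from the launcher/RM, which is
     * the hard part: someone must know which sockets earlier jobs
     * already took.  Here it is just argv[1]. */
    int start_socket = (argc > 1) ? atoi(argv[1]) : 0;

    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;

    /* Job A launched with start_socket 0 uses sockets 0,1,...;
     * job B launched with start_socket 2 uses sockets 2,3,...
     * (oversubscription after wraparound is ignored in this sketch). */
    int sock = (start_socket + local_rank) % NUM_SOCKETS;
    if (bind_to_socket(sock) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("local rank %d bound to socket %d\n", local_rank, sock);
    return 0;
}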
On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

> On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:
>
> > Chris Samuel wrote:
> >
> > > ----- "Eugene Loh" <eugene....@sun.com> wrote:
> > >
> > > > This is an important discussion.
> > >
> > > Indeed! My big fear is that people won't pick up the significance
> > > of the change and will complain about performance regressions
> > > in the middle of an OMPI stable release cycle.
> > >
> > > > 2) The proposed OMPI bind-to-socket default is less severe. In the
> > > > general case, it would allow multiple jobs to bind in the same way
> > > > without oversubscribing any core or socket. (This comment added to
> > > > the trac ticket.)
> > >
> > > That's a nice clarification, thanks. I suspect, though, that the
> > > same issue we have with MVAPICH would occur if two 4-core jobs
> > > both bound themselves to the first socket.
> >
> > Okay, so let me point out a second distinction from MVAPICH: the
> > default policy would be to spread out over sockets.
> >
> > Let's say you have two sockets with four cores each, and you submit
> > two four-core jobs. The first job would put two processes on the
> > first socket and two processes on the second. The second job would
> > do the same. The loading would be even.
> >
> > I'm not saying there couldn't be problems. It's just that MVAPICH2
> > (at least what I looked at) has multiple shortfalls. Its binding
> > fills up one socket after another (which decreases memory bandwidth
> > per process and increases the chances of collisions with other jobs),
> > and it binds to cores (increasing the chances of oversubscribing
> > them). The proposed OMPI behavior distributes over sockets (improving
> > memory bandwidth per process and reducing collisions with other jobs)
> > and binds to sockets (reducing the chances of oversubscribing cores,
> > whether due to other MPI jobs or to multithreaded processes). So the
> > proposed OMPI behavior mitigates those problems.
> >
> > It would be even better to have binding selections adapt to other
> > bindings on the system.
> >
> > In any case, regardless of what the best behavior is, I appreciate
> > the point about changing behavior in the middle of a stable release.
> > Arguably, leaving significant performance on the table in typical
> > situations is a bug that warrants fixing even in the middle of a
> > release, but I won't try to settle that debate here.
>
> I think the problem here, Eugene, is that performance benchmarks are
> far from the typical application. We have repeatedly seen this -
> optimizing for benchmarks frequently makes applications run less
> efficiently. So I concur with Chris on this one - let's not go -too-
> benchmark-happy and hurt the regular users.
>
> Here at LANL, binding to-socket instead of to-core hurts performance
> by ~5-10%, depending on the specific application. Of course, either
> binding method is superior to no binding at all...
>
> UNLESS you have a threaded application, in which case -any- binding
> can be highly detrimental to performance.
>
> So going slow on this makes sense. If we provide the capability but
> leave it off by default, then people can test it against real
> applications and see the impact. Then we can better assess the right
> default settings.
>
> Ralph
>
> > > > 3) Defaults (if I understand correctly) can be set differently
> > > > on each cluster.
> > >
> > > Yes, but the defaults should be sensible for the majority of
> > > clusters. If the majority do indeed share nodes between jobs,
> > > then I would suggest that the default should be off and the
> > > minority who don't share nodes should have to enable it.
> >
> > In debates on this subject, I've heard people argue that:
> >
> > *) Though nodes are getting fatter, most are still thin.
> >
> > *) Resource managers tend to space-share the cluster.
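For concreteness, the default Eugene describes above -- spread a job's local ranks round-robin across sockets and bind each process to a whole socket rather than to one core -- might look roughly like the sketch below. This is illustrative only, not Open MPI's actual implementation: it assumes the hwloc library is available (HWLOC_OBJ_PACKAGE is hwloc's name for a socket) and that the node-local rank is exported via OMPI_COMM_WORLD_LOCAL_RANK. Build with -lhwloc.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
    if (nsockets <= 0) {
        fprintf(stderr, "no socket objects found\n");
        return 1;
    }

    /* Open MPI exports the node-local rank in the environment. */
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;

    /* Round-robin over sockets: on a 2-socket node, local ranks
     * 0,1,2,3 land on sockets 0,1,0,1 -- two processes per socket,
     * each free to float among that socket's cores, as in the
     * two-jobs-of-four example above. */
    hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PACKAGE,
                                             local_rank % nsockets);

    /* Bind to the socket's whole cpuset, not to a single core. */
    if (hwloc_set_cpubind(topo, sock->cpuset, HWLOC_CPUBIND_PROCESS) < 0) {
        perror("hwloc_set_cpubind");
        return 1;
    }

    hwloc_topology_destroy(topo);
    return 0;
}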