Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17, 2009, at 7:59 PM, Chris Samuel wrote:

> Ah, I think I've misunderstood the website then. :-( It calls 1.3
> stable and 1.2 old and I presumed old meant deprecated. :-(

To clarify... 1.3 *is* stable, meaning "ok for production use." We test all 1.3 releases before they go out, they undergo regression testing, etc. We have two different version series:

1. Odd minor numbers (e.g., 1.3.x): the feature release series. These are stable and usable, but features may come and go between successive releases. To be clear: "feature release series" does not mean "beta" or "we haven't tested this much".

2. Even minor numbers (e.g., 1.4.x): the super stable series. Only bug fixes are applied; no feature additions or subtractions will occur.

Both series get the same level of testing before they are released. The difference is mainly a classification indicating whether new features can be added / subtracted or not.

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Eugene Loh" wrote: > Actually, the current proposed defaults for 1.3.4 are > not to change the defaults at all. Thanks, I hadn't picked up on the latest update to the trac ticket 3 days ago that says that the defaults will stay the same. Sounds good to me! All the best and have a good weekend all! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
Chris Samuel wrote:
> - "Chris Samuel" wrote:
>> $ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4
>
> To clarify - does that command line accurately reflect the proposed defaults for OMPI 1.3.4 ?

Not the verbose/reporting options. Actually, the current proposed defaults for 1.3.4 are not to change the defaults at all.
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Chris Samuel" wrote:
> $ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4

To clarify - does that command line accurately reflect the proposed defaults for OMPI 1.3.4 ?

cheers, Chris

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Chris Samuel" wrote:
> This is most likely because it's getting an error from the
> kernel when trying to bind to a socket it's not permitted
> to access.

This is what strace reports:

18561 sched_setaffinity(18561, 8, { f0 }
18561 <... sched_setaffinity resumed> ) = -1 EINVAL (Invalid argument)

so that would appear to be it.

cheers, Chris

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Eugene Loh" wrote:
> Ah, you're missing the third secret safety switch that prevents
> hapless mortals from using this stuff accidentally! :^)

Sounds good to me. :-)

> I think you need to add
>
> --mca opal_paffinity_alone 1

Yup, looks like that's it; it fails to launch with that:

$ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4
[tango095.vpac.org:18548] mca:base:select:( odls) Querying component [default]
[tango095.vpac.org:18548] mca:base:select:( odls) Query of component [default] set priority to 1
[tango095.vpac.org:18548] mca:base:select:( odls) Selected component [default]
[tango095.vpac.org:18548] [[33990,0],0] odls:launch: spawning child [[33990,1],0]
[tango095.vpac.org:18548] [[33990,0],0] odls:launch: spawning child [[33990,1],1]
[tango095.vpac.org:18548] [[33990,0],0] odls:default:fork binding child [[33990,1],0] to socket 0 cpus 000f
[tango095.vpac.org:18548] [[33990,0],0] odls:default:fork binding child [[33990,1],1] to socket 1 cpus 00f0
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI
developers.
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it
encountered an error on node tango095.vpac.org. More information
may be available above.
--------------------------------------------------------------------------
4 total processes failed to start

This is most likely because it's getting an error from the kernel when trying to bind to a socket it's not permitted to access.

cheers, Chris

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
Chris Samuel wrote:
> OK, grabbed that (1.4a1r21825). Configured with:
>
> ./configure --prefix=$FOO --with-openib --with-tm=/usr/local/torque/latest --enable-static --enable-shared
>
> It built & installed OK, but when running a trivial example with it I don't see evidence for that code getting called. Perhaps I'm not passing the correct options ?
>
> $ mpiexec -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4

Ah, you're missing the third secret safety switch that prevents hapless mortals from using this stuff accidentally! :^) I think you need to add

--mca opal_paffinity_alone 1

a name that not even Ralph himself likes!
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Ralph Castain" wrote:
> Hi Chris

Hiya,

> The devel trunk has all of this in it - you can get that tarball from
> the OMPI web site (take the nightly snapshot).

OK, grabbed that (1.4a1r21825). Configured with:

./configure --prefix=$FOO --with-openib --with-tm=/usr/local/torque/latest --enable-static --enable-shared

It built & installed OK, but when running a trivial example with it I don't see evidence for that code getting called. Perhaps I'm not passing the correct options ?

$ mpiexec -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4
[tango095.vpac.org:16976] mca:base:select:( odls) Querying component [default]
[tango095.vpac.org:16976] mca:base:select:( odls) Query of component [default] set priority to 1
[tango095.vpac.org:16976] mca:base:select:( odls) Selected component [default]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child [[36578,1],0]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child [[36578,1],1]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child [[36578,1],2]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child [[36578,1],3]
Process 0 on tango095.vpac.org
Process 1 on tango095.vpac.org
Process 2 on tango095.vpac.org
Process 3 on tango095.vpac.org
^Cmpiexec: killing job...

Increasing odls_base_verbose only seems to add the environment being passed to the child processes. :-( I'm pretty sure I've got the right code as ompi_info -a reports the debug setting from the patch:

MCA odls: parameter "odls_base_report_bindings" (current value: <0>, data source: default value)

> I plan to work on cpuset support beginning Tues morning.

Great, anything I can help with then please let me know, I'll be back from leave by then.

All the best, Chris

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
Hi Chris The devel trunk has all of this in it - you can get that tarball from the OMPI web site (take the nightly snapshot). I plan to work on cpuset support beginning Tues morning. Ralph On Aug 17, 2009, at 7:18 PM, Chris Samuel wrote: - "Eugene Loh" wrote: Hi Eugene, [...] It would be even better to have binding selections adapt to other bindings on the system. Indeed! This touches on the earlier thread about making OMPI aware of its cpuset/cgroup allocation on the node (for those sites that are using it), it might solve this issue quite nicely as OMPI would know precisely what cores & sockets were allocated for its use without having to worry about other HPC processes. No idea how to figure that out for processes outside of cpusets. :-( In any case, regardless of what the best behavior is, I appreciate the point about changing behavior in the middle of a stable release. Not a problem, and I take Jeff's point about 1.3 not being a super stable release and thus not being a blocker to changes such as this. Arguably, leaving significant performance on the table in typical situations is a bug that warrants fixing even in the middle of a release, but I won't try to settle that debate here. I agree for those cases where there's no downside, and thinking further on your point of balancing between sockets I can see why that would limit the impact. Most of the cases I can think of that would be most adversely affected are down to other jobs binding to cores naively and if that's happening outside of cpusets then the cluster sysadmin has more to worry about from mixing those applications than mixing with OMPI ones which are just binding to sockets. :-) So I'll happily withdraw my objection on those grounds. *But* I would like to test this code out on a cluster with cpuset support enabled to see whether it will behave itself. 
Basically if I run a 4 core MPI job on a dual socket system which has been allocated only the cores on socket 0 what will happen when it tries to bind to socket 1 which is outside its cpuset ? Is there a 1.3 branch or tarball with these patches applied that I could test out ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Eugene Loh" wrote: Hi Eugene, [...] > It would be even better to have binding selections adapt to other > bindings on the system. Indeed! This touches on the earlier thread about making OMPI aware of its cpuset/cgroup allocation on the node (for those sites that are using it), it might solve this issue quite nicely as OMPI would know precisely what cores & sockets were allocated for its use without having to worry about other HPC processes. No idea how to figure that out for processes outside of cpusets. :-( > In any case, regardless of what the best behavior is, I appreciate > the point about changing behavior in the middle of a stable release. Not a problem, and I take Jeff's point about 1.3 not being a super stable release and thus not being a blocker to changes such as this. > Arguably, leaving significant performance on the table in typical > situations is a bug that warrants fixing even in the middle of a > release, but I won't try to settle that debate here. I agree for those cases where there's no downside, and thinking further on your point of balancing between sockets I can see why that would limit the impact. Most of the cases I can think of that would be most adversely affected are down to other jobs binding to cores naively and if that's happening outside of cpusets then the cluster sysadmin has more to worry about from mixing those applications than mixing with OMPI ones which are just binding to sockets. :-) So I'll happily withdraw my objection on those grounds. *But* I would like to test this code out on a cluster with cpuset support enabled to see whether it will behave itself. Basically if I run a 4 core MPI job on a dual socket system which has been allocated only the cores on socket 0 what will happen when it tries to bind to socket 1 which is outside its cpuset ? Is there a 1.3 branch or tarball with these patches applied that I could test out ? 
cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17, 2009, at 5:59 PM, Chris Samuel wrote:

> - "Jeff Squyres" wrote:
>> An important point to raise here: the 1.3 series is *not* the super
>> stable series. It is the *feature* series. Specifically: it is not
>> out of scope to introduce or change features within the 1.3 series.
>
> Ah, I think I've misunderstood the website then. :-( It calls 1.3
> stable and 1.2 old and I presumed old meant deprecated. :-(

Old = I wouldn't use it, given the choice :-)

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Jeff Squyres" wrote: > An important point to raise here: the 1.3 series is *not* the super > stable series. It is the *feature* series. Specifically: it is not > out of scope to introduce or change features within the 1.3 series. Ah, I think I've misunderstood the website then. :-( It calls 1.3 stable and 1.2 old and I presumed old meant deprecated. :-( -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17 2009, Paul H. Hargrove wrote:

> + I wonder if one can do any "introspection" with the dynamic linker to detect hybrid OpenMP (no "I") apps and avoid pinning them by default (examining OMP_NUM_THREADS in the environment is no good, since that variable may have a site default value other than 1 or empty). To me this is the most obvious class of application that will suffer from imposing pinning by default.

This is a bit off-thread, but my experience with tuning 'threading' (mainly OpenMP) is that it makes tuning processes (e.g. MPI) look trivial. You need affinity even more than you do for processes, but few operating systems provide a way of binding threads to cores. You can try tweaking the POSIX scheduling parameters, but I failed to find a system on which they were connected to anything. All right, this is all a little out of date now, but I'll bet it hasn't changed much.

That being so, a reasonable test would be to check for ANY secondary thread in the process and/or any threading call, and to throw in the towel at that point. I don't know ELF, but the latter can be done in most reasonably advanced linkers (by using weak externals). Despite their uncleanliness, some heuristics of this nature are probably the only viable solution, for the reasons that Jeff described. I stand by my term "gratuitous hack"!

Regards, Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
Jeff,

Jeff Squyres wrote:
> [...] ignored it whenever presenting competitive data. The 1,000,000th time I saw this, I gave up arguing that our competitors were not being fair and simply changed our defaults to always leave memory pinned for OpenFabrics-based networks.

Instead, you should have told them that caching memory registration is unsafe and asked them why they don't care if their customers don't get the right answer. And then you would follow up by asking if they actually have a way to check that there is no data corruption. It's not really FUD, it's tit for tat :-)

> 2. Even if you tag someone in public for not being fair, they always say the same thing, "Oh sorry, my mistake" (regardless of whether they actually forgot or did it intentionally). I told several competitors *many times* that they had to use leave_pinned, but in all public comparison numbers, they never did. Hence, they always looked better.

Looked better on what, micro-benchmarks? The same micro-benchmarks that have already been manipulated to death, like OSU using a stream-based bandwidth test to hide the start-up overhead? If the option improves real applications at large, then it should be on by default and there is no debate (users should never have to know about knobs). If it is only for micro-benchmarks, stand your ground and do the right thing. It does not do the community any good if MPI implementations are tuned for a broken micro-benchmark penis contest. If you want to play that game, at least make your own micro-benchmarks.

Believe me, I know what it is to hear technical atrocities from these marketing idiots. There is nothing you can do: they are paid to talk and you are not. In the end, HPC gets what HPC deserves; people should do their homework.

For applications at large, performance gains due to core-binding are suspect. Memory-binding may have more spine, but the OS should already be able to do a good job with NUMA allocation and page migration.

> - The Linux scheduler does not / cannot optimize well for many HPC apps;
> binding definitely helps in many scenarios (not just benchmarks).

Then fix the Linux scheduler. Only the OS scheduler can do a meaningful resource allocation, because it sees everything and you don't.

Patrick
Re: [OMPI devel] Heads up on new feature to 1.3.4
Some more thoughts in this thread that I've not seen expressed yet (perhaps I missed them):

+ Some argue that this change in the middle of a stable series may, to some users, appear to be a performance regression when they update. However, I would argue that if the alternative is to delay this feature until the next stable release, it will STILL appear to those same users to be a performance regression when they upgrade. If the choice is between sooner or later, I would vote for sooner.

+ I wonder if one can do any "introspection" with the dynamic linker to detect hybrid OpenMP (no "I") apps and avoid pinning them by default (examining OMP_NUM_THREADS in the environment is no good, since that variable may have a site default value other than 1 or empty). To me this is the most obvious class of application that will suffer from imposing pinning by default.

+ The question of round-robin-by-core vs round-robin-by-socket is not fundamentally any different from the question of how to map one's tasks to flat-SMP nodes (cyclic, block or block-cyclic; XYZT vs TXYZ, etc.). There is NO universal right answer, and for better or worse the end-user who wants to maximize performance is going to need to either understand how their comms interact with task layout, or try different options until they are happy.

-Paul

-- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Tel: +1-510-495-2352 HPC Research Department Fax: +1-510-486-6900 Lawrence Berkeley National Laboratory
Re: [OMPI devel] Heads up on new feature to 1.3.4
Some very good points in this thread all round.

On Mon, 2009-08-17 at 09:00 -0400, Jeff Squyres wrote:
> This is probably not too surprising (i.e., allowing the OS to move
> jobs around between cores on a socket can probably involve a little
> cache thrashing, resulting in that 5-10% loss). I'm hand-waving here,
> and I have not tried this myself, but it's not too surprising of a
> result to me. It's also not too surprising that others don't see this
> effect at all (e.g., Sun didn't see any difference in bind-to-core vs.
> bind-to-socket) when they ran their tests. YMMV.
>
> I'd actually be in favor of a by-core binding (not by-socket), but
> spreading the processes out round robin by socket, not by core. All
> of this would be the *default* behavior, of course -- command line
> params/MCA params will be provided to change to whatever pattern is
> desired.

I'm in favour of by-core binding; if it's done correctly I've seen results that tie in with Ralph's 5-10% figure. If it's done incorrectly, however, it can be atrocious; the kernel scheduler may not be perfect but at least it's never bad.

One (small) point nobody has mentioned yet is that when using round-robin core binding some applications prefer you to round-robin by-socket and some prefer you to round-robin by-core. This will depend on their level of comms and any cache-sharing benefits. Perhaps this is the reason Ralph saw improvements but Sun didn't?

Ashley.

-- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17, 2009, at 3:23 PM, N.M. Maclaren wrote: >Yes, BUT... We had a similar option to this for a long, long time. Sorry, perhaps I should have spelled out what I meant by "mandatory". The system would not build (or run, depending on where it was set) without such a value being specified. There would be no default. Gotcha. I have another "but", though. :-) OMPI already has about a billion configurable parameters. If we *force* people to do something more than "mpirun -np x my_favorite_benchmark", then they'll say stuff like "we couldn't even get Open MPI to run" (I've seen people say that about other MPI's -- fortunately, I haven't heard that about Open MPI except where either OMPI legitimately had a bug or the user had something wrong in their setup). We work in a very nasty, competitive community. :-( -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17 2009, Jeff Squyres wrote: Yes, BUT... We had a similar option to this for a long, long time. Sorry, perhaps I should have spelled out what I meant by "mandatory". The system would not build (or run, depending on where it was set) without such a value being specified. There would be no default. Regards, Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17, 2009, at 12:11 PM, N.M. Maclaren wrote:

> 1) To have a mandatory configuration option setting the default, which
> would have a name like 'performance' for the binding option. YOU could
> then beat up anyone who benchmarkets without it for being biased. This
> is a better solution, but the "I shouldn't need to have to think just
> because I am doing something complicated" brigade would object.

Yes, BUT... We had a similar option to this for a long, long time. Marketing departments from other organizations / companies willfully ignored it whenever presenting competitive data. The 1,000,000th time I saw this, I gave up arguing that our competitors were not being fair and simply changed our defaults to always leave memory pinned for OpenFabrics-based networks.

To be clear: the option was "--mca mpi_leave_pinned 1" -- granted, the name wasn't as obvious as "--performance", but this option was widely publicized and it was easy to know that you should use it for benchmarks (with a name like --performance, the natural question will be "why don't you enable --performance by default? This means that OMPI has --no-performance by default...?").

I would tell person/marketer X at a conference, "Hey, you didn't run with leave_pinned; our numbers are much better than that." "Oh, sorry," they would inevitably say; "I'll fix it next time I make new slides." There are several problems that arise from this scenario:

1. The competitors aren't interested in being fair. Spin is everything. HPC is highly competitive.

2. Even if you tag someone in public for not being fair, they always say the same thing, "Oh sorry, my mistake" (regardless of whether they actually forgot or did it intentionally). I told several competitors *many times* that they had to use leave_pinned, but in all public comparison numbers, they never did. Hence, they always looked better. (/me takes a moment to calm down after venturing down memory lane of all the unfair comparisons made against OMPI... :-) )

3.
To some degree, "out of the box performance" *is* a compelling reason. Sure, I would hope that marketers and competitors would be ethical (they aren't, but you can hope anyway), but the naive / new user shouldn't need to know a million switches to get good performance. Having good / simple switches to optimize for different workloads is a good thing (e.g., Platform MPI has some nice options for this kind of stuff). But the bottom line is that you can't rely on someone running anything other than "mpirun -np x my_favorite_benchmark".

Also, as an aside to many of the other posts, yes, this is a complex issue. But:

- We're only talking about defaults, not absolute behavior. If you want or need to disable/change this behavior, you certainly can.

- It's been stated a few times, but I feel that this is important: most other MPI's bind by default. They're deriving performance benefits from this. We're not. Open MPI has to be competitive (or my management will ask me, "Why are you working on that crappy MPI?").

- The Linux scheduler does not / cannot optimize well for many HPC apps; binding definitely helps in many scenarios (not just benchmarks).

- Of course you can construct scenarios where things break / perform badly. Particularly if you do Wrong Things. If you do Wrong Things, you should be punished (e.g., via bad performance). It's not the software's fault if you choose to bind 10 threads to 1 core. It's not the software's fault if you're on a large SMP and you choose to dedicate all of the processors to HPC apps and don't leave any for the OS (particularly if you have a lot of OS activity). And so on. Of course, we should do a good job of trying to do reasonable things by default (e.g., not binding 10 threads to one core by default), and we should provide options (sometimes automatic) for disabling those reasonable things if we can't do them well. But sometimes we *do* have to rely on the user telling us things.

- I took Ralph's previous remarks as a general statement about threading being problematic to any form of binding. I talked to him on the phone -- he actually had a specific case in mind (what I would consider Wrong Behavior: binding N threads to 1 core).

- Ralph and I chatted earlier; I would be ok to wait for the other 2 pieces of functionality to come in before we make binding occur by default:

  1. coordinate between multiple OMPI jobs on the same node to ensure not to bind to the same cores (or at least print a warning)

  2. follow the binding directives of resource managers (SLURM, Torque, etc.)

Sun is free to make binding-by-default in the ClusterTools distribution if/whenever they want, of course. I fully understand their reasoning for doing so. They're also in a better position to coach their users on when to use which options, etc. because they have direct contact with
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17 2009, Ralph Castain wrote:

> At issue for us is that other MPIs -do- bind by default, thus creating an apparent performance advantage for themselves compared to us on standard benchmarks run "out-of-the-box". We repeatedly get beat up in papers and elsewhere over our performance, when many times the major difference is in the default binding. If we bind the same way they do, then the performance gap disappears or is minimal.

The two obvious gratuitous hacks that I can see to tackle this are:

1) To have a mandatory configuration option setting the default, which would have a name like 'performance' for the binding option. YOU could then beat up anyone who benchmarkets without it for being biased. This is a better solution, but the "I shouldn't need to have to think just because I am doing something complicated" brigade would object.

2) To use a heuristic to choose which algorithm to select, based on the core count, number of users, load averages, number of active non-root processes and similar unreliable indicators of the purpose for which the system is being used. It should be chosen so that it doesn't behave TOO badly when it gets it wrong, as it will, but so that it gets the case of a dedicated benchmarketing system right most of the time.

Regards, Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
I don't disagree with your statements. However, I was addressing the specific question of two OpenMPI programs conflicting on process placement, not the overall question you are raising.

The issue of when/if to bind has been debated for a long time. I agree that having more options (bind-to-socket, bind-to-core, etc.) makes sense and that the choice of a default is difficult, for all the reasons that have been raised in this thread.

At issue for us is that other MPIs -do- bind by default, thus creating an apparent performance advantage for themselves compared to us on standard benchmarks run "out-of-the-box". We repeatedly get beat up in papers and elsewhere over our performance, when many times the major difference is in the default binding. If we bind the same way they do, then the performance gap disappears or is minimal. So this is why we are wrestling with this issue.

I'm not sure of the best compromise here, but I think people have raised good points on all sides. Unfortunately, there isn't a perfect answer... :-/ Certainly, I have no clue what it would be! Not that smart :-)

Ralph

On Mon, Aug 17, 2009 at 9:12 AM, N.M. Maclaren wrote:
> On Aug 17 2009, Ralph Castain wrote:
>> The problem is that the two mpiruns don't know about each other, and
>> therefore the second mpirun doesn't know that another mpirun has already
>> used socket 0.
>>
>> We hope to change that at some point in the future.
>
> It won't help. The problem is less likely to be that two jobs are running
> OpenMPI programs (that have been recently linked!), but that the other tasks
> are not OpenMPI at all. I have mentioned daemons, kernel threads and so on,
> but think of shared-memory parallel programs (OpenMP etc.) and so on; a LOT
> of applications nowadays include some sort of threading.
>
> For the ordinary multi-user system, you don't want any form of binding. The
> scheduler is rickety enough as it is, without confusing it further. That
> may change as the consequences of serious levels of multiple cores force
> that area to be improved, but don't hold your breath. And I haven't a clue
> which of the many directions scheduler design will go!
>
> I agree that having an option, and having it easy to experiment with, is
> the right way to go. What the default should be is very much less clear.
>
> Regards,
> Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
In some of the experiments I've run and studied on exclusive binding to specific cores, the performance metrics (which have yielded both excellent gains as well as phases of reduced performance) have depended upon the nature of the experiment being run (a task partitioning problem) and how the experimental data was organized (a data partitioning problem). This is especially true when one considers the context in which the experiment was run - meaning what other experiments scheduled either concurrently or serially, the priorities of those experiments and the configuration of the cluster / MPI network at any given point in time. The approach we used was Bayesian. In other words, performance prediction was conditioned on patterns of structure and context from both forward in inverse Bayesian cycles. Ken Lloyd > -Original Message- > From: devel-boun...@open-mpi.org > [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres > Sent: Monday, August 17, 2009 7:01 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] Heads up on new feature to 1.3.4 > > On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote: > > > I think the problem here, Eugene, is that performance > benchmarks are > > far from the typical application. We have repeatedly seen this - > > optimizing for benchmarks frequently makes applications run less > > efficiently. So I concur with Chris on this one - let's not > go -too- > > benchmark happy and hurt the regular users. > > FWIW, I've seen processor binding help real user codes, too. > Indeed, on a system where an MPI job has exclusive use of the > node, how does binding hurt you? > > On nodes where multiple MPI jobs are running, if a resource > manager is being used, then we should be obeying what they > have specified for each job to use. We need a bit more work > in that direction to make that work, but it's very do-able. 
> > When resource managers are not used and multiple MPI jobs > share the same node, then OMPI has to coordinate amongst its > jobs to not oversubscribe cores (when possible). As Ralph > indicated in a later mail, we still need some work in this area, too. > > > Here at LANL, binding to-socket instead of to-core hurts > performance > > by ~5-10%, depending on the specific application. Of course, either > > binding method is superior to no binding at all... > > This is probably not too surprising (i.e., allowing the OS to > move jobs around between cores on a socket can probably > involve a little cache thrashing, resulting in that 5-10% > loss). I'm hand-waving here, and I have not tried this > myself, but it's not too surprising of a result to me. It's > also not too surprising that others don't see this effect at > all (e.g., Sun didn't see any difference in bind-to-core vs. > bind-to-socket) when they ran their tests. YMMV. > > I'd actually be in favor of a by-core binding (not > by-socket), but spreading the processes out round robin by > socket, not by core. All of this would be the *default* > behavior, of course -- command line params/MCA params will be > provided to change to whatever pattern is desired. > > > UNLESS you have a threaded application, in which case -any- binding > > can be highly detrimental to performance. > > I'm not quite sure I understand this statement. Binding is > not inherently contrary to multi-threaded applications. > > -- > Jeff Squyres > jsquy...@cisco.com
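Ken's mail doesn't spell out the Bayesian model used, but the basic idea of conditioning a performance prediction on observed context can be illustrated with a minimal discrete Bayes update. All hypotheses, priors, and likelihoods below are invented for illustration and are not from the original experiments:

```python
# Hypothetical sketch: condition a performance prediction on an observation.
# The labels and numbers are made up; the mail gives no details of the model.

def posterior(prior, likelihood, observation):
    """Single discrete Bayes update: P(h | obs) is proportional to P(obs | h) * P(h)."""
    unnorm = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Two hypotheses about binding on a particular (shared) node.
prior = {"binding_helps": 0.5, "binding_hurts": 0.5}

# P(observed runtime class | hypothesis), e.g. estimated from earlier runs.
likelihood = {
    "binding_helps": {"fast": 0.8, "slow": 0.2},
    "binding_hurts": {"fast": 0.3, "slow": 0.7},
}

# One "fast" run shifts belief toward binding helping in this context.
print(posterior(prior, likelihood, "fast"))
```

Repeating the update over successive runs (the "cycles" Ken mentions) would refine the prediction as the cluster context changes.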
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17 2009, Ralph Castain wrote: The problem is that the two mpiruns don't know about each other, and therefore the second mpirun doesn't know that another mpirun has already used socket 0. We hope to change that at some point in the future. It won't help. The problem is less likely to be that two jobs are running OpenMPI programs (that have been recently linked!), but that the other tasks are not OpenMPI at all. I have mentioned daemons, kernel threads and so on, but think of shared-memory parallel programs (OpenMP etc.) and so on; a LOT of applications nowadays include some sort of threading. For the ordinary multi-user system, you don't want any form of binding. The scheduler is rickety enough as it is, without confusing it further. That may change as the consequences of serious levels of multiple cores force that area to be improved, but don't hold your breath. And I haven't a clue which of the many directions scheduler design will go! I agree that having an option, and having it easy to experiment with, is the right way to go. What the default should be is very much less clear. Regards, Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
Jeff Squyres wrote: On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote: UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance. I'm not quite sure I understand this statement. Binding is not inherently contrary to multi-threaded applications. I think the concern is that if a process binds to a particular core and all threads inherit the same binding, then all threads will bind to the same core, inhibiting multithreading speedups (at best). If you bind to sockets rather than specific cores, even if multiple threads inherit the same binding, the contention will be less.
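Eugene's inheritance point can be seen directly on Linux: the CPU affinity mask of a thread is inherited by any threads it subsequently creates, so binding a process to one core pins all of its threads there, while a socket-wide mask leaves them room to spread. A minimal sketch, assuming a Linux host where os.sched_setaffinity is available:

```python
# Linux-only sketch: threads inherit the creating thread's affinity mask.
import os
import threading

def thread_affinity():
    """Report the affinity mask a freshly created thread inherits."""
    result = {}
    def worker():
        result["mask"] = os.sched_getaffinity(0)  # 0 = the calling thread
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["mask"]

original = os.sched_getaffinity(0)
try:
    one_core = {min(original)}        # "bind to core": a single-CPU mask
    os.sched_setaffinity(0, one_core)
    print("thread inherits:", thread_affinity())  # the same single core
finally:
    os.sched_setaffinity(0, original)  # restore the original mask
```

With a mask covering a whole socket instead of one core, the inherited mask still allows the OS to schedule the threads on different cores of that socket, which is exactly the reduced-contention argument above.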
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 17 2009, Jeff Squyres wrote: On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote: I think the problem here, Eugene, is that performance benchmarks are far from the typical application. We have repeatedly seen this - optimizing for benchmarks frequently makes applications run less efficiently. So I concur with Chris on this one - let's not go -too- benchmark happy and hurt the regular users. FWIW, I've seen processor binding help real user codes, too. Indeed, on a system where an MPI job has exclusive use of the node, how does binding hurt you? Here is how, and I can assure you that it's not nice, not at all; it can kill an application dead. I have some experience with running large SMP systems (Origin, SunFire F15K and POWER3/4 racks) and this area was a nightmare. Process A is bound, and is waiting briefly for a receive. All of the other cores are busy with the processes bound to them. There is then some action from another process, a daemon or a kernel thread that needs service from the kernel. So it starts a thread on process A's core. Unfortunately, this is a long-running thread (e.g. NFS) so, when the other processors finish, and A is the bottleneck, the whole job hangs until that kernel thread finishes. You can get a similar effect if process A is bound to a CPU which has an I/O device bound to it. When something else entirely starts hammering that device, even if it doesn't tie it up for long each time, bye-bye performance. This is typically a problem on multi-socket systems, of course, but could show up even on quite small ones. For this reason, many schedulers ignore binding hints when they 'think' they know better - and, no matter what the documentation says, hints is generally all they are. You can then get processes rotating round the processors, exercising the inter-cache buses nicely. In my experience, binding can sometimes make that more likely rather than less, and the best solutions are usually different.
Yes, I used binding, but it was hell to set up, and many people give up, saying that it degrades performance. I advise ordinary users to avoid it like the plague, and use more reliable tuning techniques. UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance. I'm not quite sure I understand this statement. Binding is not inherently contrary to multi-threaded applications. That is true. But see above. Another circumstance where that is true is when your application is an MPI one that calls SMP-enabled libraries; this is getting increasingly common. Binding can stop those using spare cores or otherwise confuse them; God help you if they start to use a 4-core algorithm on one core! Regards, Nick Maclaren.
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 16, 2009, at 8:56 PM, George Bosilca wrote: I tend to agree with Chris. Changing the behavior of the 1.3 in the middle of the stable release cycle, will be very confusing for our users. An important point to raise here: the 1.3 series is *not* the super stable series. It is the *feature* series. Specifically: it is not out of scope to introduce or change features within the 1.3 series. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote: I think the problem here, Eugene, is that performance benchmarks are far from the typical application. We have repeatedly seen this - optimizing for benchmarks frequently makes applications run less efficiently. So I concur with Chris on this one - let's not go -too- benchmark happy and hurt the regular users. FWIW, I've seen processor binding help real user codes, too. Indeed, on a system where an MPI job has exclusive use of the node, how does binding hurt you? On nodes where multiple MPI jobs are running, if a resource manager is being used, then we should be obeying what they have specified for each job to use. We need a bit more work in that direction to make that work, but it's very do-able. When resource managers are not used and multiple MPI jobs share the same node, then OMPI has to coordinate amongst its jobs to not oversubscribe cores (when possible). As Ralph indicated in a later mail, we still need some work in this area, too. Here at LANL, binding to-socket instead of to-core hurts performance by ~5-10%, depending on the specific application. Of course, either binding method is superior to no binding at all... This is probably not too surprising (i.e., allowing the OS to move jobs around between cores on a socket can probably involve a little cache thrashing, resulting in that 5-10% loss). I'm hand-waving here, and I have not tried this myself, but it's not too surprising of a result to me. It's also not too surprising that others don't see this effect at all (e.g., Sun didn't see any difference in bind-to-core vs. bind-to-socket) when they ran their tests. YMMV. I'd actually be in favor of a by-core binding (not by-socket), but spreading the processes out round robin by socket, not by core. All of this would be the *default* behavior, of course -- command line params/MCA params will be provided to change to whatever pattern is desired. 
UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance. I'm not quite sure I understand this statement. Binding is not inherently contrary to multi-threaded applications. -- Jeff Squyres jsquy...@cisco.com
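Jeff's proposed default (bind each process to a core, but hand cores out round robin across sockets) can be sketched as a simple mapping function. The flat socket/core numbering below is a hypothetical simplification of real topologies, not the actual OMPI mapper:

```python
# Sketch of the mapping Jeff describes: bind-to-core, but assign cores
# round robin across sockets so consecutive ranks land on different sockets.

def map_bysocket(nprocs, nsockets, cores_per_socket):
    """Return [(socket, core), ...] for each rank, round robin by socket."""
    placement = []
    next_core = [0] * nsockets          # next free core index on each socket
    for rank in range(nprocs):
        sock = rank % nsockets          # spread ranks across sockets first
        core = next_core[sock]
        next_core[sock] += 1
        placement.append((sock, sock * cores_per_socket + core))
    return placement

# A 4-process job on a 2-socket, 4-core/socket node: two processes per
# socket, so a second identical job binding the same way would still not
# oversubscribe either socket (Eugene's "even loading" example).
print(map_bysocket(4, 2, 4))  # [(0, 0), (1, 4), (0, 1), (1, 5)]
```

Contrast with the fill-one-socket-first policy discussed for MVAPICH, which would put all four ranks on socket 0 and collide with a second job doing the same.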
Re: [OMPI devel] Heads up on new feature to 1.3.4
The problem is that the two mpiruns don't know about each other, and therefore the second mpirun doesn't know that another mpirun has already used socket 0. We hope to change that at some point in the future. Ralph On Aug 17, 2009, at 4:02 AM, Lenny Verkhovsky wrote: In the multi job environment, can't we just start binding processes on the first available and unused socket? I mean first job/user will start binding itself from socket 0, the next job/user will start binding itself from socket 2, for instance. Lenny. On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain wrote: On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote: Chris Samuel wrote: - "Eugene Loh" wrote: This is an important discussion. Indeed! My big fear is that people won't pick up the significance of the change and will complain about performance regressions in the middle of an OMPI stable release cycle. 2) The proposed OMPI bind-to-socket default is less severe. In the general case, it would allow multiple jobs to bind in the same way without oversubscribing any core or socket. (This comment added to the trac ticket.) That's a nice clarification, thanks. I suspect though that the same issue we have with MVAPICH would occur if two 4 core jobs both bound themselves to the first socket. Okay, so let me point out a second distinction from MVAPICH: the default policy would be to spread out over sockets. Let's say you have two sockets, with four cores each. Let's say you submit two four-core jobs. The first job would put two processes on the first socket and two processes on the second. The second job would do the same. The loading would be even. I'm not saying there couldn't be problems. It's just that MVAPICH2 (at least what I looked at) has multiple shortfalls. The binding is to fill up one socket after another (which decreases memory bandwidth per process and increases chances of collisions with other jobs) and binding is to core (increasing chances of oversubscribing cores).
The proposed OMPI behavior distributes over sockets (improving memory bandwidth per process and reducing collisions with other jobs) and binding is to sockets (reducing chances of oversubscribing cores, whether due to other MPI jobs or due to multithreaded processes). So, the proposed OMPI behavior mitigates the problems. It would be even better to have binding selections adapt to other bindings on the system. In any case, regardless of what the best behavior is, I appreciate the point about changing behavior in the middle of a stable release. Arguably, leaving significant performance on the table in typical situations is a bug that warrants fixing even in the middle of a release, but I won't try to settle that debate here. I think the problem here, Eugene, is that performance benchmarks are far from the typical application. We have repeatedly seen this - optimizing for benchmarks frequently makes applications run less efficiently. So I concur with Chris on this one - let's not go -too- benchmark happy and hurt the regular users. Here at LANL, binding to-socket instead of to-core hurts performance by ~5-10%, depending on the specific application. Of course, either binding method is superior to no binding at all... UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance. So going slow on this makes sense. If we provide the capability, but leave it off by default, then people can test it against real applications and see the impact. Then we can better assess the right default settings. Ralph 3) Defaults (if I understand correctly) can be set differently on each cluster. Yes, but the defaults should be sensible for the majority of clusters. If the majority do indeed share nodes between jobs then I would suggest that the default should be off and the minority who don't share nodes should have to enable it.
In debates on this subject, I've heard people argue that: *) Though nodes are getting fatter, most are still thin. *) Resource managers tend to space share the cluster.
Re: [OMPI devel] Heads up on new feature to 1.3.4
In the multi job environment, can't we just start binding processes on the first available and unused socket? I mean first job/user will start binding itself from socket 0, the next job/user will start binding itself from socket 2, for instance. Lenny. On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain wrote: > > On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote: > > Chris Samuel wrote: > > - "Eugene Loh" wrote: > > > This is an important discussion. > > > Indeed! My big fear is that people won't pick up the significance > of the change and will complain about performance regressions > in the middle of an OMPI stable release cycle. > > 2) The proposed OMPI bind-to-socket default is less severe. In the > general case, it would allow multiple jobs to bind in the same way > without oversubscribing any core or socket. (This comment added to > the trac ticket.) > > > That's a nice clarification, thanks. I suspect though that the > same issue we have with MVAPICH would occur if two 4 core jobs > both bound themselves to the first socket. > > > Okay, so let me point out a second distinction from MVAPICH: the default > policy would be to spread out over sockets. > > Let's say you have two sockets, with four cores each. Let's say you submit > two four-core jobs. The first job would put two processes on the first > socket and two processes on the second. The second job would do the same. > The loading would be even. > > I'm not saying there couldn't be problems. It's just that MVAPICH2 (at > least what I looked at) has multiple shortfalls. The binding is to fill up > one socket after another (which decreases memory bandwidth per process and > increases chances of collisions with other jobs) and binding is to core > (increasing chances of oversubscribing cores).
The proposed OMPI behavior > distributes over sockets (improving memory bandwidth per process and > reducing collisions with other jobs) and binding is to sockets (reducing > changes of oversubscribing cores, whether due to other MPI jobs or due to > multithreaded processes). So, the proposed OMPI behavior mitigates the > problems. > > It would be even better to have binding selections adapt to other bindings > on the system. > > In any case, regardless of what the best behavior is, I appreciate the > point about changing behavior in the middle of a stable release. Arguably, > leaving significant performance on the table in typical situations is a bug > that warrants fixing even in the middle of a release, but I won't try to > settle that debate here. > > > I think the problem here, Eugene, is that performance benchmarks are far > from the typical application. We have repeatedly seen this - optimizing for > benchmarks frequently makes applications run less efficiently. So I concur > with Chris on this one - let's not go -too- benchmark happy and hurt the > regular users. > > Here at LANL, binding to-socket instead of to-core hurts performance by > ~5-10%, depending on the specific application. Of course, either binding > method is superior to no binding at all... > > UNLESS you have a threaded application, in which case -any- binding can be > highly detrimental to performance. > > So going slow on this makes sense. If we provide the capability, but leave > it off by default, then people can test it against real applications and see > the impact. Then we can better assess the right default settings. > > Ralph > > > 3) Defaults (if I understand correctly) can be set differently > on each cluster. > > > Yes, but the defaults should be sensible for the majority of > clusters. If the majority do indeed share nodes between jobs > then I would suggest that the default should be off and the > minority who don't share nodes should have to enable it. 
> > > In debates on this subject, I've heard people argue that: > > *) Though nodes are getting fatter, most are still thin. > > *) Resource managers tend to space share the cluster.
Re: [OMPI devel] Heads up on new feature to 1.3.4
On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote: Chris Samuel wrote: - "Eugene Loh" wrote: This is an important discussion. Indeed! My big fear is that people won't pick up the significance of the change and will complain about performance regressions in the middle of an OMPI stable release cycle. 2) The proposed OMPI bind-to-socket default is less severe. In the general case, it would allow multiple jobs to bind in the same way without oversubscribing any core or socket. (This comment added to the trac ticket.) That's a nice clarification, thanks. I suspect though that the same issue we have with MVAPICH would occur if two 4 core jobs both bound themselves to the first socket. Okay, so let me point out a second distinction from MVAPICH: the default policy would be to spread out over sockets. Let's say you have two sockets, with four cores each. Let's say you submit two four-core jobs. The first job would put two processes on the first socket and two processes on the second. The second job would do the same. The loading would be even. I'm not saying there couldn't be problems. It's just that MVAPICH2 (at least what I looked at) has multiple shortfalls. The binding is to fill up one socket after another (which decreases memory bandwidth per process and increases chances of collisions with other jobs) and binding is to core (increasing chances of oversubscribing cores). The proposed OMPI behavior distributes over sockets (improving memory bandwidth per process and reducing collisions with other jobs) and binding is to sockets (reducing chances of oversubscribing cores, whether due to other MPI jobs or due to multithreaded processes). So, the proposed OMPI behavior mitigates the problems. It would be even better to have binding selections adapt to other bindings on the system. In any case, regardless of what the best behavior is, I appreciate the point about changing behavior in the middle of a stable release.
Arguably, leaving significant performance on the table in typical situations is a bug that warrants fixing even in the middle of a release, but I won't try to settle that debate here. I think the problem here, Eugene, is that performance benchmarks are far from the typical application. We have repeatedly seen this - optimizing for benchmarks frequently makes applications run less efficiently. So I concur with Chris on this one - let's not go -too- benchmark happy and hurt the regular users. Here at LANL, binding to-socket instead of to-core hurts performance by ~5-10%, depending on the specific application. Of course, either binding method is superior to no binding at all... UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance. So going slow on this makes sense. If we provide the capability, but leave it off by default, then people can test it against real applications and see the impact. Then we can better assess the right default settings. Ralph 3) Defaults (if I understand correctly) can be set differently on each cluster. Yes, but the defaults should be sensible for the majority of clusters. If the majority do indeed share nodes between jobs then I would suggest that the default should be off and the minority who don't share nodes should have to enable it. In debates on this subject, I've heard people argue that: *) Though nodes are getting fatter, most are still thin. *) Resource managers tend to space share the cluster.
Re: [OMPI devel] Heads up on new feature to 1.3.4
Chris Samuel wrote: - "Eugene Loh" wrote: This is an important discussion. Indeed! My big fear is that people won't pick up the significance of the change and will complain about performance regressions in the middle of an OMPI stable release cycle. 2) The proposed OMPI bind-to-socket default is less severe. In the general case, it would allow multiple jobs to bind in the same way without oversubscribing any core or socket. (This comment added to the trac ticket.) That's a nice clarification, thanks. I suspect though that the same issue we have with MVAPICH would occur if two 4 core jobs both bound themselves to the first socket. Okay, so let me point out a second distinction from MVAPICH: the default policy would be to spread out over sockets. Let's say you have two sockets, with four cores each. Let's say you submit two four-core jobs. The first job would put two processes on the first socket and two processes on the second. The second job would do the same. The loading would be even. I'm not saying there couldn't be problems. It's just that MVAPICH2 (at least what I looked at) has multiple shortfalls. The binding is to fill up one socket after another (which decreases memory bandwidth per process and increases chances of collisions with other jobs) and binding is to core (increasing chances of oversubscribing cores). The proposed OMPI behavior distributes over sockets (improving memory bandwidth per process and reducing collisions with other jobs) and binding is to sockets (reducing chances of oversubscribing cores, whether due to other MPI jobs or due to multithreaded processes). So, the proposed OMPI behavior mitigates the problems. It would be even better to have binding selections adapt to other bindings on the system. In any case, regardless of what the best behavior is, I appreciate the point about changing behavior in the middle of a stable release.
Arguably, leaving significant performance on the table in typical situations is a bug that warrants fixing even in the middle of a release, but I won't try to settle that debate here. 3) Defaults (if I understand correctly) can be set differently on each cluster. Yes, but the defaults should be sensible for the majority of clusters. If the majority do indeed share nodes between jobs then I would suggest that the default should be off and the minority who don't share nodes should have to enable it. In debates on this subject, I've heard people argue that: *) Though nodes are getting fatter, most are still thin. *) Resource managers tend to space share the cluster.
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Eugene Loh" wrote: > This is an important discussion. Indeed! My big fear is that people won't pick up the significance of the change and will complain about performance regressions in the middle of an OMPI stable release cycle. > Do note: > > 1) Bind-to-core is actually the default behavior of many MPIs today. We had this issue with MVAPICH before we dumped it to go to OpenMPI as if we had (for example) two 4 core jobs running on the same node they'd both go at half speed whilst the node itself was 50% idle. Turned out they'd both bound to cores 0-3 leaving cores 4-7 unused. :-( Fortunately there was an undocumented environment variable that let us turn it off for all jobs, but getting rid of that misbehaviour was a major reason for switching to OpenMPI. > 2) The proposed OMPI bind-to-socket default is less severe. In the > general case, it would allow multiple jobs to bind in the same way > without oversubscribing any core or socket. (This comment added to > the trac ticket.) That's a nice clarification, thanks. I suspect though that the same issue we have with MVAPICH would occur if two 4 core jobs both bound themselves to the first socket. Thinking further, it would be interesting to find out how this code would behave on a system where cpusets is in use and so OMPI has to submit to the will of the scheduler regarding cores/sockets. > 3) Defaults (if I understand correctly) can be set differently > on each cluster. Yes, but the defaults should be sensible for the majority of clusters. If the majority do indeed share nodes between jobs then I would suggest that the default should be off and the minority who don't share nodes should have to enable it. There's also the issue of those users who (for whatever reason) like to build their own MPI stack and who are even less likely to understand the impact that they may have on others.. :-( cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
I tend to agree with Chris. Changing the behavior of the 1.3 series in the middle of the stable release cycle will be very confusing for our users. Moreover, as Ralph pointed out, everything in Open MPI is configurable so if we advertise this feature in the Changelog, the institutions where the nodes are not shared can easily amend their configuration files to take advantage of it. In particular, for Sun, if we push this feature in the 1.3.4 release, they can ship their version (derived from the 1.3.4) with the MCA parameter set to bind-to-whatever. We can bring this topic in the spotlight for the next cycle (1.4/1.5). george. On Aug 16, 2009, at 20:42, Chris Samuel wrote: - "Ralph Castain" wrote: Hi Chris Hiya Ralph, There would be a "-do-not-bind" option that will prevent us from binding processes to anything which should cover that situation. Gotcha. My point was only that we would be changing the out-of-the-box behavior to the opposite of today's, so all those such as yourself would now have to add the -do-not-bind MCA param to your default MCA param file. Doable - but it -is- a significant change in our out-of-the-box behavior. I think this is too big a change in the default behaviour for a stable release, it'll cause a lot of people pain for no readily apparent reason. I also believe that if those sites with multiple MPI jobs on nodes are indeed in the majority then it makes more sense to keep the default behaviour and have those who need this functionality enable it on their installs. Thoughts? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Ralph Castain" wrote: > Hi Chris Hiya Ralph, > There would be a "-do-not-bind" option that will prevent us from > binding processes to anything which should cover that situation. Gotcha. > My point was only that we would be changing the out-of-the-box > behavior to the opposite of today's, so all those such as yourself > would now have to add the -do-not-bind MCA param to your default MCA > param file. > > Doable - but it -is- a significant change in our out-of-the-box > behavior. I think this is too big a change in the default behaviour for a stable release, it'll cause a lot of people pain for no readily apparent reason. I also believe that if those sites with multiple MPI jobs on nodes are indeed in the majority then it makes more sense to keep the default behaviour and have those who need this functionality enable it on their installs. Thoughts ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
This is an important discussion. Do note: 1) Bind-to-core is actually the default behavior of many MPIs today. 2) The proposed OMPI bind-to-socket default is less severe. In the general case, it would allow multiple jobs to bind in the same way without oversubscribing any core or socket. (This comment added to the trac ticket.) 3) Defaults (if I understand correctly) can be set differently on each cluster. Ralph Castain wrote: There would be a "-do-not-bind" option that will prevent us from binding processes to anything which should cover that situation. My point was only that we would be changing the out-of-the-box behavior to the opposite of today's, so all those such as yourself would now have to add the -do-not-bind MCA param to your default MCA param file. Doable - but it -is- a significant change in our out-of-the-box behavior. On Sun, Aug 16, 2009 at 2:14 AM, Chris Samuel wrote: - "Terry Dontje" wrote: > I just wanted to give everyone a heads up if they do not get bugs > email. I just submitted a CMR to move over some new paffinity options > from the trunk to the v1.3 branch. https://svn.open-mpi.org/trac/ompi/ticket/1997 Ralph's comments imply that for those sites that share nodes between jobs (such as ourselves, and most other sites that I'm aware of in Australia) these changes will severely impact performance. I think that would be a Very Bad Thing(tm). Can it be something that defaults to being configured out for at least 1.3 please? That way those few sites that can take advantage can enable it whilst the rest of us aren't impacted.
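Eugene's point (3), that defaults can be set per cluster, is normally done through Open MPI's system-wide MCA parameter file. A sketch only: the exact parameter names for the new 1.3.4 binding options depend on what ticket 1997 finally merges, but opal_paffinity_alone is the knob shown elsewhere in this thread:

```
# $prefix/etc/openmpi-mca-params.conf -- system-wide MCA defaults (sketch)
# A site that shares nodes between jobs could pin processor affinity off:
opal_paffinity_alone = 0
```

Individual users can still override this per run via mpirun's -mca flag or environment variables, so a conservative site-wide default doesn't prevent experimentation.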
Re: [OMPI devel] Heads up on new feature to 1.3.4
Hi Chris There would be a "-do-not-bind" option that will prevent us from binding processes to anything which should cover that situation. My point was only that we would be changing the out-of-the-box behavior to the opposite of today's, so all those such as yourself would now have to add the -do-not-bind MCA param to your default MCA param file. Doable - but it -is- a significant change in our out-of-the-box behavior. On Sun, Aug 16, 2009 at 2:14 AM, Chris Samuel wrote: > > - "Terry Dontje" wrote: > > > I just wanted to give everyone a heads up if they do not get bugs > > email. I just submitted a CMR to move over some new paffinity options > > from the trunk to the v1.3 branch. > > Ralphs comments imply that for those sites that share nodes > between jobs (such as ourselves, and most other sites that > I'm aware of in Australia) these changes will severely impact > performance. > > I think that would be a Very Bad Thing(tm). > > Can it be something that defaults to being configured out > for at least 1.3 please ? That way those few sites that > can take advantage can enable it whilst the rest of us > aren't impacted. > > cheers, > Chris > -- > Christopher Samuel - (03) 9925 4751 - Systems Manager > The Victorian Partnership for Advanced Computing > P.O. Box 201, Carlton South, VIC 3053, Australia > VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] Heads up on new feature to 1.3.4
- "Terry Dontje" wrote: > I just wanted to give everyone a heads up if they do not get bugs > email. I just submitted a CMR to move over some new paffinity options > from the trunk to the v1.3 branch. Ralphs comments imply that for those sites that share nodes between jobs (such as ourselves, and most other sites that I'm aware of in Australia) these changes will severely impact performance. I think that would be a Very Bad Thing(tm). Can it be something that defaults to being configured out for at least 1.3 please ? That way those few sites that can take advantage can enable it whilst the rest of us aren't impacted. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
[OMPI devel] Heads up on new feature to 1.3.4
I just wanted to give everyone a heads up if they do not get bugs email. I just submitted a CMR to move over some new paffinity options from the trunk to the v1.3 branch. You can read the gory details in https://svn.open-mpi.org/trac/ompi/ticket/1997 --td