Re: [OMPI devel] [RFC] Low pressure OPAL progress
I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.

I'm assuming you don't mean that you actually call "sleep()", as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.

Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.

HTH
Ralph

On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time - e.g. 5 minutes - stop busy-polling in opal_progress and switch to a usleep-based loop.

Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one).

Where: an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle =

opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely. Going to sleep after a certain amount of time with nothing received is interesting for two things:

- Administrators can easily detect whether a job is deadlocked: all the processes are in sleep().
Currently, all processors are at 100% CPU and it is very hard to know whether progression is still happening or not.
- When there is nothing to receive, power usage is greatly reduced.

However, it could hurt performance in some cases, typically if we go to sleep just before the message arrives. This will depend heavily on the parameters you give to the sleep mechanism. At first, we can start with the following assumption: if the sleep takes T usec, then sleeping after 1xT should slow down receives by a factor of less than 0.01%. However, other processes may suffer from you being late, and be delayed by T usec (which may represent more than 0.01% for them).

So, the goal of this mechanism is mainly to detect far-too-long waits, and it should almost never kick in for normal MPI jobs. It could also trigger a warning message when starting to sleep, or at least a trace in the notifier.

Details of Implementation =

Three parameters fully control the behaviour of this mechanism:
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to low-pressure-powersave mode. Default: 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further unsuccessful call to opal_progress(). Default: 1000 (in us) = 1 ms.

The duration is big enough to make the process show 0% CPU in top, but low enough to preserve a good trigger/duration ratio. The trigger is voluntarily high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, the trigger should be higher than duration * numprocs.

Possible Improvements & Pitfalls

* Trigger could be set automatically at max(trigger, duration * numprocs * 2).
* poll_start and poll_count could be fields of the opal_condition_t struct.
* The sleep section may be exported in a #define and repeated in all the progress paths (I'm not sure my patch is good for progress threads, for example).

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] problem in the ORTE notifier framework
I believe the concern here was that we aren't entirely sure just where you plan to do this. If we are talking about reporting errors, then there is less concern about adding cycles. For example, we already check to see if the IB driver has exceeded the limit on retries - adding more logic to the code that executes when that test is positive is of little concern. However, if we are talking about adding warnings that are not in the error paths, then there is concern, because that code will execute every time, even when there isn't a problem.

There is no issue with using likely() directives, but I'm not sure there is general agreement with your analysis regarding the potential impact of adding such code, and the belief that it only adds one cycle doesn't appear to be supported by our experience to date. Hence the cautions from other developers.

Regardless, it has been our general policy to add this kind of capability on a "configure-in" basis so that those who do not want it are not impacted by it. My proposed method would allow for that policy. Whether you use that approach, or devise your own, I do believe the "configure-in" policy really needs to be used for this capability.

Working on a tmp branch will give developers a chance to evaluate the overall impact and help people in deciding whether or not to enable this capability. I suspect (based on prior similar proposals) that many will choose -not- to enable it (e.g., research clusters in universities), while some (e.g., large production clusters) may well do so, depending on exactly what you are reporting.

HTH
Ralph

On Jun 8, 2009, at 4:57 AM, Sylvain Jeaugey wrote:

Ralph,

Sorry for answering on this old thread, but it seems that my answer was blocked in the "postponed" folder. About the if-then, I thought it was 1 cycle. I mean, if you don't break the pipeline, i.e.
use likely() or builtin_expect() or something like that to be sure that the compiler will generate assembly the right way, it shouldn't be more than 1 cycle, perhaps less on some architectures like Itanium [however, my multi-architecture view is somewhat limited to x86 and ia64, so I may be wrong]. So, in these if-then cases where we know which branch is more likely to be taken, I don't think that 1 CPU cycle is really a problem, especially if we are already in a slow code path. Is there a multi-compiler, multi-arch, multi-OS reason not to use likely() directives?

Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:

While that is a good way of minimizing the impact of the counter, you still have to do an "if-then" to check whether the counter exceeds the threshold. This "if-then" also has to get executed every time, and generally consumes more than a few cycles. To be clear: it isn't the output that is the concern. The output only occurs as an exception case, essentially equivalent to dealing with an error, so it can be "slow". The concern is with the impact of testing to see if the output needs to be generated, as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might seem, and may be close to what you might also have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)

#if WANT_NOTIFIER_VERBOSE
    opal_atomic_increment(counter);
    if (counter > threshold) {
        orte_notifier.api(...);
    }
#endif

You would set the specific thresholds for each situation via MCA params, so this could be tuned to fit specific needs. Those who don't want the penalty can just build normally - those who want this level of information can enable it. We can then see just how much penalty is involved in real-world situations. My guess is that it won't be that big, but it's hard to know without seeing how frequently we actually insert this code.
Hope that makes sense
Ralph

On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey wrote:

About performance, I may be missing something, but our first goal was to track already-slow paths. We imagined that it could be possible to add at the beginning (or end) of this "bad path" just one line that would basically do an atomic inc. So, in terms of CPU cycles, something like 1 for the inc and maybe 1 jump before. Are a couple of cycles really an issue in slow paths (which take at least hundreds of cycles), or do you fear out-of-cache memory accesses - or something else?

As for outputs, they indeed are slow (and can considerably slow down an application if not synchronized), but aggregation on the head node should solve our problems. And if not, we can also disable outputs at runtime. So, in my opinion, no application should notice a difference (unless you tune the framework to output every warning).

Sylvain

On Tue, 26 May 2009, Jeff Squyres wrote:

Nadia --
Re: [OMPI devel] problem in the ORTE notifier framework
Ralph,

Sorry for answering on this old thread, but it seems that my answer was blocked in the "postponed" folder.

About the if-then, I thought it was 1 cycle. I mean, if you don't break the pipeline, i.e. use likely() or builtin_expect() or something like that to be sure that the compiler will generate assembly the right way, it shouldn't be more than 1 cycle, perhaps less on some architectures like Itanium [however, my multi-architecture view is somewhat limited to x86 and ia64, so I may be wrong]. So, in these if-then cases where we know which branch is more likely to be taken, I don't think that 1 CPU cycle is really a problem, especially if we are already in a slow code path. Is there a multi-compiler, multi-arch, multi-OS reason not to use likely() directives?

Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:

While that is a good way of minimizing the impact of the counter, you still have to do an "if-then" to check whether the counter exceeds the threshold. This "if-then" also has to get executed every time, and generally consumes more than a few cycles. To be clear: it isn't the output that is the concern. The output only occurs as an exception case, essentially equivalent to dealing with an error, so it can be "slow". The concern is with the impact of testing to see if the output needs to be generated, as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might seem, and may be close to what you might also have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)

#if WANT_NOTIFIER_VERBOSE
    opal_atomic_increment(counter);
    if (counter > threshold) {
        orte_notifier.api(...);
    }
#endif

You would set the specific thresholds for each situation via MCA params, so this could be tuned to fit specific needs. Those who don't want the penalty can just build normally - those who want this level of information can enable it.
We can then see just how much penalty is involved in real-world situations. My guess is that it won't be that big, but it's hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph

On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey wrote:

About performance, I may be missing something, but our first goal was to track already-slow paths. We imagined that it could be possible to add at the beginning (or end) of this "bad path" just one line that would basically do an atomic inc. So, in terms of CPU cycles, something like 1 for the inc and maybe 1 jump before. Are a couple of cycles really an issue in slow paths (which take at least hundreds of cycles), or do you fear out-of-cache memory accesses - or something else?

As for outputs, they indeed are slow (and can considerably slow down an application if not synchronized), but aggregation on the head node should solve our problems. And if not, we can also disable outputs at runtime. So, in my opinion, no application should notice a difference (unless you tune the framework to output every warning).

Sylvain

On Tue, 26 May 2009, Jeff Squyres wrote:

Nadia --

Sorry I didn't get to jump in on the other thread earlier. We have made considerable changes to the notifier framework in a branch to better support "SOS" functionality: https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

Cisco and Indiana U. have been working on this branch for a while. A description of the SOS stuff is here: https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

As for setting up an external web server with hg, don't bother -- just get an account at bitbucket.org. They're free and allow you to host hg repositories there. I've used bitbucket to collaborate on code before it hits OMPI's SVN trunk with both internal and external OMPI developers. We can certainly move the opal-sos repo to bitbucket (or branch again off opal-sos to bitbucket -- whatever makes more sense) to facilitate collaborating with you.
Back on topic... I'd actually suggest a combination of what has been discussed in the other thread. The notifier can be the mechanism that actually sends the output message, but it doesn't have to be the mechanism that tracks the stats and decides when to output a message. That can be separate logic, and therefore be more fine-grained (and potentially even specific to the MPI layer).

The Big Question will be how to do this with zero performance impact when it is not being used. This has always been the difficult issue when trying to implement any kind of monitoring inside the core OMPI performance-sensitive paths. Even adding individual branches has met with resistance (in performance-critical code
Re: [OMPI devel] Multi-rail on openib
Hi Tom,

Yes, there is a goal in mind, and it is definitely not performance: we are working on device failover, i.e. when a network adapter or switch fails, use the remaining one. We don't intend to improve performance with multi-rail (which, as you said, will not happen unless you have a DDR card with PCI Express 8x Gen2 and very nice routing - and money to pay for the doubled network :)). The goal here is to use port 1 of each card as the primary way of communication with a fat tree, and port 2 as a failover solution with a very light network, just to avoid aborting the MPI app, or at least to reach a checkpoint.

Don't worry, another team is working on opensm, so that routing stays optimal. Thanks for your warnings, however; it's true that a lot of people see these "double port IB cards" as "doubled performance".

Sylvain

On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:

On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:

See this FAQ entry for a description: http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup

Right now, there's no way to force a particular connection pattern on the openib btl at run-time. The startup sequence has gotten sufficiently complicated / muddied over the years that it would be quite difficult to do so. Pasha is in the middle of revamping parts of the openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be desirable to fully clean up the full openib btl startup sequence when he's all finished.

On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:

Hi all,

I am working on multi-rail IB and I was wondering how connections are established between ports. I have two hosts, each with 2 ports on the same IB card, connected to the same switch.

Is there a goal in mind? In general, multi-rail cards run into bandwidth and congestion issues with the host bus. If your card's system-side interface cannot support the bandwidth of twin IB links, then it is possible that bandwidth would be reduced by the interaction.
If the host bus and memory system are fast enough, then work with the vendor. In addition to system bandwidth, the subnet manager may need to be enhanced to be multi-port-card aware. Since IB fabric routes are static, it is possible to route or use pairs of links in an identical enough way that there is little bandwidth gain when multiple switches are involved.

Your two-host case may be simple enough to explore and/or generate illuminating or misleading results. It is a good place to start. Start with a look at opensm and the fabric, then watch how Open MPI or your applications use the resulting LIDs. If you are using IB directly and not MPI, then the list of protocol choices grows dramatically, but it still centers on LIDs as assigned by the subnet manager (see opensm).

How many CPU cores (ranks) are you working with? Do be specific about the IB hardware and associated firmware; there are multiple choices out there and the vendor may be able to help...

--
T o m  M i t c h e l l
Found me a new hat, now what?