Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, Aug 9, 2014 at 6:04 AM, Mike Galbraith wrote: > You are not going to convince me that it is cool to assign an imaginary > priority to a SCHED_FIFO class task, and still call the resulting mutant > a SCHED_FIFO class task. Those things have defined semantics. It is > not ok for a SCHED_FIFO task of a lower priority to completely ignore a > SCHED_FIFO task of a higher priority because it's part of an application > which has one or more wild cards

I am not quite sure what you were trying to say by this, but I am grateful to you for expressing your perspective and arguments in this discussion. For one thing, it stimulated me to perform a systematic review of the solution space/tree, square inch by square inch, resulting in exhaustive logical coverage of the solution space and thus an informal logical proof that there is no uncovered subspace that might hold an as-yet-undiscovered solution. Just so the results do not remain scattered, I will try to summarize and categorize the findings about available solutions to the general "critical section (or critical activity) preemption" problem category, and then draw a conclusion stemming from this summary.

** 1) Post-preemption solutions. (Such as priority inheritance or proxy execution, or even humble yield_to). By the time a post-preemption solution is engaged, a cost has already been paid, and further cost is paid to execute the solution. Sometimes this aggregate cost can be small and acceptable within the overall context of an application, sometimes it can be too high. Sometimes the dependency chain cannot be expressed and thus a post-preemption solution is not possible at all (other than just waiting for the scheduler to eventually resume the blocking thread while continuing to pay the cost in the meanwhile). I won't be further enumerating this subcategory here and will focus instead on the space of preemption avoidance/deferral solutions.

** 2) Reduction in the incidence of thread preemption. There are two chief ways to reduce the incidence of thread preemption. One way is by enlarging the scheduling quantum. Its simplistic use however may well backfire: if a lock holder is preempted while holding a lock (or a thread executing non-lock-related critical activity is preempted while inside this activity), an enlarged scheduling quantum would result in a longer delay before the thread gets resumed. Therefore this measure, if employed, should really be coupled with one of the post-preemption solutions. Another way is the reduction of thread preemption incidence due to another thread being woken up. Currently existing schemes of this kind (use of !WAKEUP_PREEMPTION or SCHED_BATCH) are indiscriminate and victimize the woken-up thread for the sake of the executing thread regardless of whether the latter is inside a critical section or not; however, means could be provided to refine this behavior and victimize the wakee only when the currently executing thread is actually inside a critical section. (And also, when the section is exited, to switch over to the postponed wakee ASAP.) A variation on this theme is an attempt at mitigation of preemption effects via LAST_BUDDY.

The extent to which this scheme will be effective or not is obviously determined by the following factors. First, the probability of thread preemption while it is holding a lock, which in turn is determined by: (a) Percentage of time a worker thread holds a lock (or executes non-lock related critical activity). 
It does not matter too much whether a thread enters a critical section one time, stays in it for a while and then exits, or whether the thread enters and exits a critical section many times in a row, staying within it each time only for a very short instant. What matters is the aggregate percentage of execution time a thread spends while inside a critical section. This yields a rough probability that any random preemption of a thread (either through natural expiration of the scheduling timeslice or due to preemption by a wakee) will result in lock holder or other critical activity preemption. (b) To an extent, request handler execution time. Very short times (well under a typical scheduling quantum) separated by voluntary yielding (such as when waiting on the queue to pick up the next processing request) would result in reduced preemption probability due to end-of-quantum events; however, I won't be further considering this special case here. (It is also not a particularly good idea to use FIFO ordering for dequeueing of events or requests from a queue by worker threads, as opposed to LIFO ordering, as it negatively impacts cache locality and increases context switch probability.) (c) Reduction in preemption attempt frequency. That's where the impact of the discussed scheme comes in. With the scheme on, a thread gets preempted chiefly on quantum expiration, but not intra-quantum in favor of wakee threads. The product of probabilities (a) and (c) determines how likely the thre
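For concreteness, the SCHED_BATCH variant mentioned above can be pictured with a minimal userspace sketch. It is illustrative only and shows the existing, indiscriminate mechanism rather than the refined critical-section-aware behavior suggested above: if all worker threads of an application run under SCHED_BATCH, a newly woken worker does not wakeup-preempt a running sibling that may be holding a lock, so preemption happens chiefly at quantum expiration.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Move the calling thread into SCHED_BATCH.  On Linux, pid 0 means the
     * calling thread; SCHED_BATCH requires sched_priority to be 0.  Undo with
     * SCHED_OTHER for threads that need low wakeup latency. */
    static int worker_enter_batch_mode(void)
    {
            struct sched_param sp = { .sched_priority = 0 };

            if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1) {
                    perror("sched_setscheduler(SCHED_BATCH)");
                    return -1;
            }
            return 0;
    }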
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, 2014-08-10 at 05:13 +0200, Mike Galbraith wrote: > On Sat, 2014-08-09 at 20:04 +0200, Andi Kleen wrote: > > > NAK. There it is, my imaginary NAK to imaginary realtime priorities :) > > > > Ok, but do you have any alternative proposal yourself how to solve the > > lockholder preemption problem? I assume you agree it's a real problem. > > > > Just being negative is not very constructive. > > I both acknowledged the problem, and made alternative > suggestions. Note who I replied to https://lkml.org/lkml/2014/7/28/77 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, 2014-08-09 at 20:04 +0200, Andi Kleen wrote: > > NAK. There it is, my imaginary NAK to imaginary realtime priorities :) > > Ok, but do you have any alternative proposal yourself how to solve the > lockholder preemption problem? I assume you agree it's a real problem. > > Just being negative is not very constructive. I both acknowledged the problem, and made alternative suggestions. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
> NAK. There it is, my imaginary NAK to imaginary realtime priorities :) Ok, but do you have any alternative proposal yourself how to solve the lockholder preemption problem? I assume you agree it's a real problem. Just being negative is not very constructive. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, 2014-08-09 at 01:38 -0700, Sergey Oboguev wrote: > On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith > wrote: > > > I see subversion of a perfectly functional and specified mechanism > > Just wondering if the following line of thinking would sound just as much an > anathema from your perspective or perhaps a bit less terrible... > > Proceeding from the observations (e.g. https://lkml.org/lkml/2014/8/8/492) > that > representative critical section information is not pragmatically expressible > at > development time or dynamically collectable by the application at run time, > the > option still remains to put the weight of managing such information on the > shoulders of the final link in the chain, the system administrator, providing > him with application-specific guidelines and also with monitoring tools. > > It might look approximately like this. > > It might be possible to define the scheduling class or some other kind of > scheduling data entity for the tasks utilizing preemption control. The tasks > belonging to this class and having critical section currently active are > preemptible by RT or DL tasks just like normal threads, however they are > granted a limited and controlled degree of protection against preemption by > normal threads, and also limited ability to urgently preempt normal threads on > a wakeup. Sure, a completely different scheduling class can implement whatever semantics it wants, just has to be useful and not break the world. (spins lots of complexity) > This is not suggest any particular interface of course, but just a crude > sketch > of a basic approach. I am wondering if you would find it more agreeable within > your perspective than the use of RT priorities, or still fundamentally > disagreeable. Yes, the crux of my objection is the subversion in my view of RT. > (Personally I am not particularly thrilled by the complexity that would have > to be added and managed.) (yeah, you described lots of that) -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Fri, 2014-08-08 at 13:11 -0700, Sergey Oboguev wrote: > On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith > wrote: > > > task priority cannot be used by any task to describe a critical section. > > I assert that there is _zero_ critical section information present. > > This appears to be the crux of our disagreement. > > This assertion is incorrect. The use of RT to bracket a critical section > obviously _does_ provide the following information: You sure don't give up easy. > 1) designation of entry point for the start of critical activity Yup, a task can elevate its priority upon entering scram_reactor(), iff it gets there, scram might still be considered a critical activity. > 2) designation of exit point, albeit with timing not known in advance at entry > time Yeah, exit works ok if enter happens. You are not going to convince me that it is cool to assign an imaginary priority to a SCHED_FIFO class task, and still call the resulting mutant a SCHED_FIFO class task. Those things have defined semantics. It is not ok for a SCHED_FIFO task of a lower priority to completely ignore a SCHED_FIFO task of a higher priority because it's part of an application which has one or more wild cards, or maybe even a get out of jail free card it can pull out of its butt. NAK. There it is, my imaginary NAK to imaginary realtime priorities :) -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith wrote: > I see subversion of a perfectly functional and specified mechanism Just wondering if the following line of thinking would sound just as much an anathema from your perspective or perhaps a bit less terrible... Proceeding from the observations (e.g. https://lkml.org/lkml/2014/8/8/492) that representative critical section information is not pragmatically expressible at development time or dynamically collectable by the application at run time, the option still remains to put the weight of managing such information on the shoulders of the final link in the chain, the system administrator, providing him with application-specific guidelines and also with monitoring tools. It might look approximately like this. It might be possible to define the scheduling class or some other kind of scheduling data entity for the tasks utilizing preemption control. The tasks belonging to this class and having critical section currently active are preemptible by RT or DL tasks just like normal threads, however they are granted a limited and controlled degree of protection against preemption by normal threads, and also limited ability to urgently preempt normal threads on a wakeup. Tasks inside this class may belong to one of the groups somewhat akin to cgroups (perhaps may be even implemented as an extension to cgroups). The properties of a group are: * Maximum critical section duration (ms). This is not based on actual duration of critical sections for the application and may exceed it manyfold. The purpose is merely to be a safeguard against the runaways. If a task stays inside a critical section longer than the specified time limit, it loses the protection against the preemption and becomes for practical purposes a normal thread. The group keeps a statistics of how often the tasks in the group overstay in critical section and exceed the specified limit. * Percentage of CPU time that members of the group can collectively spend inside their critical sections over some sampling interval while enjoying the protection from preemption. This is the critical parameter. If group members collectively spend larger share of CPU time in their critical sections exceeding the specified limit, they start losing protection from preemption by normal threads, to keep their protected time within the quota. For example the administrator may designate that threads in group "MyDB" can spend no more than 20% of system CPU time combined in the state of being protected from preemption, while threads in group "MyVideoEncoder" can spend not more than 10% of system CPU time in preemption-protected state. If actual aggregate critical-section time spent by threads in all the groups and also by RT tasks starts pushing some system-wide limit (perhaps rt_bandwidth), available group percentages are dynamically scaled down, to reserve some breathing space for normal tasks, and to depress groups in some proportional way. Scaling down can be either proportional to the group quotas, or can be controlled by separate scale-down weights. A monitoring tool can display how often the tasks in the group requesting protection from preemption are not granted it or lose it because of overdrafting the group quota. System administrator may then either choose to enlarge the group's quota, or leave it be and accept the application running sub-optimally. 
An application can also monitor the statistics on rejection of preemption protection for its threads (and also actual preemption frequency while inside a declared critical section state) and if the rate is high then issue an advisory message to the administrator. Furthermore: Threads within a group need to have relative priorities. There should be a way for a thread processing a highly critical section to be favored over a thread processing medium-significance critical section. There should also be a way to change a thread's group-relative priority both from the inside and from the outside of a thread. If thread A queues an important request for processing by thread B, A should be able to bump B's group-relative priority. Thread having non-zero group-relative priority is considered to be within a critical section. If thread having non-zero group-relative priority is woken up, it preempts normal thread, as long as the group's critical section time usage is within the group's quota. The tricky thing is how to correlate priority ranges of different groups. I.e. suppose there is a thread T1 belonging to group APP1 with group-relative priority 10 within APP1 and a thread T2 belonging to group APP2 with group-relative priority 20 within APP2. Which thread is more important and should run first? Perhaps this can be left to system administrator who can adjust "base priority" property of a group thus sliding groups upwards or downwards relative to each other (or, generally, using some form of intergroup priority mapping control). This is not suggest any particul
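For concreteness, the per-group knobs sketched above might be pictured roughly as follows. This is purely illustrative, kernel-style C: none of these structures, fields or names exist anywhere; they only restate the proposal.

    #include <linux/types.h>

    /* Hypothetical per-group policy and accounting for the scheme described above. */
    struct csprot_group {
            char    name[32];               /* e.g. "MyDB", "MyVideoEncoder"          */

            /* administrator-set policy */
            u32     max_cs_duration_ms;     /* runaway safeguard: a task overstaying
                                               this long loses its protection          */
            u32     cpu_quota_pct;          /* share of CPU time the group may spend
                                               preemption-protected per sampling window */
            int     base_prio;              /* slides the whole group up or down
                                               relative to other groups                */

            /* accounting, maintained by the scheduler */
            u64     protected_ns_in_window; /* protected CPU time used so far          */
            u64     overstay_count;         /* tasks exceeding max_cs_duration_ms      */
            u64     denied_count;           /* protection refused or revoked on quota  */
    };

    /* Each member task would additionally carry a group-relative priority
     * (non-zero meaning "inside a critical section"), settable from inside
     * or outside the thread, as described above. */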
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith wrote: > task priority cannot be used by any task to describe a critical section. > I assert that there is _zero_ critical section information present.

This appears to be the crux of our disagreement. This assertion is incorrect. The use of RT to bracket a critical section obviously _does_ provide the following information:

1) designation of entry point for the start of critical activity

2) designation of exit point, albeit with timing not known in advance at entry time

3) priority value which embodies a developer's assessment of the importance of this critical activity relative to other critical activities within the application or coherent application set, and thus a statement about the cost of the activity's preemption for this application or application set

What priority protection _does not_ provide is:

1) advance (at entry time) estimate of the duration of the activity

2) the measure of preemption cost in "objective" (uniform) units that would span across unrelated applications

3) and in units that can be easily stacked against the policy-specified cost of deferring and penalizing normal tasks for too long (although to an extent in the RT use case this is done by rt_bandwidth)

The implication of (2) and (3) is that one cannot easily have a system management slider saying "penalize application A for the benefit of application B (or default undesignated applications) by giving weights so-and-so to their cost factors"... Well, in a way and to an extent one can do it by remapping priority ranges for tasks A and B, however since the priority interface was not designed for this from the ground up, that would be cumbersome.

The provisioning of this missing information however is not realistically possible outside of the simplest use cases run in a fixed configuration under a fixed workload, as I elaborated in the previous message. Outside of such unrealistic "lab" samples, even in the case of the simplest cost function the estimates of T and X are not pragmatically obtainable. The estimates of Y and T2 (cut-off point for Y accrual) are likewise hard or even harder to obtain. Thus even the simplest cost function cannot be pragmatically provided, let alone any more complex cost function. The issue is not with inventing some sophisticated cost function and system interface to plug it in. The issue is that the valid input data that the cost function would need to rely on is not pragmatically/economically obtainable. (And this issue is not pragmatically solvable through dynamically collecting such data at run time either.)

Therefore there is no pragmatically usable "better" solution in the domain of preemption-deferral solutions (as opposed to post-preemption solutions) still waiting to be discovered. It stands to reason that the choice in this domain is really, and will always pragmatically remain, between using the long-existing techniques (i.e. priority protection or schedctl-style preempt delay, with the latter having the limitations I outlined earlier) or not using anything at all. - Sergey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
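To restate the missing information in concrete form: the simplest cost function discussed here (T, X, Y and the cut-off T2) would amount to a descriptor like the one below. This is not an existing or proposed ABI; it only illustrates what a task would have to be able to fill in, and the units of "cost" are exactly the problem, since no uniform cross-application unit exists.

    #include <stdint.h>

    /* Hypothetical per-section preemption-cost hint: "if you preempt me within
     * the next T microseconds, you incur cost X upfront, accruing further at
     * rate Y until cut-off T2".  Illustrative only. */
    struct preempt_cost_hint {
            uint64_t window_us;          /* T:  expected remaining duration of the
                                                critical activity                    */
            uint64_t upfront_cost;       /* X:  one-time cost of preempting now     */
            uint64_t accrual_rate;       /* Y:  additional cost per microsecond the
                                                preemption lasts                     */
            uint64_t accrual_cutoff_us;  /* T2: point past which Y stops accruing   */
    };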
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Wed, 2014-08-06 at 18:26 -0700, Sergey Oboguev wrote: > On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith > wrote: (ok, seems you're not addressing the reasonable, rather me;) > The only reason why anyone would want to use DPRIO instead of regular nice in > this case is because it might be unknown beforehand whether the job will be > short or might take a longer time, with majority of work items being very > short > but occasionally taking longer. In this case using DPRIO would let to cut the > overhead for majority of section instances. To reiterate, this is a marginal > and most likely rare use case, but given the existence of uniform interface > I just do not see why to block it on purpose. Hey, if I had a NAK stamp handy, your patch would be wearing one ;-) > > If some kthread prioritizes _itself_ and mucks up application > > performance, file a bug report, that kthread is busted. Anything a user > > or application does with realtime priorities is on them. > > kthreads do not need RT, they just use spinlocks ;-) That's wrong if kthreads feed your load in any way, but your saying that implies to me that your argument that the kernel is the initiator is void. You are not fighting a kernel issue, you have userspace issues. The prioritization mechanism works fine, but you want to subvert it to do something other than what it is designed and specified to do. Whereas you, having written this patch, see "enhancement", I see "subversion" when I read it. Here lies our disagreement. > On a serious note though, I am certainly not saying that injudicious use of RT > (or even nice) cannot disrupt the system, but is it reason enough to summarily > condemn the judicious use as well? Bah, if Joe User decides setiathome is worthy of SCHED_FIFO:99, it is by definition worthy of SCHED_FIFO:99. I couldn't care less, it's none of my business what anybody tells their box to do. > >> I disagree. The exact problem is that it is not a developer who initiates > >> the > >> preemption, but the kernel or another part of application code that is > >> unaware > >> of other thread's condition and doing it blindly, lacking the information > >> about > >> the state of the thread being preempted and the expected cost of its > >> preemption > >> in this state. DPRIO is a way to communicate this information. > > > What DPRIO clearly does NOT do is to describe critical sections to the > > kernel. > > First of all let's reflect that your argument is not with DPRIO as such. DPRIO > after all is not a separate scheduling mode, but just a method to reduce the > overhead of regular set_priority calls (i.e. sched_setattr & friends). I see subversion of a perfectly functional and specified mechanism. > You argument is with the use of elevated priority as such, and you are saying > that using RT priority range (or high nice) does not convey to the kernel the > information about the critical section. I maintain that task priority does not contain any critical section information whatsoever. Trying to use priority to delineate critical sections is a FAIL, you must HAVE the CPU to change your priority. "Time for me to boost myself.. hey, wtf happened to the time!?!" That's why you subverted the mechanism to not perform the specified action.. at all, not merely not at this precise instant, because task priority cannot be used by any task to describe a critical section. > I do not agree with this, not wholly anyway. 
First of all, it is obvious that > set_priority does convey some information about the section, so perhaps a more > accurate re-formulation of your argument could be that it is imperfect, > insufficient information. I assert that there is _zero_ critical section information present. Create a task, you create a generic can, giving it a priority puts that can on a shelf. Can content is invisible to the kernel, it can't see a critical section of the content, or whether the content as a whole is a nicely wrapped up critical section of a larger whole. There is no section anything about this, what you have is a generic can of FOO on a shelf BAR. If you need a way to describe userspace critical sections, make a way to identify userspace critical sections. IMHO, task priority ain't it, that's taken, and has specified semantics. > Let's try to imagine then what could make more perfect information. It > obviously should be some cost function describing the cost that would be > incurred if the task gets preempted. Something that would say (if we take the > simplest form) "if you preempt me within the next T microseconds (unless I > cancel or modify this mode), this preemption would incur cost X upfront > further > accruing at a rate Y". You can build something more complex, but the basic bit of missing information appears to me to be plain old enter/exit. > One issue I see with this approach is that in real life it might be very hard > for a developer to quantify the values for X, Y and T. Developer can ea
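The "plain old enter/exit" information referred to above is what schedctl-style preempt-delay schemes communicate. A rough userspace sketch of such a scheme is shown below; the structure, the field names and the mmap()ed channel are hypothetical, and this is not the interface of the preemption-control patch referenced elsewhere in this thread.

    #include <sched.h>

    /* Hypothetical page shared between a thread and the kernel: the thread
     * marks critical-section entry/exit, the kernel may briefly defer an
     * end-of-quantum preemption while the flag is set, and asks the thread
     * to yield as soon as it exits. */
    struct preempt_delay_ctl {
            volatile int delay_requested;   /* set by userspace      */
            volatile int preempt_pending;   /* set by the kernel     */
    };

    static struct preempt_delay_ctl *ctl;   /* assumed mmap()ed from a kernel-provided fd */

    static inline void cs_enter(void)
    {
            ctl->delay_requested = 1;
    }

    static inline void cs_exit(void)
    {
            ctl->delay_requested = 0;
            if (ctl->preempt_pending)
                    sched_yield();          /* repay the granted deferral immediately */
    }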
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith wrote: >> > SCHED_NORMAL where priority escalation does not work as preemption proofing >> >> Remember, DPRIO is not for lock holders only. >> >> Using DPRIO within SCHED_NORMAL policy would make sense for an application >> that >> has "soft" time-urgent section where it believes strong protection >> from preemption >> is not really necessary, and just a greater claim to CPU time share would do, >> in cases where the application does not know beforehand if the section will >> be >> short or long, and in majority of cases is short (sub-millisecond), but >> occasionally can take longer. > > Every single time that SCHED_NORMAL task boosts its priority (nice) > during a preemption, the math has already been done, vruntime has > already been adjusted. > Sure, when it gets the CPU back, its usage will > be weighed differently, it will become more resistant to preemption, but > in no way immune. There is nothing remotely deterministic about this, > making it somewhat of an oxymoron when combined with critical section. But you overlooked the point I was trying to convey in the paragraph you are responding to. Apart from SCHED_NORMAL being a marginal use case, if it is used at all, I do not see it being used for lock-holding or similar critical section where an application wants to avoid the preemption. I can see DPRIO(SCHED_NORMAL) being used in the same cases as an application would use nice for a temporary section, i.e. when it has a job that needs to be processed relatively promptly over some time interval but not really super-urgently and hard guarantees are not needed, i.e. when the application simply wants to have an improved claim for CPU resources compared to normal threads over let us say the next half-second or so. It is ok if the application gets preempted, all it cares about is a longer timeframe ("next half-second") rather than shorter and immediate timeframe ("next millisecond"). The only reason why anyone would want to use DPRIO instead of regular nice in this case is because it might be unknown beforehand whether the job will be short or might take a longer time, with majority of work items being very short but occasionally taking longer. In this case using DPRIO would let to cut the overhead for majority of section instances. To reiterate, this is a marginal and most likely rare use case, but given the existence of uniform interface I just do not see why to block it on purpose. > If some kthread prioritizes _itself_ and mucks up application > performance, file a bug report, that kthread is busted. Anything a user > or application does with realtime priorities is on them. kthreads do not need RT, they just use spinlocks ;-) On a serious note though, I am certainly not saying that injudicious use of RT (or even nice) cannot disrupt the system, but is it reason enough to summarily condemn the judicious use as well? >> I disagree. The exact problem is that it is not a developer who initiates the >> preemption, but the kernel or another part of application code that is >> unaware >> of other thread's condition and doing it blindly, lacking the information >> about >> the state of the thread being preempted and the expected cost of its >> preemption >> in this state. DPRIO is a way to communicate this information. > What DPRIO clearly does NOT do is to describe critical sections to the > kernel. First of all let's reflect that your argument is not with DPRIO as such. 
DPRIO after all is not a separate scheduling mode, but just a method to reduce the overhead of regular set_priority calls (i.e. sched_setattr & friends). You argument is with the use of elevated priority as such, and you are saying that using RT priority range (or high nice) does not convey to the kernel the information about the critical section. I do not agree with this, not wholly anyway. First of all, it is obvious that set_priority does convey some information about the section, so perhaps a more accurate re-formulation of your argument could be that it is imperfect, insufficient information. Let's try to imagine then what could make more perfect information. It obviously should be some cost function describing the cost that would be incurred if the task gets preempted. Something that would say (if we take the simplest form) "if you preempt me within the next T microseconds (unless I cancel or modify this mode), this preemption would incur cost X upfront further accruing at a rate Y". One issue I see with this approach is that in real life it might be very hard for a developer to quantify the values for X, Y and T. Developer can easily know that he wants to avoid the preemption in a given section, but actually quantifying the cost of preemption (X, Y) would take a lot of effort (benchmarking) and furthermore really cannot be assigned statically, as the cost varies depending on the load pattern and site-specific configuration. Furthermore, when dealing with multiple
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Wed, 2014-08-06 at 07:41 +0200, Mike Galbraith wrote: > You're reading me entirely wrong, I'm not trying to discourage you from > inventing a better bullet, I just think this particular bullet is a dud. Anyway, I'll try to assume you're talking to the _reasonable_ people on this list in any reply. The chance of penetrating my thick skull once I've made up my mind approaches zero ;-) -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Tue, 2014-08-05 at 16:28 -0700, Sergey Oboguev wrote: > On Sun, Aug 3, 2014 at 2:56 AM, Mike Galbraith > wrote: > > > SCHED_NORMAL where priority escalation does not work as preemption proofing > > Remember, DPRIO is not for lock holders only. > > Using DPRIO within SCHED_NORMAL policy would make sense for an application > that > has "soft" time-urgent section where it believes strong protection > from preemption > is not really necessary, and just a greater claim to CPU time share would do, > in cases where the application does not know beforehand if the section will be > short or long, and in majority of cases is short (sub-millisecond), but > occasionally can take longer. Every single time that SCHED_NORMAL task boosts its priority (nice) during a preemption, the math has already been done, vruntime has already been adjusted. The boat has already sailed, and your task is waving from the dock. All this does is waste cycles and add yet more math to the done deal. Sure, when it gets the CPU back, its usage will be weighed differently, it will become more resistant to preemption, but in no way immune. There is nothing remotely deterministic about this, making it somewhat of an oxymoron when combined with critical section. Add group scheduling to the mix, and this becomes even more pointless. > > what I see is a programmer using a mechanism designed > > to initiate preemption arming his tasks with countermeasures to the very > > thing he initiates. > > I disagree. The exact problem is that it is not a developer who initiates the > preemption, but the kernel or another part of application code that is unaware > of other thread's condition and doing it blindly, lacking the information > about > the state of the thread being preempted and the expected cost of its > preemption > in this state. DPRIO is a way to communicate this information. Nope, I'm still not buying. It's a bad idea to play with realtime classes/priorities blindly, if someone does that, tough titty. What DPRIO clearly does NOT do is to describe critical sections to the kernel. If some kthread prioritizes _itself_ and mucks up application performance, file a bug report, that kthread is busted. Anything a user or application does with realtime priorities is on them. Stuffing a priority card up selected tasks sleeves does not describe critical sections to the kernel, it only lets the programmer stymie himself, some other programmer... or perhaps the user. > > Deferred preempt seems to be what you want, but you > > invented something very different. > > "Simple" deferred preempt is one use case. > > More complex case is "ranked" deferred preemption, when there are multiple > contending contexts, and there is a need to express relative costs of > victimizing one vs. another. You didn't invent either of those. Hm. I don't even recall seeing the task pulling a card from its sleeve checking to see if the other player didn't have a joker up its sleeve. > > I'm just not seeing the beauty in your patch. > > Perhaps I might have a dissenting sense of beauty. > But then, it is not only about the beauty (which is subjective anyway), but > even more so about the pudding. > > Seriously though, it's really simple: the whole range of available remedies is > divided across post-preemption solutions and preemption-avoidance solutions > (and of course designing app structure for minimizing the contention in the > first place, but for the sake of this discussion we can assume this had been > done to the extent possible). 
Post-preemption solutions unavoidably incur cost > (and a part of this cost is incurred even before the solution can be engaged). > If this cost can be maintained insignificant for the given use case, great. > However what do you propose to do with those use cases where it cannot? To > tell > a developer (or IT manager) "we do not care about your 5% or 20% losses, and > if > you do not like it, use another OS that would work better for you"? This would > not sound too productive to me. I didn't suggest ignoring of any problems. I know full well that folks doing Oracle/SAP stuff can in fact make prettier numbers using realtime class. I also know they can get themselves into trouble, those troubles having inspired bugzilla to pull my chain. I didn't say "go away", I created a hack to _enable_ them to turn their pet piggies loose in god mode. I didn't try to disarm or dissuade them from aiming the rt pistol at their own toes, I gave them toe-seeking bullets. You're reading me entirely wrong, I'm not trying to discourage you from inventing a better bullet, I just think this particular bullet is a dud. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, Aug 3, 2014 at 2:56 AM, Mike Galbraith wrote: > SCHED_NORMAL where priority escalation does not work as preemption proofing Remember, DPRIO is not for lock holders only. Using DPRIO within SCHED_NORMAL policy would make sense for an application that has "soft" time-urgent section where it believes strong protection from preemption is not really necessary, and just a greater claim to CPU time share would do, in cases where the application does not know beforehand if the section will be short or long, and in majority of cases is short (sub-millisecond), but occasionally can take longer. > what I see is a programmer using a mechanism designed > to initiate preemption arming his tasks with countermeasures to the very > thing he initiates. I disagree. The exact problem is that it is not a developer who initiates the preemption, but the kernel or another part of application code that is unaware of other thread's condition and doing it blindly, lacking the information about the state of the thread being preempted and the expected cost of its preemption in this state. DPRIO is a way to communicate this information. > Deferred preempt seems to be what you want, but you > invented something very different. "Simple" deferred preempt is one use case. More complex case is "ranked" deferred preemption, when there are multiple contending contexts, and there is a need to express relative costs of victimizing one vs. another. > I'm just not seeing the beauty in your patch. Perhaps I might have a dissenting sense of beauty. But then, it is not only about the beauty (which is subjective anyway), but even more so about the pudding. Seriously though, it's really simple: the whole range of available remedies is divided across post-preemption solutions and preemption-avoidance solutions (and of course designing app structure for minimizing the contention in the first place, but for the sake of this discussion we can assume this had been done to the extent possible). Post-preemption solutions unavoidably incur cost (and a part of this cost is incurred even before the solution can be engaged). If this cost can be maintained insignificant for the given use case, great. However what do you propose to do with those use cases where it cannot? To tell a developer (or IT manager) "we do not care about your 5% or 20% losses, and if you do not like it, use another OS that would work better for you"? This would not sound too productive to me. This leaves then preemption-avoidance solutions which are about communicating the cost of the preemption to the kernel in one way or another. DPRIO is one of such solutions, and I believe is as good or better than the others (I explained earlier in which cases it is better than schedctl). (Seemingly transcending this divide may only be the delegation of scheduling decisions from the kernel to the userspace via mechanisms attempted in a number of experimental OSes and likely to see the light of the day again with the expected raise of library OSes/unikernels, but at the end of the day this is still a set of preemption-avoidance vs. post-preemption solutions, just boxed differently, and utilizing greater application state knowledge bandwidth available within the single address space.) I hear you dislike, but I think it is misplaced. 
I am unsure about the exact origins of your outlook, but I suspect it might possibly be arising at least in part from an aspiration for finding a "holy grail" solution that would cover all the cases (after all, kernel developers tend to be psychologically perfectionists, at least within this little hobby of kernel development), and while such an aspiration is admirable per se, but unfortunately this holy grail just does not exist. Likewise, an aspiration to make sure that everything "mixes and matches" with absolutely everything else in all cases while it may be admirable but is futile in its absolutistic form because there are obvious limits to it, and furthermore applications that need to rely on preemption-avoidance are most often primary-importance applications for the installation, and as such mixing and matching with absolutely everything else is not essential and definitely not a priority compared to letting the primary application to run better. Furthermore, many if not most of primary-importance applications deployments today are already single-purposed, and with continued VM sprawl this trend would only grow, and with it the emphasis on letting a primary application or practical well-defined application set run better over an absolutistic arbitrary mixing and matching will grow as well. This of course certainly not to deny the importance of mix-and-match environments such as desktop systems or personal devices, but it is important to recognize the difference in requirements/priorities between the environments. - Sergey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kern
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, Aug 3, 2014 at 10:30 AM, Andi Kleen wrote: >> (One example might be virtual machine that runs guest operating system that >> is >> not paravirtualized or can be paravirtualized only to a limited extent. The >> VM >> might guess that preemption of VCPU thread that is processing events such as >> IPI interrupts, clock interrupts or certain device interrupts is likely to >> cause overall performance degradation due to other VCPUs spinning for an IPI >> response or spinning waiting for a spinlock, and thought the general kinds of >> these dependencies may be foreseen, but actual dependency chains between >> VCPUs >> cannot be established at run time.) > > PAUSE loop exiting (PLE) can handle it in limited fashion on VMs > even without paravirtualization. The key word is "limited". Sometimes the use of PLE can help to a limited (indeed, see below) extent, and sometimes it can even be counter-productive, which in turn can also be mitigated (http://lwn.net/Articles/424960/) but again in a limited fashion. This indeed is similar to the spinlock example discussed in the previous message. Similarly to PI/PE, the use of PLE and other spin-loop detection techniques is about trying to minimize the cost of after-inopportune-preemption actions to whatever (unavoidably limited, that's the key thing) extent they can be minimized, whereas priority protection is about the avoidance of incurring these costs in the first place. And then, x86 VMs are just one use case, whereas there are others with no PLE or similar recourses. DPRIO as a matter of fact happened to be conceived within the context of a project running a legacy non-x86 OS that does not have pause in the loops, and the loops are scattered throughout the pre-existing binary code in huge numbers and are not pragmatically instrumentable. But this just exemplifies a general pattern of having to deal with concurrency in a component being driven from or driving another component that is for practical purposes "set in stone" (3rd party, legacy, out of the scope of the project etc.) - Sergey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
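For reference, PLE keys off the PAUSE instruction that a well-behaved x86 guest places in its spin-wait loops, roughly as in the sketch below (x86-specific, illustrative only). A legacy guest whose loops lack the pause, as described above, gives the hardware nothing to detect.

    #include <stdatomic.h>
    #include <immintrin.h>   /* _mm_pause(), x86 only */

    /* Typical guest-side spin-wait: the PAUSE inside the loop is what lets
     * PLE-capable hardware notice an excessively spinning VCPU and exit to
     * the hypervisor. */
    static void spin_wait_until_zero(atomic_int *lock_word)
    {
            while (atomic_load_explicit(lock_word, memory_order_acquire) != 0)
                    _mm_pause();
    }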
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, Aug 3, 2014 at 1:30 AM, Pavel Machek wrote: > it seems to be a security issue to me. > If root renices the application to high nice value, application should > not be able to work around it by the DPRIO interface. There is no such issue. Since 2.6.12, Linux does allow a task that had been renice'd to increase its priority back as long as it is within RLIMIT_NICE. See the man page for nice(2) and the reference there under EPERM to RLIMIT_NICE. Or sys_nice(...) and can_nice(...) in kernel/sched/core.c. If the administrator wants to clamp down the task in a way it would be unable to come back, he should change the task's rlimit for RLIMIT_NICE. DPRIO does honor the change in RLIMIT_NICE (as well as in RLIMIT_RTPRIO) and won't let a task cross those limits. > You mean "we rely on applications handling the situation they can't and will not handle"? Not really. There are two different cases to be considered. One is when an application's thread priority is changed from inside the application, in coordinated code, and the code is specifically set up to handle asynchronous thread priority changes. (For example, if the application is a VM, and a virtual device sends an interrupt to a VCPU, the device handler may want to bump up the VCPU thread priority so the interrupt gets processed promptly. When the VCPU notices and dequeues the sent interrupt, it reevaluates the thread priority based on the totality of synchronous intra-VCPU conditions and currently visible asynchronous conditions such as the set of pending interrupts, and sets the new thread priority accordingly.) Another case is when an application's thread priority is changed from the outside of the application in an arbitrary way. There is no radical difference in this case between DPRIO and regular set_priority. Such an external priority change can indeed be disruptive for an application, but it is disruptive for an application that uses regular set_priority as well. Suppose the thread was running some critical tasks and/or holding some critical locks, and used regular set_priority to that end, and then was knocked down. This would be disruptive for an application using regular set_priority just as it would be for one using DPRIO. The exact mechanics of the disruption would be somewhat different, but the disruption would be present in both cases. Likewise, an application has means for recovery both in the regular set_priority and DPRIO cases. In the case of an application using regular set_priority the recovery will automatically happen on the next set_priority call. In the DPRIO case it may take a bunch of dprio_set calls, but given that they are meant to be used in a high-frequency invocation case, the recovery is likely to happen pretty fast as well, and after a certain number of cycles the "writeback" priority change cached in the userspace is likely to get "written through" to the kernel, albeit this process is somewhat "stochastic" and sequences can be constructed when it won't be for quite a while. If the application wished to give it some guaranteed predictability, it could use dprio_setnow(prio, DPRIO_FORCE) in every N-th invocation instead of dprio_set(prio). Nevertheless this indirection is one reason why I do not think making regular set_priority a wrapper around DPRIO is a good idea. It would strip the application developer of direct control, unless the DPRIO_FORCE flag was used, but then it becomes regular set_priority again. Another reason is that error reporting in the DPRIO case is delayed (e.g. 
via a callback in the DPRIO library implementation), and that's different from the semantics of the regular set_priority interface. To summarize it, regular set_priority (e.g. sched_setattr) defines an interface that is immediate (synchronous) and uncached, whereas DPRIO is deferred (asynchronous) and cached. The semantics are too different to let the former be wrapped around the latter without distortion. - Sergey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
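The usage pattern being contrasted with regular set_priority might look roughly as follows. The names dprio_set, dprio_setnow and DPRIO_FORCE are the library-level names used in this discussion, but the signatures, constant values and surrounding details below are guesses for illustration, not the actual interface from the RFC.

    /* Assumed library interface (illustrative declarations only). */
    extern void dprio_set(int prio);               /* deferred: userspace fast path */
    extern void dprio_setnow(int prio, int flags); /* immediate write-through       */
    #define DPRIO_FORCE     1                      /* assumed flag value            */

    #define PRIO_ELEVATED   (-10)                  /* illustrative priorities       */
    #define PRIO_NORMAL     0
    #define RESYNC_EVERY    256

    static unsigned long invocations;

    static void handle_one_request(void)
    {
            dprio_set(PRIO_ELEVATED);   /* cheap: the kernel applies it only if it
                                           actually decides to preempt this thread */

            /* ... short time-urgent or lock-holding section ... */

            dprio_set(PRIO_NORMAL);

            /* occasionally force a real priority write-through so the userspace-
             * cached value is known to match the kernel's view (the recovery
             * concern discussed above) */
            if (++invocations % RESYNC_EVERY == 0)
                    dprio_setnow(PRIO_NORMAL, DPRIO_FORCE);
    }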
Re: [PATCH RFC] sched: deferred set priority (dprio)
> (One example might be virtual machine that runs guest operating system that is > not paravirtualized or can be paravirtualized only to a limited extent. The VM > might guess that preemption of VCPU thread that is processing events such as > IPI interrupts, clock interrupts or certain device interrupts is likely to > cause overall performance degradation due to other VCPUs spinning for an IPI > response or spinning waiting for a spinlock, and thought the general kinds of > these dependencies may be foreseen, but actual dependency chains between VCPUs > cannot be established at run time.) PAUSE loop exiting (PLE) can handle it in limited fashion on VMs even without paravirtualization. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, 2014-08-02 at 17:43 -0700, Sergey Oboguev wrote: > When reasoning about concurrency management it may be helpful to keep in mind > the fundamental perspective that the problem space and solution space in this > area are fragmented -- just as your message exemplifies as well, but it also > applies across the board to all other solution techniques. There is no > all-unifying solution that works in all use cases and for all purposes. I'm just not seeing the beauty in your patch. Ignoring SCHED_NORMAL where priority escalation does not work as preemption proofing, for realtime classes, what I see is a programmer using a mechanism designed to initiate preemption arming his tasks with countermeasures to the very thing he initiates. Deferred preempt seems to be what you want, but you invented something very different. As noted though, you don't have to convince me. The thing to do is chop your patch up into a nice reviewable series, and submit it. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat 2014-08-02 17:47:52, Sergey Oboguev wrote: > On Wed, Jul 30, 2014 at 6:02 AM, Pavel Machek wrote: > > Hi! > > > >> One of the intended purposes of this facility (but its not sole purpose) > >> is to > >> render a lightweight mechanism for priority protection of lock-holding > >> critical > >> sections that would be an adequate match for lightweight locking primitives > >> such as futex, with both featuring a fast path completing within the > >> userspace. > > > Do we get a manpage describing the interface...? > > At this point it is just an RFC, and the only available write-up is an article > (URL is in the original message). > > There has been no word from the maintainers yet whether the proposal appears > to be a "go" or "no-go" in general. It appears to be "no-go" to me. > Regular "set priority" cannot be wrapped around "deferred set priority". Umm. Why not?

    int getpriority(int which, int who);
    int setpriority(int which, int who, int prio);

    Description
    The scheduling priority of the process, process group, or user, as
    indicated by which and who is obtained with the getpriority() call and set
    with the setpriority() call. The value which is one of PRIO_PROCESS,
    PRIO_PGRP, or PRIO_USER, and who is interpreted relative to which (a
    process identifier for PRIO_PROCESS, process group identifier for
    PRIO_PGRP, and a user ID for PRIO_USER). A zero value for who denotes
    (respectively) the calling process, the process group of the calling
    process, or the real user ID of the calling process. Prio is a value in
    the range -20 to 19 (but see the Notes below). The default priority is 0;
    lower priorities cause more favorable scheduling.

In vsyscall area:

    if (which == PRIO_PROCESS && who == 0) {
            /* perform your optimized priority set */
    } else {
            /* perform syscall */
    }

Now, you have to make sure to keep reasonably close semantics, but that would be a good idea, anyway. > Even more importantly, its use also relies on a thread caching the > application's > current knowledge of the thread priority in the userspace, and if the thread > priority had been changed from the outside of the application (or even by > another thread within the same application), this knowledge becomes invalid, > and then the application is responsible for performing whatever recovery > action > is appropriate. You mean "we rely on applications handling the situation they can't and will not handle"? Actually, it seems to be a security issue to me. If root renices the application to high nice value, application should not be able to work around it by the DPRIO interface. IOW the priority should always be the lower of DPRIO and the normal one, at the very least. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Wed, Jul 30, 2014 at 6:02 AM, Pavel Machek wrote: > Hi! > >> One of the intended purposes of this facility (but its not sole purpose) is >> to >> render a lightweight mechanism for priority protection of lock-holding >> critical >> sections that would be an adequate match for lightweight locking primitives >> such as futex, with both featuring a fast path completing within the >> userspace. > Do we get a manpage describing the interface...? At this point it is just an RFC, and the only available write-up is an article (URL is in the original message). There has been no word from the maintainers yet whether the proposal appears to be a "go" or "no-go" in general. Assuming the response was favorable, final polishing on the patch can be done (such as perhaps replacing or augmenting authlist by a capability), and man page can be added at this point as well. > Would it make sense to make set_priority a "vsyscall" so it is fast enough, and delayed_set_priority does not need to be exposed to userspace? Regular "set priority" cannot be wrapped around "deferred set priority". "Deferred set priority" acts only on current thread. Even more importantly, its use also relies on a thread caching the application's current knowledge of the thread priority in the userspace, and if the thread priority had been changed from the outside of the application (or even by another thread within the same application), this knowledge becomes invalid, and then the application is responsible for performing whatever recovery action is appropriate. Thus DPRIO is not a replacement for fully-functional "set priority", but rather a specialized tool for certain use cases. - Sergey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Mon, Jul 28, 2014 at 12:24 AM, Mike Galbraith wrote: > On Sun, 2014-07-27 at 18:19 -0700, Andi Kleen wrote: >> Sergey Oboguev writes: >> >> > [This is a repost of the message from few day ago, with patch file >> > inline instead of being pointed by the URL.] >> >> Have you checked out the preemption control that was posted some time >> ago? It did essentially the same thing, but somewhat simpler than your >> patch. >> >> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html >> >> Yes I agree with you that lock preemption is a serious issue that needs >> solving. > > Yeah, it's a problem, and well known. > > One mitigation mechanism that exists in the stock kernel today is the > LAST_BUDDY scheduler feature. That took pgsql benchmarks from "shite" > to "shiny", and specifically targeted this issue. > > Another is SCHED_BATCH, which can solve a lot of the lock problem by > eliminating wakeup preemption within an application. One could also > create an extended batch class which is not only immune from other > SCHED_BATCH and/or SCHED_IDLE tasks, but all SCHED_NORMAL wakeup > preemption. Trouble is that killing wakeup preemption precludes very > fast very light tasks competing with hogs for CPU time. If your load > depends upon these performing well, you have a problem. > > Mechanism #3 is use of realtime scheduler classes. This one isn't > really a mitigation mechanism, it's more like donning a super suit. > > So three mechanisms exist, the third being supremely effective, but high > frequency usage is expensive, ergo huge patch. > > The lock holder preemption problem being identical to the problem RT > faces with kernel locks... > > A lazy preempt implementation ala RT wouldn't have the SCHED_BATCH > problem, but would have a problem in that should critical sections not > be as tiny as it should be, every time you dodge preemption you're > fighting the fair engine, may pay heavily in terms of scheduling > latency. Not a big hairy deal, if it hurts, don't do that. Bigger > issue is that you have to pop into the kernel on lock acquisition and > release to avoid jabbering with the kernel via some public phone. > Popping into the kernel, if say some futex were victimized, also erases > the "f" in futex, and restricting cost to consumer won't be any easier. > > The difference wrt cost acceptability is that the RT issue is not a > corner case, it's core issue resulting from the nature of the RT beast > itself, so the feature not being free is less annoying. A corner case > fix OTOH should not impact the general case at all. When reasoning about concurrency management it may be helpful to keep in mind the fundamental perspective that the problem space and solution space in this area are fragmented -- just as your message exemplifies as well, but it also applies across the board to all other solution techniques. There is no all-unifying solution that works in all use cases and for all purposes. This applies even to seemingly well-defined problems such as lock holder preemption avoidance. One of the divisions on a broad scale is between cases when wait/dependency chain can be made explicit at run time (e.g. application/system can tell that thread A is waiting on thread B which is waiting on thread C) and those cases when the dependency cannot be explicitly expressed and information on the dependency is lacking. In the former case some form of priority inheritance or proxy execution might often (but not always) work well. In the latter case PI/PE cannot be applied. 
This case occurs when a component needs to implement a time-urgent section acting on behalf of (e.g. in response to an event in) another component in the application, and the latter is not (or often cannot practically be) instrumented with code that expresses its inner dependencies, so the information about the dependencies is lacking. (One example might be a virtual machine running a guest operating system that is not paravirtualized or can be paravirtualized only to a limited extent. The VM might guess that preemption of a VCPU thread that is processing events such as IPI interrupts, clock interrupts or certain device interrupts is likely to cause overall performance degradation due to other VCPUs spinning for an IPI response or spinning waiting for a spinlock, and though the general kinds of these dependencies may be foreseen, the actual dependency chains between VCPUs cannot be established at run time.) In cases when dependency information is lacking, priority protection remains the only effective recourse. Furthermore, even in cases when dependency information is available, PI/PE per se is not always a satisfactory or sufficient solution. Consider for instance the classic case of a userspace spinlock or hybrid spin-then-block lock that is highly contended and frequently requested by the threads, where a running thread spends a notable part of its timeslice holding the lock (not necessarily grabbing it once and holding it; the thread may grab and release it many times within the timeslice).
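For concreteness, a minimal sketch of such a hybrid spin-then-block lock, written against the raw futex(2) interface (the names and the spin budget are purely illustrative), might look as follows. If the holder is preempted inside the critical section, every contender first burns its spin budget and then blocks in the kernel -- which is exactly the cost being discussed.

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    #define SPIN_TRIES 100                          /* illustrative tuning knob */

    static long sys_futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* lock word: 0 = free, 1 = locked, 2 = locked with waiters */
    static void hybrid_lock(atomic_int *lk)
    {
        for (int i = 0; i < SPIN_TRIES; i++) {
            int expected = 0;
            if (atomic_compare_exchange_strong(lk, &expected, 1))
                return;                             /* fast path, no system call */
        }
        while (atomic_exchange(lk, 2) != 0)         /* slow path: sleep in kernel */
            sys_futex(lk, FUTEX_WAIT_PRIVATE, 2);
    }

    static void hybrid_unlock(atomic_int *lk)
    {
        if (atomic_exchange(lk, 0) == 2)            /* someone may be sleeping */
            sys_futex(lk, FUTEX_WAKE_PRIVATE, 1);
    }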
Re: [PATCH RFC] sched: deferred set priority (dprio)
Hi! > One of the intended purposes of this facility (but its not sole purpose) is to > render a lightweight mechanism for priority protection of lock-holding > critical > sections that would be an adequate match for lightweight locking primitives > such as futex, with both featuring a fast path completing within the > userspace. Do we get a manpage describing the interface...? Would it make sense to make set_priority a "vsyscall" so it is fast enough, and delayed_set_priority does not need to be exposed to userspace? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, 2014-07-27 at 18:19 -0700, Andi Kleen wrote: > Sergey Oboguev writes: > > > [This is a repost of the message from few day ago, with patch file > > inline instead of being pointed by the URL.] > > Have you checked out the preemption control that was posted some time > ago? It did essentially the same thing, but somewhat simpler than your > patch. > > http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html > > Yes I agree with you that lock preemption is a serious issue that needs > solving. Yeah, it's a problem, and well known. One mitigation mechanism that exists in the stock kernel today is the LAST_BUDDY scheduler feature. That took pgsql benchmarks from "shite" to "shiny", and specifically targeted this issue. Another is SCHED_BATCH, which can solve a lot of the lock problem by eliminating wakeup preemption within an application. One could also create an extended batch class which is not only immune from other SCHED_BATCH and/or SCHED_IDLE tasks, but all SCHED_NORMAL wakeup preemption. Trouble is that killing wakeup preemption precludes very fast very light tasks competing with hogs for CPU time. If your load depends upon these performing well, you have a problem. Mechanism #3 is use of realtime scheduler classes. This one isn't really a mitigation mechanism, it's more like donning a super suit. So three mechanisms exist, the third being supremely effective, but high frequency usage is expensive, ergo huge patch. The lock holder preemption problem being identical to the problem RT faces with kernel locks... A lazy preempt implementation ala RT wouldn't have the SCHED_BATCH problem, but would have a problem in that should critical sections not be as tiny as it should be, every time you dodge preemption you're fighting the fair engine, may pay heavily in terms of scheduling latency. Not a big hairy deal, if it hurts, don't do that. Bigger issue is that you have to pop into the kernel on lock acquisition and release to avoid jabbering with the kernel via some public phone. Popping into the kernel, if say some futex were victimized, also erases the "f" in futex, and restricting cost to consumer won't be any easier. The difference wrt cost acceptability is that the RT issue is not a corner case, it's core issue resulting from the nature of the RT beast itself, so the feature not being free is less annoying. A corner case fix OTOH should not impact the general case at all. Whatever outcome, I hope it'll be tiny. 1886 ain't tiny. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
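For reference, a minimal sketch of opting a thread into the existing SCHED_BATCH class mentioned above (the hypothetical extended batch class would need new kernel support; error handling omitted):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <string.h>

    /* Move the calling thread into SCHED_BATCH: the scheduler treats it as
     * CPU-bound, so it is disfavoured for wakeup preemption. */
    static int enter_batch(void)
    {
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 0;          /* SCHED_BATCH has no rt priority */
        return sched_setscheduler(0, SCHED_BATCH, &sp);
    }

Equivalently, a whole process can be started in the batch class from a shell with "chrt -b 0 <command>".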
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, Jul 27, 2014 at 6:19 PM, Andi Kleen wrote: >> [This is a repost of the message from few day ago, with patch file >> inline instead of being pointed by the URL.] > > Have you checked out the preemption control that was posted some time > ago? It did essentially the same thing, but somewhat simpler than your > patch. > > http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html Yes, I have seen this discussion. The patch suggested by Khalid implements a solution very much resembling Solaris/AIX schedctl. Schedctl is less generic and powerful than dprio. I compared dprio vs. schedctl in the write-up https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt To quote from there, [--- Quote ---] The Solaris schedctl [...] does not provide a way to associate a priority with the resource whose lock is being held (or, more generally, with thread application-specific logical state; see the footnote below). An application is likely to have a range of locks with different criticality levels and different needs for holder protection [*]. For some locks, holder preemption may be tolerated somewhat, while other locks are highly critical, furthermore for some lock holders preemption by a high-priority thread is acceptable but not a preemption by a low-priority thread. The Solaris/AIX schedctl does not provide a capability for priority ranging relative to the context of the whole application and other processes in the system. [*] We refer just to locks here for simplicity, but the need of a thread for preemption control does not reduce to locks held alone, and may result from other intra-application state conditions, such as executing a time-urgent fragment of code in response to a high-priority event (that may potentially be blocking for other threads) or other code paths that can lead to wait chains unless completed promptly. Second, in some cases application may need to perform time-urgent processing without knowing in advance how long it will take. In the majority of cases the processing may be very short (a fraction of a scheduling timeslice), but occasionally may take much longer (such as a fraction of a second). Since schedctl would not be effective in the latter case, an application would have to resort to system calls for thread priority control in all cases [*], even in the majority of "short processing" cases, with all the overhead of this approach. [*] Or introduce extra complexity, most likely very cumbersome, by trying to gauge and monitor the accumulated duration of the processing, with the intention to transition from schedctl to thread priority elevation once a threshold has been reached. [--- End of quote ---] Even so, I felt somewhat puzzled by the response to Khalid's delay-preempt patch. While some arguments put forth against it were certainly valid in their own right, but somehow their focus seemed to be that the solution won't interoperate well with all the conceivable setups and application mixes, won't solve all the concurrency issues, and the worst of all won't slice bread either. Whereas my perception (perhaps incorrectly) was that this patch was not meant to solve a whole range of problems and to be a feature enabled by default in a generic system, but rather a specialized feature configurable in special-purpose systems (e.g. 
database servers; Khalid was doing it for Oracle, and I believe his JVM use case is also in this context) dedicated to running a primary-importance application that utilizes this mechanism, and meant to solve a very particular problem of this specific category of system deployments. It appeared to me that the participants in the delay-preempt patch discussion might have had different ideas of the implied scope of use of the suggested feature, and that this might have influenced the direction of the discussion. - Sergey
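To illustrate the difference: with a deferred-set-priority scheme each lock can carry its own ceiling, something a single schedctl-style per-thread "don't preempt me" hint cannot express. A minimal sketch (dprio_set()/dprio_restore() are hypothetical stand-ins for whatever the userspace library provides, not the actual API):

    #include <pthread.h>

    /* hypothetical: post/cancel a deferred priority request, userspace only */
    int  dprio_set(int prio);           /* returns the previously requested prio */
    void dprio_restore(int prio);

    struct ceiling_lock {
        pthread_mutex_t mtx;
        int             ceiling;        /* priority associated with THIS lock */
    };

    static void ceiling_lock(struct ceiling_lock *l, int *saved)
    {
        *saved = dprio_set(l->ceiling); /* no system call on this path */
        pthread_mutex_lock(&l->mtx);
    }

    static void ceiling_unlock(struct ceiling_lock *l, int saved)
    {
        pthread_mutex_unlock(&l->mtx);
        dprio_restore(saved);           /* again userspace-only */
    }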
Re: [PATCH RFC] sched: deferred set priority (dprio)
Sergey Oboguev writes: > [This is a repost of the message from few day ago, with patch file > inline instead of being pointed by the URL.] Have you checked out the preemption control that was posted some time ago? It did essentially the same thing, but somewhat simpler than your patch. http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html Yes I agree with you that lock preemption is a serious issue that needs solving. -Andi -- a...@linux.intel.com -- Speaking for myself only
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sun, 2014-07-27 at 02:09 -0700, Sergey Oboguev wrote: > On Sat, Jul 26, 2014 at 9:02 PM, Mike Galbraith > wrote: > > On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote: > >> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith > >> wrote: > >> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: > >> >> [This is a repost of the message from few day ago, with patch file > >> >> inline instead of being pointed by the URL.] > >> >> > >> >> This patch is intended to improve the support for fine-grain parallel > >> >> applications that may sometimes need to change the priority of their > >> >> threads at > >> >> a very high rate, hundreds or even thousands of times per scheduling > >> >> timeslice. > >> >> > >> >> These are typically applications that have to execute short or very > >> >> short > >> >> lock-holding critical or otherwise time-urgent sections of code at a > >> >> very high > >> >> frequency and need to protect these sections with "set priority" system > >> >> calls, > >> >> one "set priority" call to elevate current thread priority before > >> >> entering the > >> >> critical or time-urgent section, followed by another call to downgrade > >> >> thread > >> >> priority at the completion of the section. Due to the high frequency of > >> >> entering and leaving critical or time-urgent sections, the cost of > >> >> these "set > >> >> priority" system calls may raise to a noticeable part of an > >> >> application's > >> >> overall expended CPU time. Proposed "deferred set priority" facility > >> >> allows to > >> >> largely eliminate the cost of these system calls. > >> > > >> > So you essentially want to ship preempt_disable() off to userspace? > >> > > >> > >> Only to the extent preemption control is already exported to the userspace > >> and > >> a task is already authorized to control its preemption by its > >> RLIMIT_RTPRIO, > >> RLIMIT_NICE and capable(CAP_SYS_NICE). > >> > >> DPRIO does not amplify a taks's capability to elevate its priority and > >> block > >> other tasks, it just reduces the computational cost of frequest > >> sched_setattr(2) calls. > > > You are abusing realtime > > I am unsure why you would label priority ceiling for locks and priority > protection for other forms of time-urgent sections as an "abuse". Ok, maybe "abuse" is too strong. I know there are reasons why people do what they do, even when it may look silly to me. I didn't like what I saw in case you couldn't tell, but lucky you, you're not selling it to me, you're selling it to maintainers. I CCd them, so having voiced my opinion, I'll shut up and listen. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, Jul 26, 2014 at 9:02 PM, Mike Galbraith wrote: > On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote: >> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith >> wrote: >> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: >> >> [This is a repost of the message from few day ago, with patch file >> >> inline instead of being pointed by the URL.] >> >> >> >> This patch is intended to improve the support for fine-grain parallel >> >> applications that may sometimes need to change the priority of their >> >> threads at >> >> a very high rate, hundreds or even thousands of times per scheduling >> >> timeslice. >> >> >> >> These are typically applications that have to execute short or very short >> >> lock-holding critical or otherwise time-urgent sections of code at a very >> >> high >> >> frequency and need to protect these sections with "set priority" system >> >> calls, >> >> one "set priority" call to elevate current thread priority before >> >> entering the >> >> critical or time-urgent section, followed by another call to downgrade >> >> thread >> >> priority at the completion of the section. Due to the high frequency of >> >> entering and leaving critical or time-urgent sections, the cost of these >> >> "set >> >> priority" system calls may raise to a noticeable part of an application's >> >> overall expended CPU time. Proposed "deferred set priority" facility >> >> allows to >> >> largely eliminate the cost of these system calls. >> > >> > So you essentially want to ship preempt_disable() off to userspace? >> > >> >> Only to the extent preemption control is already exported to the userspace >> and >> a task is already authorized to control its preemption by its RLIMIT_RTPRIO, >> RLIMIT_NICE and capable(CAP_SYS_NICE). >> >> DPRIO does not amplify a taks's capability to elevate its priority and block >> other tasks, it just reduces the computational cost of frequest >> sched_setattr(2) calls. > You are abusing realtime I am unsure why you would label priority ceiling for locks and priority protection for other forms of time-urgent sections as an "abuse". It would appear you start from a presumption that the sole valid purpose for ranging task priorities should ever be only hard real-time applications such as plant process control etc., but that's not a valid or provable presumption, but rather an article of faith -- a faith, as you acknowledge, a lot of developers do not share, and a rational argument to the contrary of this faith is that there are no all-fitting satisfactory and practical alternative solutions to the problems that are being solved with those tools, that's the key reason why they are used. The issue then distills to a more basic question of whether this faith should be imposed on the dissenting application developers, and whether Linux should provide a mechanism or a policy. As for DPRIO specifically, while it may encourage somewhat the use of priority ceiling and priority protection, but it does not provide an additional basic mechanism beyond one already exported by the kernel (i.e. "set priority"), it just makes this pre-existing basic mechanism cheaper to use in certain use cases. > if what you want/need is a privileged userspace lock The problem is not reducible to locks. Applications also have time-urgent critical section that arise from wait and interaction chains not expressible via locking notation. > you could make a flavor of futex that makes the owner non-preemptible Lock owner should definitely be preemptible by more time-urgent tasks. 
> it's not like multiple users could coexist peacefully anyway It depends. Common sense suggests not running an air traffic control system on the same machine as an airline CRM database system, but perhaps one might co-host CRM and ERP database instances on the same machine. Indeed, applications that are installed with rights granting them access to elevated priority are generally those that are important for the purpose of the system they are deployed on. The machine they are installed on may either be dedicated to running this particular application, or it may be used for running a set of primary-importance applications that can coexist. As an obvious rule of thumb, applications using elevated priorities for the sake of deterministic response time should not be combined "on equal footing" with non-deterministic applications using elevated priorities for the sake of better overall system throughput and responsiveness. If they are ever combined at all, the former category should use priority levels above the latter. It may however often be possible -- as far as priority use is concerned -- to combine multiple applications of the latter (non-deterministic) category, as long as their critical sections combined take less than the total available CPU time. If applications are fundamentally incompatible because their aggregate demand for resources exceeds available system resources, be it CPU or memory reso
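A made-up illustration of the rule of thumb above (the numbers are arbitrary, not a recommendation):

    /* Deterministic components sit strictly above all throughput-oriented
     * ceilings, so a boosted lock holder in either database can never delay
     * them; the two databases can share the lower band as long as their
     * boosted sections together consume well under 100% of CPU time
     * (e.g. ~10% + ~15% = ~25% leaves plenty of headroom). */
    enum {
        PRIO_PROCESS_CONTROL = 85,      /* hard response-time requirements */
        PRIO_DB_A_CEILING    = 45,      /* instance A critical sections    */
        PRIO_DB_B_CEILING    = 35,      /* instance B critical sections    */
    };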
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote: > On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith > wrote: > > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: > >> [This is a repost of the message from few day ago, with patch file > >> inline instead of being pointed by the URL.] > >> > >> This patch is intended to improve the support for fine-grain parallel > >> applications that may sometimes need to change the priority of their > >> threads at > >> a very high rate, hundreds or even thousands of times per scheduling > >> timeslice. > >> > >> These are typically applications that have to execute short or very short > >> lock-holding critical or otherwise time-urgent sections of code at a very > >> high > >> frequency and need to protect these sections with "set priority" system > >> calls, > >> one "set priority" call to elevate current thread priority before entering > >> the > >> critical or time-urgent section, followed by another call to downgrade > >> thread > >> priority at the completion of the section. Due to the high frequency of > >> entering and leaving critical or time-urgent sections, the cost of these > >> "set > >> priority" system calls may raise to a noticeable part of an application's > >> overall expended CPU time. Proposed "deferred set priority" facility > >> allows to > >> largely eliminate the cost of these system calls. > > > > So you essentially want to ship preempt_disable() off to userspace? > > > > Only to the extent preemption control is already exported to the userspace and > a task is already authorized to control its preemption by its RLIMIT_RTPRIO, > RLIMIT_NICE and capable(CAP_SYS_NICE). > > DPRIO does not amplify a taks's capability to elevate its priority and block > other tasks, it just reduces the computational cost of frequest > sched_setattr(2) calls. Exactly. You are abusing realtime, and you are not the only guy out there doing that. What you want is control over a userspace critical section, and you are willing to do whatever is necessary to get that. I think your code is a really good example of how far people are willing to go, but I hope this goes nowhere beyond getting people to think about what you and others want. I would say cut to the chase, if what you want/need is a privileged userspace lock, make one, and put _all_ of the ugliness inside it. Forget about this "Hello Mr. kernel, here's what I would have done to get what I want if I weren't such a cheap bastard, if you think about preempting me, pretend I actually did that instead" business. Forget about all that RLIMIT_RTPRIO and access list stuff too, either you're privileged or you're not, it's not like multiple users could coexist peacefully anyway. Maybe you could make a flavor of futex that makes the owner non-preemptible, checks upon release or such. Note: if you do touch futex.c, you'll definitely want to document that you eliminated every last remote possibility of breaking anything, and donning Nomex underwear before posting would not be a bad idea ;-) -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith wrote: > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: >> [This is a repost of the message from few day ago, with patch file >> inline instead of being pointed by the URL.] >> >> This patch is intended to improve the support for fine-grain parallel >> applications that may sometimes need to change the priority of their threads >> at >> a very high rate, hundreds or even thousands of times per scheduling >> timeslice. >> >> These are typically applications that have to execute short or very short >> lock-holding critical or otherwise time-urgent sections of code at a very >> high >> frequency and need to protect these sections with "set priority" system >> calls, >> one "set priority" call to elevate current thread priority before entering >> the >> critical or time-urgent section, followed by another call to downgrade thread >> priority at the completion of the section. Due to the high frequency of >> entering and leaving critical or time-urgent sections, the cost of these "set >> priority" system calls may raise to a noticeable part of an application's >> overall expended CPU time. Proposed "deferred set priority" facility allows >> to >> largely eliminate the cost of these system calls. > > So you essentially want to ship preempt_disable() off to userspace? > Only to the extent preemption control is already exported to the userspace and a task is already authorized to control its preemption by its RLIMIT_RTPRIO, RLIMIT_NICE and capable(CAP_SYS_NICE). DPRIO does not amplify a task's capability to elevate its priority and block other tasks, it just reduces the computational cost of frequent sched_setattr(2) calls. > > -Mike > >> Instead of executing a system call to elevate its thread priority, an >> application simply writes its desired priority level to a designated memory >> location in the userspace. When the kernel attempts to preempt the thread... - Sergey
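For completeness, the conventional pattern whose per-section cost DPRIO is meant to remove looks roughly like this (a sketch; sched_setscheduler(2) is used here for brevity where the patch talks in terms of sched_setattr(2), error handling is omitted, and the usual RLIMIT_RTPRIO/CAP_SYS_NICE authorization is assumed):

    #include <pthread.h>
    #include <sched.h>

    static void protected_section(pthread_mutex_t *m, int ceiling)
    {
        struct sched_param up   = { .sched_priority = ceiling };
        struct sched_param down = { .sched_priority = 0 };

        sched_setscheduler(0, SCHED_FIFO, &up);     /* system call #1 */
        pthread_mutex_lock(m);
        /* ... short lock-holding or time-urgent work ... */
        pthread_mutex_unlock(m);
        sched_setscheduler(0, SCHED_OTHER, &down);  /* system call #2 */
    }

    /* At hundreds or thousands of such sections per timeslice the two system
     * calls dominate; in the common case DPRIO replaces them with plain
     * userspace memory writes. */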
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: > [This is a repost of the message from few day ago, with patch file > inline instead of being pointed by the URL.] > > This patch is intended to improve the support for fine-grain parallel > applications that may sometimes need to change the priority of their threads > at > a very high rate, hundreds or even thousands of times per scheduling > timeslice. > > These are typically applications that have to execute short or very short > lock-holding critical or otherwise time-urgent sections of code at a very high > frequency and need to protect these sections with "set priority" system calls, > one "set priority" call to elevate current thread priority before entering the > critical or time-urgent section, followed by another call to downgrade thread > priority at the completion of the section. Due to the high frequency of > entering and leaving critical or time-urgent sections, the cost of these "set > priority" system calls may raise to a noticeable part of an application's > overall expended CPU time. Proposed "deferred set priority" facility allows to > largely eliminate the cost of these system calls. So you essentially want to ship preempt_disable() off to userspace? (smiles wickedly, adds CCs) -Mike > Instead of executing a system call to elevate its thread priority, an > application simply writes its desired priority level to a designated memory > location in the userspace. When the kernel attempts to preempt the thread... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Fri, Jul 25, 2014 at 1:12 PM, Andy Lutomirski wrote: > On 07/25/2014 12:45 PM, Sergey Oboguev wrote: >> [This is a repost of the message from few day ago, with patch file >> inline instead of being pointed by the URL.] >> >> This patch is intended to improve the support for fine-grain parallel >> applications that may sometimes need to change the priority of their threads >> at >> a very high rate, hundreds or even thousands of times per scheduling >> timeslice. > > What is the authlist stuff for? It looks complicated and it seems like > it should be unnecessary. The authlist controls who is allowed to use the DPRIO facility. DPRIO works by making a check at __schedule() time for whether a deferred set priority request is pending, and if so, then performing an equivalent of sched_setattr(), minus security checks, before the rescheduling. This introduces additional latency at task switch time when a deferred set priority request is pending -- albeit normally a very small latency, but non-zero one and a malicious user can also manoeuvre into increasing it. There are two parts to this latency. One is more or less constant part in the order of 1 usec, that's basic __sched_setscheduler() code. The other part is due to possible priority inheritance chain adjustment in rt_mutex_adjust_pi() and depends on the length of the chain. Malicious user might conceivably construct a very long chain, long enough for processing of this chain at the time of __schedule to cause an objectionable latency. On systems where this might be a concern, administrator may therefore want to restrict the use of DPRIO to legitimate trusted applications (or rather users of those). If a common feeling is that such a safeguard is overly paranoid, I would be happy to drop it, but I feel some security control option may be desirable for the mentioned reason. Rethinking it now though, perhaps a simpler alternative could be adding a capability for DPRIO plus a system-wide setting as to whether this capability is required for the use of DPRIO or DPRIO is allowed for everyone on the system. The benefit of this approach might also be that the administrator can use setcap on trusted executable files installed in the system to grant them this capability. > What's with the save/restore code? What's it saving and restoring? It is used inside execve. There are two DPRIO-related elements stored in task_struct (plus one more for the debug code only). One is an address of a location in the userspace that is used for the userspace <-> kernel communication. Another element stored in task_struct is the pointer to per-task DPRIO information block kept inside the kernel. This block holds pre-authorized priorities. (The address of userspace location is stored in task_struct itself, rather than in DPRIO info block to avoid extra indirection in the frequent path.) Successful execve must result in shutting down DPRIO for the task, i.e. resetting these two pointers to NULL (this must be performed before new executable image is loaded, otherwise a corruption of new image memory can result) and also the deallocation of DPRIO information block. If execve fails however and control returns to the original image, DPRIO settings should be retained. Before calling image loader, execve (i.e. do_execve_common) invokes dprio_save_reset_context() to save these two pointers in on-stack backup structure and to reset their values in task_struct to NULL. 
If the new image loads fine, the DPRIO information block is deallocated by dprio_free_context() and control is passed on to the new image, with the pointers in task_struct already reset to NULL. If image loading fails, the error recovery path invokes dprio_restore_context() to restore the pointers from the backup structure back into task_struct. - Sergey
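A condensed sketch of the execve-time pattern described above (the struct and function names appear in the patch, but the field names and bodies here are approximations, not the actual patch code):

    struct dprio_saved_context {
        void *ku_area;                  /* userspace communication area  */
        struct dprio_info *info;        /* per-task DPRIO info block     */
    };

    /* before loading the new image: back up and detach DPRIO state */
    static void dprio_save_reset_context(struct dprio_saved_context *saved,
                                         struct task_struct *tsk)
    {
        saved->ku_area = tsk->dprio_ku_area;
        saved->info    = tsk->dprio_info;
        tsk->dprio_ku_area = NULL;
        tsk->dprio_info    = NULL;
    }

    /* exec failed, original image keeps running: reattach DPRIO state */
    static void dprio_restore_context(struct dprio_saved_context *saved,
                                      struct task_struct *tsk)
    {
        tsk->dprio_ku_area = saved->ku_area;
        tsk->dprio_info    = saved->info;
    }

    /* exec succeeded: dprio_free_context(saved) frees the info block instead */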
Re: [PATCH RFC] sched: deferred set priority (dprio)
On 07/25/2014 12:45 PM, Sergey Oboguev wrote: > [This is a repost of the message from few day ago, with patch file > inline instead of being pointed by the URL.] > > This patch is intended to improve the support for fine-grain parallel > applications that may sometimes need to change the priority of their threads > at > a very high rate, hundreds or even thousands of times per scheduling > timeslice. What is the authlist stuff for? It looks complicated and it seems like it should be unnecessary. What's with the save/restore code? What's it saving and restoring? --Andy
[PATCH RFC] sched: deferred set priority (dprio)
[This is a repost of the message from a few days ago, with the patch file inline instead of being pointed to by the URL.]

This patch is intended to improve the support for fine-grain parallel applications that may sometimes need to change the priority of their threads at a very high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that have to execute short or very short lock-holding critical or otherwise time-urgent sections of code at a very high frequency and need to protect these sections with "set priority" system calls: one "set priority" call to elevate the current thread's priority before entering the critical or time-urgent section, followed by another call to downgrade the thread's priority at the completion of the section. Due to the high frequency of entering and leaving critical or time-urgent sections, the cost of these "set priority" system calls may rise to a noticeable fraction of an application's overall expended CPU time. The proposed "deferred set priority" facility largely eliminates the cost of these system calls.

Instead of executing a system call to elevate its thread priority, an application simply writes its desired priority level to a designated memory location in userspace. When the kernel attempts to preempt the thread, it first checks the content of this location, and if the application's request to change its priority has been posted in the designated memory area, the kernel executes this request and alters the priority of the thread being preempted before rescheduling, and then makes its scheduling decision based on the new thread priority level, thus implementing the priority protection of the critical or time-urgent section desired by the application.

In the predominant number of cases, however, an application will complete the critical section before the end of the current timeslice and cancel or alter the request held in the userspace area. Thus the vast majority of an application's change-priority requests are handled and mutually cancelled or coalesced within userspace, at very low overhead and without incurring the cost of a system call, while maintaining safe preemption control. The cost of an actual kernel-level "set priority" operation is incurred only if the application is actually preempted while inside the critical section, i.e. typically at most once per scheduling timeslice instead of hundreds or thousands of "set priority" system calls in the same timeslice.

One of the intended purposes of this facility (though not its sole purpose) is to provide a lightweight mechanism for priority protection of lock-holding critical sections that would be an adequate match for lightweight locking primitives such as futex, with both featuring a fast path completing within userspace.

More detailed description can be found in: https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

The patch is currently based on linux-3.15.2.

User-level library implementing userspace-side boilerplate code: https://github.com/oboguev/dprio/tree/master/src/userlib

Test set: https://github.com/oboguev/dprio/tree/master/src/test

The patch is enabled with CONFIG_DEFERRED_SETPRIO. There are also a few other config settings: a setting for the dprio debug code, a setting that controls the initial value of the authorization list restricting the use of the facility based on user or group ids, and a setting to improve the determinism of the rescheduling latency when a dprio request is pending under low-memory conditions.
Please see dprio.txt for details.

Comments would be appreciated.

Thanks,
Sergey

Signed-off-by: Sergey Oboguev

---
 fs/exec.c                  |   8 +
 fs/proc/Makefile           |   1 +
 fs/proc/authlist.c         | 493 +
 include/linux/authlist.h   | 114 +++
 include/linux/dprio.h      | 130
 include/linux/init_task.h  |  17 +
 include/linux/sched.h      |  15 +
 include/uapi/linux/prctl.h |   2 +
 init/Kconfig               |   2 +
 kernel/Kconfig.dprio       |  81 +
 kernel/exit.c              |   6 +
 kernel/fork.c              |  87 +-
 kernel/sched/Makefile      |   1 +
 kernel/sched/core.c        | 200 +++-
 kernel/sched/dprio.c       | 734
 kernel/sys.c               |   6 +
 kernel/sysctl.c            |  11 +
 17 files changed, 1897 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 238b7aa..9f5b649 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,6 +56,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1433,6 +1434,7 @@ static int do_execve_common(struct filename *filename,
 	struct file *file;
 	struct files_struct *displaced;
 	int retval;
+	struct dprio_saved_context dprio_context;
 
 	if (IS_ERR(filename))
 		return PTR_ERR(filename);
@@ -1483,6 +1485,9 @@ static int do_execve_common(struct file
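On the userspace side the common case then reduces to two plain memory writes per section. A sketch of the idea (the actual layout of the userspace<->kernel area and the library interface are defined by dprio.txt and the userlib; the names below are illustrative only):

    #include <stdatomic.h>

    /* one word per thread, registered with the kernel beforehand
     * (presumably via the prctl the patch adds); 0 is taken here to mean
     * "no elevation requested" */
    static _Thread_local atomic_int dprio_request;

    static inline void urgent_enter(int prio)
    {
        /* no system call: the kernel consults this word only if it is
         * actually about to preempt the thread */
        atomic_store_explicit(&dprio_request, prio, memory_order_release);
    }

    static inline void urgent_leave(void)
    {
        atomic_store_explicit(&dprio_request, 0, memory_order_release);
        /* a real kernel-level "set priority" operation is paid only in the
         * case where the thread did get preempted inside the section */
    }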
Re: [PATCH RFC] sched: deferred set priority (dprio)
On Mon, 21 Jul 2014, Sergey Oboguev wrote: > This patch is intended to improve the support for fine-grain parallel This patch is no patch at all. Please read Documentation/SubmittingPatches Thanks, tglx
[PATCH RFC] sched: deferred set priority (dprio)
This patch is intended to improve the support for fine-grain parallel applications that may sometimes need to change the priority of their threads at a very high rate, hundreds or even thousands of times per scheduling timeslice. These are typically applications that have to execute short or very short lock-holding critical or otherwise time-urgent sections of code at a very high frequency and need to protect these sections with "set priority" system calls, one "set priority" call to elevate current thread priority before entering the critical or time-urgent section, followed by another call to downgrade thread priority at the completion of the section. Due to the high frequency of entering and leaving critical or time-urgent sections, the cost of these "set priority" system calls may raise to a noticeable part of an application's overall expended CPU time. Proposed "deferred set priority" facility allows to largely eliminate the cost of these system calls. Instead of executing a system call to elevate its thread priority, an application simply writes its desired priority level to a designated memory location in the userspace. When the kernel attempts to preempt the thread, it first checks the content of this location, and if the application's stated request to change its priority has been posted in the designated memory area, the kernel will execute this request and alter the priority of the thread being preempted before performing a rescheduling, and then make a scheduling decision based on the new thread priority level thus implementing the priority protection of the critical or time-urgent section desired by the application. In a predominant number of cases however, an application will complete the critical section before the end of the current timeslice and cancel or alter the request held in the userspace area. Thus a vast majority of an application's change priority requests will be handled and mutually cancelled or coalesced within the userspace, at a very low overhead and without incurring the cost of a system call, while maintaining safe preemption control. The cost of an actual kernel-level "set priority" operation is incurred only if an application is actually being preempted while inside the critical section, i.e. typically at most once per scheduling timeslice instead of hundreds or thousands "set priority" system calls in the same timeslice. One of the intended purposes of this facility (but its not sole purpose) is to render a lightweight mechanism for priority protection of lock-holding critical sections that would be an adequate match for lightweight locking primitives such as futex, with both featuring a fast path completing within the userspace. More detailed description can be found in: https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt The patch is currently based on 3.15.2. Patch file: https://github.com/oboguev/dprio/blob/master/patch/linux-3.15.2-dprio.patch https://raw.githubusercontent.com/oboguev/dprio/master/patch/linux-3.15.2-dprio.patch Modified source files: https://github.com/oboguev/dprio/tree/master/src/linux-3.15.2 User-level library implementing userspace-side boilerplate code: https://github.com/oboguev/dprio/tree/master/src/userlib Test set: https://github.com/oboguev/dprio/tree/master/src/test The patch is enabled with CONFIG_DEFERRED_SETPRIO. There is also a config setting for the debug code and a setting that controls the initial value of authorization list restricting the use of the facility based on user or group ids. 
Please see dprio.txt for details. Comments would be appreciated. Thanks, Sergey Signed-off-by: Sergey Oboguev