Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks
Hello!

On 6/9/20 6:41 PM, Hou Tao wrote:
> Hi,
>
> For the following case, the under-counting is still possible if io2 wins
> cmpxchg():
>
> t        0123456
> io1      |-|
> io2         |--|
> stamp    0     6
> io_ticks 0     3

I hadn't noticed that bug. It looks like it can produce an unbounded
quantity of undercount.

> However considering patch 2 tries to improve sampling rate to 1 us, the
> problem will gone.

Now that you mention it, the below case is also poorly handled, and will
be incorrect regardless of sampling frequency. It experiences issues both
under this patch (labeled io_ticks) and the current implementation
(labeled io_ticks~):

t         0123456
io1       |----|
io2          |--|
stamp     0    56
io_ticks       28
stamp~    0  3 56
io_ticks~    1 34

I am beginning to doubt whether it is even possible to produce an
algorithm that is simultaneously unbiased and synchronization-lite. At
the same time, Ming's comment on patch 2 was leading me to wonder about
the value of being synchronization-lite in the first place. At the
proposed sampling rate of 1M/s, it is unlikely that we'd ever exercise
the synchronization-free code path (which, as your case shows, is
incorrect anyway). And for every block device that I'm aware of (even
the ones that return in 10us), the cost of a disk access still
completely dominates the cost of a locked CPU operation by three orders
of magnitude.

Josh
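Hou Tao's interleaving can be replayed with a toy model of the patched accounting rule (a sketch only: a sequential loop stands in for racing completions, the `io_ticks` function name is illustrative, and the clamp on negative credit, which the kernel's unsigned arithmetic would need to make explicit, is written out):

```python
# Toy model of the patched rule: on each completion, credit the time
# from max(start, stamp) to now, then advance stamp.

def io_ticks(completions):
    stamp = 0
    ticks = 0
    for start, end in completions:
        credit = end - max(start, stamp)
        if credit > 0:  # a raced update can compute a negative credit
            ticks += credit
        stamp = max(stamp, end)
    return ticks

# io1 busy [0,2], io2 busy [3,6]: true busy time is 5 jiffies.
in_order = io_ticks([(0, 2), (3, 6)])  # io1's update lands first: 2 + 3 = 5
raced = io_ticks([(3, 6), (0, 2)])     # io2 wins cmpxchg first:   3 + 0 = 3
print(in_order, raced)
```

When io2's update lands first, stamp has already advanced past io1's entire busy interval, so io1's jiffies are simply lost; nothing bounds how much busy time can vanish this way.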
[RFC 1/2] Eliminate over- and under-counting of io_ticks
Previously, io_ticks could be under-counted. Consider these I/Os along
the time axis (in jiffies):

t        012345678
io1      |----|
io2          |---|

Under the old approach, io_ticks would count up to 6, like so:

t        012345678
io1      |----|
io2          |---|
stamp    0   45  8
io_ticks 1   23  6

With this change, io_ticks instead counts to 8, eliminating the
under-counting:

t        012345678
io1      |----|
io2          |---|
stamp    0    5  8
io_ticks 0    5  8

It was also possible for io_ticks to be over-counted. Consider a
workload that issues I/Os deterministically at intervals of 8ms (125Hz).
If each I/O takes 1ms, then the true utilization is 12.5%. The previous
implementation will increment io_ticks once for each jiffy in which an
I/O ends. Since the workload issues an I/O reliably for each jiffy, the
reported utilization will be 100%.

This commit changes the approach such that only I/Os which cross a
boundary between jiffies are counted. With this change, the given
workload would count an I/O tick on every eighth jiffy, resulting in a
(correct) calculated utilization of 12.5%.

Signed-off-by: Josh Snyder
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
 block/blk-core.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d1b79dfe9540..a0bbd9e099b9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
 
-static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
 {
	unsigned long stamp;
+	unsigned long elapsed;
 again:
	stamp = READ_ONCE(part->stamp);
	if (unlikely(stamp != now)) {
-		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
-			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
+		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+			// stamp denotes the last IO to finish
+			// If this IO started before stamp, then there was overlap between this IO
+			// and that one. We increment only by the non-overlap time.
+			// If not, there was no overlap and we increment by our own time,
+			// disregarding stamp.
+			elapsed = now - (start < stamp ? stamp : start);
+			__part_stat_add(part, io_ticks, elapsed);
+		}
	}
	if (part->partno) {
		part = &part_to_disk(part)->part0;
@@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
 
		part_stat_lock();
		part = req->part;
-		update_io_ticks(part, jiffies, true);
+		update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));
		part_stat_inc(part, ios[sgrp]);
		part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
		part_stat_unlock();
@@ -1456,7 +1464,6 @@ void blk_account_io_start(struct request *rq)
		rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
	part_stat_lock();
-	update_io_ticks(rq->part, jiffies, false);
	part_stat_unlock();
 }
@@ -1468,7 +1475,6 @@ unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
	unsigned long now = READ_ONCE(jiffies);
 
	part_stat_lock();
-	update_io_ticks(part, now, false);
	part_stat_inc(part, ios[sgrp]);
	part_stat_add(part, sectors[sgrp], sectors);
	part_stat_local_inc(part, in_flight[op_is_write(op)]);
@@ -1487,7 +1493,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
	unsigned long duration = now - start_time;
 
	part_stat_lock();
-	update_io_ticks(part, now, true);
+	update_io_ticks(part, now, start_time);
	part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
	part_stat_local_dec(part, in_flight[op_is_write(op)]);
	part_stat_unlock();
-- 
2.25.1
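The example from the commit message can be replayed with a toy model of both rules (a sketch only: sequential loops stand in for the kernel's cmpxchg-based update, the function names are illustrative, and `stamp` starts one jiffy before the trace so the first I/O start is credited, as in the diagram):

```python
# Old rule: +1 when an I/O starts in a new jiffy, +(now - stamp) when
# one ends in a new jiffy. New rule: on each completion, credit the
# time from max(start, stamp) to now.

def old_io_ticks(events):
    stamp, ticks = -1, 0          # device last touched one jiffy ago
    for now, kind in events:
        if stamp != now:
            ticks += (now - stamp) if kind == "end" else 1
            stamp = now
    return ticks

def new_io_ticks(completions):
    stamp, ticks = 0, 0
    for start, end in completions:
        ticks += max(0, end - max(start, stamp))
        stamp = max(stamp, end)
    return ticks

# io1 busy [0,5], io2 busy [4,8]: the device is busy for all 8 jiffies.
events = [(0, "start"), (4, "start"), (5, "end"), (8, "end")]
old = old_io_ticks(events)            # 1 + 1 + 1 + 3 = 6 (undercount)
new = new_io_ticks([(0, 5), (4, 8)])  # 5 + 3 = 8 (correct)
print(old, new)
```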
[RFC 0/2] Increase accuracy and precision of sampled io_ticks
Commit 5b18b5a73760 ("block: delete part_round_stats and switch to less
precise counting") introduces a sampling technique for calculating
io_ticks. The sampling algorithm introduces bias in the calculation of
I/O utilization. In my production system, this bias means that a
workload which previously reported 10% I/O utilization now reports 80%.
Patch 1 of this series eliminates the bias.

The sampling technique is also subject to statistical noise. Because it
infers io_ticks based on only 100 samples per second, io_ticks becomes
imprecise, and subject to swings when measuring both random and
deterministic workloads. Patch 2 of this series provides increased
precision by raising the sampling rate.
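For reference, the utilization figure in question is derived from io_ticks roughly as follows (a sketch of the iostat-style calculation, not iostat's source; `percent_util` is a hypothetical name). io_ticks is exported in milliseconds via /proc/diskstats, and %util is its growth divided by the polling interval, so any bias in io_ticks appears directly in the reported percentage:

```python
# %util over one polling interval: how much io_ticks (ms of busy time)
# grew, relative to the wall-clock time that passed.
def percent_util(ticks_before_ms, ticks_after_ms, interval_ms):
    return 100.0 * (ticks_after_ms - ticks_before_ms) / interval_ms

# A device truly busy 100 ms out of a 1000 ms window is 10% utilized;
# a sampling bias that grows io_ticks by 800 ms instead reports 80%.
print(percent_util(0, 100, 1000), percent_util(0, 800, 1000))
```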
[RFC 2/2] Track io_ticks at microsecond granularity.
Previously, we performed truncation of I/O issue/completion times during
calculation of io_ticks, counting only I/Os which cross a jiffy
boundary. The effect is a sampling of I/Os: at every boundary between
jiffies we ask "is there an outstanding I/O?" and increment a counter if
the answer is yes. This produces results that are accurate (they don't
systematically over- or under-count), but not precise (there is high
variance associated with only taking 100 samples per second).

This change modifies the sampling rate from 100Hz to 976562.5Hz (1
sample per 1024 nanoseconds). I chose this sampling rate by simulating a
workload in which I/Os are issued randomly (by a Poisson process), and
processed in constant time: an M/D/∞ system (Kendall's notation). My
goal was to produce a sampled utilization fraction which was correct to
one part-per-thousand given one second of samples.

The tradeoff of the higher sampling rate is increased synchronization
overhead caused by more frequent compare-and-swap operations. The
technique of commit 5b18b5a73760 ("block: delete part_round_stats and
switch to less precise counting") is to allow multiple I/Os to complete
while performing only one synchronized operation. As we are increasing
the sample rate by a factor of roughly 10^4, we will less frequently be
able to exercise the synchronization-free code path.

Included below is the Python script I used to perform the simulation. It
estimates the correct (calculated without sampling) value of %util, and
then reports the root-mean-squared error of the as-sampled estimates.
The parameters `io_rate`, `sample_rates`, and `avgqu_sz` are meant to be
tweaked to fit characteristics of a given workload. I have chosen to
simulate against a difficult workload: 1000 I/Os per second with an
average queue size of 0.01, implying that each I/O takes 10 microseconds.
This I/O latency is on par with some of the fastest production block
devices available today, and an order of magnitude faster than a typical
datacenter-grade SSD. With this change, an estimate of disk %util will
not fluctuate, as displayed by iostat with four decimal places, at a
refresh rate of 1 Hz.

#!/usr/bin/env python3
from math import log
from math import sqrt
from random import random

GIGA = 1_000_000_000
SECOND = GIGA


def times(interval, svctm, sample_rates):
    time = 0
    correct = 0
    est_counters = [0] * len(sample_rates)
    while time < SECOND:
        gap = -log(random()) * interval
        busy = svctm if gap > svctm else gap
        finish_time = time + busy
        correct += busy
        for i, rate in enumerate(sample_rates):
            est_counters[i] += (
                float(int(finish_time * rate)) - int(time * rate)
            )
        time += gap
    return correct, [
        correct - (counter / rate)
        for counter, rate in zip(est_counters, sample_rates)
    ]


# How many I/Os per second?
io_rate = 1000
# How frequently are we sampling? (GHz)
sample_rates = [
    100 / GIGA,   # 100 Hz
    1000 / GIGA,  # 1000 Hz
    1 / 65536,    # 15259 Hz
    1 / 16384,    # 61035 Hz
    1 / 1024,     # 976563 Hz
    1 / 64,       # 15625000 Hz
]
avgqu_sz = 0.01

interval = SECOND / io_rate
svctm = interval * avgqu_sz

total = 0
total_errors = [0] * len(sample_rates)
count = 0
while True:
    correct, errors = times(interval, svctm, sample_rates)
    for i, error in enumerate(errors):
        total_errors[i] += error * error
    total += correct / SECOND
    count += 1
    # prints [{RMS error} for rate in sample_rates]
    to_print = [
        "{:05.2f}".format(100 * sqrt(error / count) / SECOND)
        for error in total_errors
    ]
    print(' '.join(to_print))

Signed-off-by: Josh Snyder
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
 block/blk-core.c          | 16 +++++++++++-----
 block/genhd.c             |  4 ++--
 include/linux/genhd.h     |  2 +-
 include/linux/part_stat.h |  2 +-
 4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a0bbd9e099b9..2749c52d649c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -62,6 +62,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
 
+#define IO_TICKS_COARSENESS 10
+
 /*
  * For queue allocation
  */
@@ -1396,10 +1398,14 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
 
-static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
+static void update_io_ticks(struct hd_struct *part, u64 now, u64 start)
 {
-	unsigned long stamp;
-	unsigned long elapsed;
+	u64 stamp;
+	u64 e
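The per-I/O credit under this scheme reduces to counting how many sample-period boundaries the I/O crosses (a sketch of the arithmetic only; the function name is illustrative, not kernel code):

```python
# At a sampling period of 1024 ns, an I/O running from start_ns to
# end_ns is credited one sample per period boundary it crosses.
PERIOD_NS = 1024

def samples_credited(start_ns, end_ns, period_ns=PERIOD_NS):
    return end_ns // period_ns - start_ns // period_ns

# A 10 us I/O (the simulated service time above) earns 9 or 10 samples
# depending on its phase relative to the boundaries; at the old 100 Hz
# rate (a 10,000,000 ns period) the same I/O usually earns none.
print(samples_credited(0, 10_000), samples_credited(700, 10_700))
print(samples_credited(0, 10_000, period_ns=10_000_000))
```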
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko wrote:
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

The owner/writable check will suit every database that I've ever used
happycache with, including cassandra, postgres and git.

Acked-by: Josh Snyder
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet wrote:
>
> There is a difference with your previous patch though, that used to list no
> page in core when it didn't know; this patch lists pages as in core when it
> refuses to tell. I don't think that's very important, though.

Is there a reason not to return -EPERM in this case?

> If anything, the 0400 user-owner file might be a problem in some edge
> case (e.g. if you're preloading git directories, many objects are 0444);
> should we *also* check ownership?...

Yes, this seems valuable. Some databases with immutable files (e.g. git,
as you've mentioned) conceivably operate this way.

Josh
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.

For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months. We
rely on cross-process mincore to migrate the contents of a page cache
from machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache
dumper/loader tool. It is quite similar in architecture to pgfincore,
except that it is agnostic to workload. The gist of happycache's
operation is "produce a dump of residence status for each page, do some
operation, then reload exactly the same pages which were present
before." happycache is entirely dependent on accurate reporting of the
in-core status of file-backed pages, as accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses.
Because our workloads never experience a cold page cache, we are able to
provision hardware for a peak utilization level that is far lower than
the hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem),
with no period of in-service degradation. By contrast, putting a
database in service without a page cache entails a potentially unbounded
period of degradation (at Netflix, the time to populate a single node's
cache via natural cache misses varies by workload from hours to weeks).
If a single node upgrade were to take weeks, then upgrading an entire
cluster would take months. Since we want to apply security upgrades (and
other things) on a somewhat tighter schedule, we would have to develop
more complex solutions to provide the same functionality already
provided by mincore.

The bottom line is that happycache is designed to benignly exploit the
same information leak documented in the paper [2]. I think it makes
perfect sense to remove cross-process mincore functionality from
unprivileged users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161
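The dump half of that operation can be sketched in a few lines (an illustrative sketch assuming Linux, not happycache's actual code; the `residency` helper is hypothetical, and the mapping is made PROT_WRITE|MAP_PRIVATE only so that ctypes can take the buffer's address — nothing is written):

```python
# Ask the kernel which pages of a file are resident, via mincore(2).
import ctypes
import ctypes.util
import mmap
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def residency(path):
    """Return one bool per page of `path`: True if the page is in core."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    fd = os.open(path, os.O_RDONLY)
    try:
        # MAP_PRIVATE + PROT_WRITE is legal on a read-only fd; it makes
        # the Python buffer writable so ctypes can expose its address.
        mm = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
            vec = (ctypes.c_ubyte * pages)()
            addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
            if libc.mincore(ctypes.c_void_p(addr),
                            ctypes.c_size_t(size), vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            return [bool(b & 1) for b in vec]  # bit 0 = resident
        finally:
            mm.close()
    finally:
        os.close(fd)
```

The load half is the inverse: for each page marked resident in the dump, read it back (or posix_fadvise(POSIX_FADV_WILLNEED) the range) to repopulate the page cache.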
[tip:sched/urgent] delayacct: Account blkio completion on the correct task
Commit-ID:  c96f5471ce7d2aefd0dda560cc23f08ab00bc65d
Gitweb:     https://git.kernel.org/tip/c96f5471ce7d2aefd0dda560cc23f08ab00bc65d
Author:     Josh Snyder
AuthorDate: Mon, 18 Dec 2017 16:15:10 +
Committer:  Ingo Molnar
CommitDate: Tue, 16 Jan 2018 03:29:36 +0100

delayacct: Account blkio completion on the correct task

Before commit:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

delayacct_blkio_end() was called after context-switching into the task
which completed I/O. This resulted in double counting: the task would
account a delay both waiting for I/O and for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete. But
delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end(), so that it can update the statistics of the
correct task.
Signed-off-by: Josh Snyder
Acked-by: Tejun Heo
Acked-by: Balbir Singh
Cc:
Cc: Brendan Gregg
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-bl...@vger.kernel.org
Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-jo...@netflix.com
Signed-off-by: Ingo Molnar
---
 include/linux/delayacct.h |  8 ++++----
 kernel/delayacct.c        | 42 ++++++++++++++++++++++++++------------------
 kernel/sched/core.c       |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..5e335b6 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
		__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
	if (current->delays)
-		__delayacct_blkio_end();
+		__delayacct_blkio_end(p);
	delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 
 static inline void delayacct_blkio_start(void)
 {}
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 
 static inline int delayacct_add_tsk(struct taskstats *d,
					struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2764d7 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
	s64 ns = ktime_get_ns() - *start;
	unsigned long flags;
 
	if (ns > 0) {
-		spin_lock_irqsave(&current->delays->lock, flags);
+		spin_lock_irqsave(lock, flags);
		*total += ns;
		(*count)++;
-		spin_unlock_irqrestore(&current->delays->lock, flags);
+		spin_unlock_irqrestore(lock, flags);
	}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
	current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-	if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-		/* Swapin block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->swapin_delay,
-			&current->delays->swapin_count);
-	else	/* Other block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->blkio_delay,
-			&current->delays->
[PATCH v2] delayacct: Account blkio completion on the correct task
Before commit e33a9bba85a8 ("sched/core: move IO scheduling accounting
from io_schedule_timeout() into scheduler"), delayacct_blkio_end was
called after context-switching into the task which completed I/O. This
resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete. But
delayacct_blkio_end relies upon `get_current()`, and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end, so that it can update the statistics of the correct
task.

Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Signed-off-by: Josh Snyder
---
 include/linux/delayacct.h |  8 ++++----
 kernel/delayacct.c        | 42 ++++++++++++++++++++++++++------------------
 kernel/sched/core.c       |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..f2ad868 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
		__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
	if (current->delays)
-		__delayacct_blkio_end();
+		__delayacct_blkio_end(p);
	delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 
 static inline void delayacct_blkio_start(void)
 {}
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 
 static inline int delayacct_add_tsk(struct taskstats *d,
					struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2ec808 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
	s64 ns = ktime_get_ns() - *start;
	unsigned long flags;
 
	if (ns > 0) {
-		spin_lock_irqsave(&current->delays->lock, flags);
+		spin_lock_irqsave(lock, flags);
		*total += ns;
		(*count)++;
-		spin_unlock_irqrestore(&current->delays->lock, flags);
+		spin_unlock_irqrestore(lock, flags);
	}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
	current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-	if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-		/* Swapin block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->swapin_delay,
-			&current->delays->swapin_count);
-	else	/* Other block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->blkio_delay,
-			&current->delays->blkio_count);
+	struct task_delay_info *delays = p->delays;
+	u64 *total;
+	u32 *count;
+
+	if (p->delays->flags & DELAYACCT_PF_SWAPIN) {
+		total = &delays->swapin_delay;
+		count = &delays->swapin_count;
+	} else {
+		total = &delays->blkio_delay;
+		count = &delays->blkio_count;
+	}
+
+	delayacct_end(&delays->lock, &delays->blkio_start, total, count);
 }
 
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
@@ -153,8 +161,10 @@ void __delayacct_freepages_start(v
[PATCH] Pass the task_struct explicitly to delayacct_blkio_end
Before e33a9bba85a8, delayacct_blkio_end was called after
context-switching into the task which completed I/O. This resulted in
double counting: the task would account a delay both waiting for I/O and
for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete. But
delayacct_blkio_end relies upon `get_current()`, and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end, so that it can update the statistics of the correct
task_struct.
---
 include/linux/delayacct.h |  8 ++++----
 kernel/delayacct.c        | 42 ++++++++++++++++++++++++++------------------
 kernel/sched/core.c       |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..f2ad868 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
		__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
	if (current->delays)
-		__delayacct_blkio_end();
+		__delayacct_blkio_end(p);
	delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 
 static inline void delayacct_blkio_start(void)
 {}
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 
 static inline int delayacct_add_tsk(struct taskstats *d,
					struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2ec808 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
	s64 ns = ktime_get_ns() - *start;
	unsigned long flags;
 
	if (ns > 0) {
-		spin_lock_irqsave(&current->delays->lock, flags);
+		spin_lock_irqsave(lock, flags);
		*total += ns;
		(*count)++;
-		spin_unlock_irqrestore(&current->delays->lock, flags);
+		spin_unlock_irqrestore(lock, flags);
	}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
	current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-	if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-		/* Swapin block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->swapin_delay,
-			&current->delays->swapin_count);
-	else	/* Other block I/O */
-		delayacct_end(&current->delays->blkio_start,
-			&current->delays->blkio_delay,
-			&current->delays->blkio_count);
+	struct task_delay_info * delays = p->delays;
+	u64 *total;
+	u32 *count;
+
+	if (p->delays->flags & DELAYACCT_PF_SWAPIN) {
+		total = &delays->swapin_delay;
+		count = &delays->swapin_count;
+	} else {
+		total = &delays->blkio_delay;
+		count = &delays->blkio_count;
+	}
+
+	delayacct_end(&delays->lock, &delays->blkio_start, total, count);
 }
 
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
@@ -153,8 +161,10 @@ void __delayacct_freepages_start(void)
 
 void __delayacct_freepages_end(void)
 {
-	delayacct_end(&current->delays->freepages_start,
-		&current->delays->freepages_delay,
-		&current->delays->freepages_count);
+	delayacct_end(
+		&current->delays->lock,
+		&current->delays->freepages_start,
+		&current->delays->freepages_delay,
+		&current->delays->freepage