Re: [RFC 1/2] Eliminate over- and under-counting of io_ticks

2020-06-10 Thread Josh Snyder
Hello!

On 6/9/20 6:41 PM, Hou Tao wrote:
> Hi,
> 
> For the following case, the under-counting is still possible if io2 wins 
> cmpxchg():
> 
>    t          0123456
>    io1        |-----|
>    io2           |--|
>    stamp      0     6
>    io_ticks   0     3

I hadn't noticed that bug. It looks like it can produce an unbounded quantity
of undercounting.
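
To convince myself, here is a minimal user-space sketch (plain Python, not the
kernel code) of the rule in this patch, io_ticks += now - max(stamp, start),
applied to your example under both completion orderings. Whichever completion
wins cmpxchg() determines whether 6 or 3 jiffies get recorded:

def account(completions):
    # completions: (start, end) pairs in the order their accounting runs
    stamp = 0
    io_ticks = 0
    for start, now in completions:
        if stamp != now:                # models a successful cmpxchg()
            io_ticks += now - max(stamp, start)
            stamp = now
        # else: another IO already advanced stamp to now; nothing is added
    return io_ticks

io1 = (0, 6)                            # busy over jiffies 0..6
io2 = (3, 6)                            # busy over jiffies 3..6
print(account([io1, io2]))              # io1 first: 6 (correct)
print(account([io2, io1]))              # io2 first: 3 (undercounted)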

> However considering patch 2 tries to improve sampling rate to 1 us, the
> problem will be gone.

Now that you mention it, the below case is also poorly handled, and will be
incorrect regardless of sampling frequency. It experiences issues both under
this patch (labeled io_ticks) and the current implementation (labeled
io_ticks~):

   t  0123456
   io1|-|
   io2   |-|
   stamp  056
   io_ticks28

   stamp~ 0  3 56
   io_ticks~ 1 34

I am beginning to doubt whether it is even possible to produce an algorithm
that is simultaneously unbiased and synchronization-lite. At the same time,
Ming's comment on patch 2 was leading me to wonder about the value of being
synchronization-lite in the first place. At the proposed sampling rate of
1M/s, it is unlikely that we'd ever exercise the synchronization-free code
path (which, as your case shows, is also incorrect). And for every block
device that I'm aware of (even the ones that return in 10us), the cost of a
disk access still completely dominates the cost of a locked CPU operation by
three orders of magnitude.
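
(For concreteness, with my own back-of-the-envelope figures rather than
anything measured here: 10us per disk access divided by roughly 10ns for an
uncontended locked cmpxchg works out to about a 1000:1 ratio.)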

Josh


[RFC 1/2] Eliminate over- and under-counting of io_ticks

2020-06-08 Thread Josh Snyder
Previously, io_ticks could be under-counted. Consider these I/Os along
the time axis (in jiffies):

  t          012345678
  io1        |----|
  io2            |---|

Under the old approach, io_ticks would count up to 6, like so:

  t          012345678
  io1        |----|
  io2            |---|
  stamp      0   45  8
  io_ticks   1   23  6

With this change, io_ticks instead counts to 8, eliminating the
under-counting:

  t          012345678
  io1        |----|
  io2            |---|
  stamp      0    5  8
  io_ticks   0    5  8
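
As a cross-check (illustration only, not part of the patch), a small Python
model of both accounting rules, fed the start and completion events of the two
I/Os above in order, reproduces the totals of 6 and 8 jiffies:

# Events in processing order: (kind, jiffy, start-jiffy of that IO).
# io1 spans jiffies 0..5, io2 spans jiffies 4..8.
events = [("start", 0, 0), ("start", 4, 4), ("end", 5, 0), ("end", 8, 4)]

def old_io_ticks(events):
    # Old rule: +1 on start, +(now - stamp) on end, whenever stamp != now.
    stamp, ticks = -1, 0                # stamp begins at some earlier jiffy
    for kind, now, _ in events:
        if stamp != now:
            ticks += (now - stamp) if kind == "end" else 1
            stamp = now
    return ticks

def new_io_ticks(events):
    # New rule: only completions count, and only their non-overlapping part.
    stamp, ticks = -1, 0
    for kind, now, start in events:
        if kind == "end" and stamp != now:
            ticks += now - max(stamp, start)
            stamp = now
    return ticks

print(old_io_ticks(events))             # 6
print(new_io_ticks(events))             # 8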

It was also possible for io_ticks to be over-counted. Consider a
workload that issues I/Os deterministically at intervals of 8ms (125Hz).
If each I/O takes 1ms, then the true utilization is 12.5%. The previous
implementation will increment io_ticks once for each jiffy in which an
I/O ends. Since the workload issues an I/O reliably for each jiffy, the
reported utilization will be 100%. This commit changes the approach such
that only I/Os which cross a boundary between jiffies are counted. With
this change, the given workload would count an I/O tick on every eighth
jiffy, resulting in a (correct) calculated utilization of 12.5%.

Signed-off-by: Josh Snyder 
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
 block/blk-core.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d1b79dfe9540..a0bbd9e099b9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1396,14 +1396,22 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
 
-static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
 {
unsigned long stamp;
+   unsigned long elapsed;
 again:
stamp = READ_ONCE(part->stamp);
if (unlikely(stamp != now)) {
-   if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
-   __part_stat_add(part, io_ticks, end ? now - stamp : 1);
+   if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+   // stamp denotes the last IO to finish
+   // If this IO started before stamp, then there was overlap between this IO
+   // and that one. We increment only by the non-overlap time.
+   // If not, there was no overlap and we increment by our own time,
+   // disregarding stamp.
+   elapsed = now - (start < stamp ? stamp : start);
+   __part_stat_add(part, io_ticks, elapsed);
+   }
}
if (part->partno) {
part = &part_to_disk(part)->part0;
@@ -1439,7 +1447,7 @@ void blk_account_io_done(struct request *req, u64 now)
part_stat_lock();
part = req->part;
 
-   update_io_ticks(part, jiffies, true);
+   update_io_ticks(part, jiffies, nsecs_to_jiffies(req->start_time_ns));
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
part_stat_unlock();
@@ -1456,7 +1464,6 @@ void blk_account_io_start(struct request *rq)
rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
part_stat_lock();
-   update_io_ticks(rq->part, jiffies, false);
part_stat_unlock();
 }
 
@@ -1468,7 +1475,6 @@ unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
unsigned long now = READ_ONCE(jiffies);
 
part_stat_lock();
-   update_io_ticks(part, now, false);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, sectors[sgrp], sectors);
part_stat_local_inc(part, in_flight[op_is_write(op)]);
@@ -1487,7 +1493,7 @@ void disk_end_io_acct(struct gendisk *disk, unsigned int op,
unsigned long duration = now - start_time;
 
part_stat_lock();
-   update_io_ticks(part, now, true);
+   update_io_ticks(part, now, start_time);
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_stat_local_dec(part, in_flight[op_is_write(op)]);
part_stat_unlock();
-- 
2.25.1



[RFC 0/2] Increase accuracy and precision of sampled io_ticks

2020-06-08 Thread Josh Snyder
Commit 5b18b5a73760 ("block: delete part_round_stats and switch to less
precise counting") introduces a sampling technique for calculating
io_ticks. The sampling algorithm introduces bias in the calculation of
I/O utilization. In my production system, this bias means that a
workload which previously reported 10% I/O utilization now reports 80%.
Patch 1 of this series eliminates the bias.

The sampling technique is also subject to statistical noise. Because it
infers io_ticks based on only 100 samples per second, io_ticks becomes
imprecise, and subject to swings when measuring both random and
deterministic workloads. Patch 2 of this series provides increased
precision by raising the sampling rate.




[RFC 2/2] Track io_ticks at microsecond granularity.

2020-06-08 Thread Josh Snyder
Previously, we performed truncation of I/O issue/completion times during
calculation of io_ticks, counting only I/Os which cross a jiffy
boundary. The effect is a sampling of I/Os: at every boundary between
jiffies we ask "is there an outstanding I/O" and increment a counter if
the answer is yes. This produces results that are accurate (they don't
systematically over- or under-count), but not precise (there is high
variance associated with only taking 100 samples per second).
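
Concretely, the estimator counts how many sampling boundaries fall inside each
I/O (the same rule the simulation below applies); the start/finish times and
rate used here are made-up values purely for illustration:

def sampled_busy(start_ns, finish_ns, rate):
    # Boundaries of the sampling clock crossed while the I/O was outstanding,
    # converted back to time by dividing by the sampling rate.
    return (int(finish_ns * rate) - int(start_ns * rate)) / rate

# A 1 ms I/O sampled at 100 Hz is credited either 0 ms or 10 ms, depending on
# whether a 10 ms boundary happens to fall inside it.
print(sampled_busy(2_000_000, 3_000_000, 100 / 1_000_000_000))    # 0.0 ns
print(sampled_busy(9_500_000, 10_500_000, 100 / 1_000_000_000))   # 10000000.0 ns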

This change modifies the sampling rate from 100Hz to 976562.5Hz (1
sample per 1024 nanoseconds). I chose this sampling rate by simulating a
workload in which I/Os are issued randomly (by a Poisson process), and
processed in constant time: an M/D/∞ system (Kendall's notation). My
goal was to produce a sampled utilization fraction which was correct to
one part-per-thousand given one second of samples.

The tradeoff of the higher sampling rate is increased synchronization
overhead caused by more frequent compare-and-swap operations. The
technique of commit 5b18b5a73760 ("block: delete part_round_stats and
switch to less precise counting") is to allow multiple I/Os to complete
while performing only one synchronized operation. As we are increasing
the sample rate by a factor of nearly 10000, we will less frequently be
able to exercise the synchronization-free code path.

Included below is the Python script I used to perform the simulation. It
estimates the correct (calculated without sampling) value of %util, and
then reports the root-mean-squared error of the as-sampled estimates.
The parameters `io_rate`, `sample_rates`, and `avgqu_sz` are meant to be
tweaked to fit characteristics of a given workload. I have chosen to
simulate against a difficult workload: 1000 I/Os per second with an
average queue size of 0.01, implying that each I/O takes 10
microseconds. This I/O latency is on par with some of the fastest
production block devices available today, and an order of magnitude
faster than a typical datacenter-grade SSD. With this change, an
estimate of disk %util, as displayed by iostat with four decimal places
at a refresh rate of 1 Hz, will not fluctuate.

#!/usr/bin/env python3
from math import log
from math import sqrt
from random import random

GIGA = 1_000_000_000
SECOND = GIGA

def times(interval, svctm, sample_rates):
    time = 0
    correct = 0

    est_counters = [0] * len(sample_rates)

    while time < SECOND:
        gap = -log(random()) * interval
        busy = svctm if gap > svctm else gap
        finish_time = time + busy

        correct += busy
        for i, rate in enumerate(sample_rates):
            est_counters[i] += (
                float(int(finish_time * rate)) - int(time * rate)
            )

        time += gap

    return correct, [
        correct - (counter / rate)
        for counter, rate in zip(est_counters, sample_rates)
    ]

# How many I/Os per second?
io_rate = 1000
# How frequently are we sampling? (GHz)
sample_rates = [
    100 / GIGA,   #      100 Hz
    1000 / GIGA,  #     1000 Hz
    1 / 65536,    #    15259 Hz
    1 / 16384,    #    61035 Hz
    1 / 1024,     #   976563 Hz
    1 / 64,       # 15625000 Hz
]
avgqu_sz = 0.01

interval = SECOND / io_rate
svctm = interval * avgqu_sz
total = 0
total_errors = [0] * len(sample_rates)
count = 0
while True:
    correct, errors = times(interval, svctm, sample_rates)
    for i, error in enumerate(errors):
        total_errors[i] += error * error
    total += correct / SECOND
    count += 1

    # prints [{RMS error} for rate in sample_rates]
    to_print = [
        "{:05.2f}".format(100 * sqrt(error / count) / SECOND)
        for error in total_errors
    ]
    print(' '.join(to_print))

Signed-off-by: Josh Snyder 
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
---
 block/blk-core.c  | 16 +++-
 block/genhd.c |  4 ++--
 include/linux/genhd.h |  2 +-
 include/linux/part_stat.h |  2 +-
 4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a0bbd9e099b9..2749c52d649c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -62,6 +62,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
 
+#define IO_TICKS_COARSENESS 10
+
 /*
  * For queue allocation
  */
@@ -1396,10 +1398,14 @@ unsigned int blk_rq_err_bytes(const struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_err_bytes);
 
-static void update_io_ticks(struct hd_struct *part, unsigned long now, unsigned long start)
+static void update_io_ticks(struct hd_struct *part, u64 now, u64 start)
 {
-   unsigned long stamp;
-   unsigned long elapsed;
+   u64 stamp;
+   u64 e

Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-01-31 Thread Josh Snyder
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko  wrote:
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

The owner/writable check will suit every database that I've ever used
happycache with, including cassandra, postgres and git.

Acked-by: Josh Snyder 


Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

2019-01-15 Thread Josh Snyder
On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet wrote:
>
> There is a difference with your previous patch though, that used to list no
> page in core when it didn't know; this patch lists pages as in core when it
> refuses to tell. I don't think that's very important, though.

Is there a reason not to return -EPERM in this case?

>
> If anything, the 0400 user-owner file might be a problem in some edge
> case (e.g. if you're preloading git directories, many objects are 0444);
> should we *also* check ownership?...

Yes, this seems valuable. Some databases with immutable files (e.g. git, as
you've mentioned) conceivably operate this way.

Josh


Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

2019-01-15 Thread Josh Snyder
Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.



For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months.  We
rely on cross-process mincore to migrate the contents of a page cache from
machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache dumper/loader
tool. It is quite similar in architecture to pgfincore, except that it is
agnostic to workload. The gist of happycache's operation is "produce a dump
of residence status for each page, do some operation, then reload exactly
the same pages which were present before." happycache is entirely dependent
on accurate reporting of the in-core status of file-backed pages, as
accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses. Because
our workloads never experience a cold page cache, we are able to provision
hardware for a peak utilization level that is far lower than the
hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem), with
no period of in-service degradation. By contrast, putting a database in
service without a page cache entails a potentially unbounded period of
degradation (at Netflix, the time to populate a single node's cache via
natural cache misses varies by workload from hours to weeks). If a single
node upgrade were to take weeks, then upgrading an entire cluster would
take months. Since we want to apply security upgrades (and other things) on
a somewhat tighter schedule, we would have to develop more complex
solutions to provide the same functionality already provided by mincore.

At the bottom line, happycache is designed to benignly exploit the same
information leak documented in the paper [2]. I think it makes perfect
sense to remove cross-process mincore functionality from unprivileged
users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161


[tip:sched/urgent] delayacct: Account blkio completion on the correct task

2018-01-15 Thread tip-bot for Josh Snyder
Commit-ID:  c96f5471ce7d2aefd0dda560cc23f08ab00bc65d
Gitweb: https://git.kernel.org/tip/c96f5471ce7d2aefd0dda560cc23f08ab00bc65d
Author: Josh Snyder 
AuthorDate: Mon, 18 Dec 2017 16:15:10 +
Committer:  Ingo Molnar 
CommitDate: Tue, 16 Jan 2018 03:29:36 +0100

delayacct: Account blkio completion on the correct task

Before commit:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

delayacct_blkio_end() was called after context-switching into the task which
completed I/O.

This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.

But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to 
delayacct_blkio_end(),
so that it can update the statistics of the correct task.

Signed-off-by: Josh Snyder 
Acked-by: Tejun Heo 
Acked-by: Balbir Singh 
Cc: 
Cc: Brendan Gregg 
Cc: Jens Axboe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-bl...@vger.kernel.org
Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-jo...@netflix.com
Signed-off-by: Ingo Molnar 
---
 include/linux/delayacct.h |  8 
 kernel/delayacct.c| 42 ++
 kernel/sched/core.c   |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..5e335b6 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
if (current->delays)
-   __delayacct_blkio_end();
+   __delayacct_blkio_end(p);
delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 static inline void delayacct_blkio_start(void)
 {}
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 static inline int delayacct_add_tsk(struct taskstats *d,
struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2764d7 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
s64 ns = ktime_get_ns() - *start;
unsigned long flags;
 
if (ns > 0) {
-   spin_lock_irqsave(&current->delays->lock, flags);
+   spin_lock_irqsave(lock, flags);
*total += ns;
(*count)++;
-   spin_unlock_irqrestore(&current->delays->lock, flags);
+   spin_unlock_irqrestore(lock, flags);
}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-   if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-   /* Swapin block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->swapin_delay,
-   &current->delays->swapin_count);
-   else    /* Other block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->blkio_delay,
-   &current->delays->

[PATCH v2] delayacct: Account blkio completion on the correct task

2017-12-18 Thread Josh Snyder
Before commit e33a9bba85a8 ("sched/core: move IO scheduling accounting from
io_schedule_timeout() into scheduler"), delayacct_blkio_end was called after
context-switching into the task which completed I/O. This resulted in double
counting: the task would account a delay both waiting for I/O and for time
spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
ttwu, we have not yet context-switched. This is more correct, in that the
delay accounting ends when the I/O is complete. But delayacct_blkio_end
relies upon `get_current()`, and we have not yet context-switched into the
task whose I/O completed. This results in the wrong task having its delay
accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end, so that it can update the statistics of the correct
task.

Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Signed-off-by: Josh Snyder 
---
 include/linux/delayacct.h |  8 
 kernel/delayacct.c| 42 ++
 kernel/sched/core.c   |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..f2ad868 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
if (current->delays)
-   __delayacct_blkio_end();
+   __delayacct_blkio_end(p);
delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 static inline void delayacct_blkio_start(void)
 {}
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 static inline int delayacct_add_tsk(struct taskstats *d,
struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2ec808 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
s64 ns = ktime_get_ns() - *start;
unsigned long flags;
 
if (ns > 0) {
-   spin_lock_irqsave(&current->delays->lock, flags);
+   spin_lock_irqsave(lock, flags);
*total += ns;
(*count)++;
-   spin_unlock_irqrestore(&current->delays->lock, flags);
+   spin_unlock_irqrestore(lock, flags);
}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-   if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-   /* Swapin block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->swapin_delay,
-   &current->delays->swapin_count);
-   else    /* Other block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->blkio_delay,
-   &current->delays->blkio_count);
+   struct task_delay_info *delays = p->delays;
+   u64 *total;
+   u32 *count;
+
+   if (p->delays->flags & DELAYACCT_PF_SWAPIN) {
+   total = &delays->swapin_delay;
+   count = &delays->swapin_count;
+   } else {
+   total = &delays->blkio_delay;
+   count = &delays->blkio_count;
+   }
+
+   delayacct_end(&delays->lock, &delays->blkio_start, total, count);
 }
 
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
@@ -153,8 +161,10 @@ void __delayacct_freepages_start(v

[PATCH] Pass the task_struct explicitly to delayacct_blkio_end

2017-12-15 Thread Josh Snyder
Before e33a9bba85a8, delayacct_blkio_end was called after
context-switching into the task which completed I/O. This resulted in
double counting: the task would account a delay both waiting for I/O and
for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end is called by try_to_wake_up. In
ttwu, we have not yet context-switched. This is more correct, in that the
delay accounting ends when the I/O is complete. But delayacct_blkio_end
relies upon `get_current()`, and we have not yet context-switched into the
task whose I/O completed. This results in the wrong task having its delay
accounting statistics updated.

Instead of doing that, pass the task_struct being woken to
delayacct_blkio_end, so that it can update the statistics of the correct
task_struct.
---
 include/linux/delayacct.h |  8 
 kernel/delayacct.c| 42 ++
 kernel/sched/core.c   |  6 +++---
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 4178d24..f2ad868 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -71,7 +71,7 @@ extern void delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
-extern void __delayacct_blkio_end(void);
+extern void __delayacct_blkio_end(struct task_struct *);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
@@ -122,10 +122,10 @@ static inline void delayacct_blkio_start(void)
__delayacct_blkio_start();
 }
 
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {
if (current->delays)
-   __delayacct_blkio_end();
+   __delayacct_blkio_end(p);
delayacct_clear_flag(DELAYACCT_PF_BLKIO);
 }
 
@@ -169,7 +169,7 @@ static inline void delayacct_tsk_free(struct task_struct *tsk)
 {}
 static inline void delayacct_blkio_start(void)
 {}
-static inline void delayacct_blkio_end(void)
+static inline void delayacct_blkio_end(struct task_struct *p)
 {}
 static inline int delayacct_add_tsk(struct taskstats *d,
struct task_struct *tsk)
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 4a1c334..e2ec808 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -51,16 +51,16 @@ void __delayacct_tsk_init(struct task_struct *tsk)
  * Finish delay accounting for a statistic using its timestamps (@start),
  * accumalator (@total) and @count
  */
-static void delayacct_end(u64 *start, u64 *total, u32 *count)
+static void delayacct_end(spinlock_t *lock, u64 *start, u64 *total, u32 *count)
 {
s64 ns = ktime_get_ns() - *start;
unsigned long flags;
 
if (ns > 0) {
-   spin_lock_irqsave(&current->delays->lock, flags);
+   spin_lock_irqsave(lock, flags);
*total += ns;
(*count)++;
-   spin_unlock_irqrestore(&current->delays->lock, flags);
+   spin_unlock_irqrestore(lock, flags);
}
 }
 
@@ -69,17 +69,25 @@ void __delayacct_blkio_start(void)
current->delays->blkio_start = ktime_get_ns();
 }
 
-void __delayacct_blkio_end(void)
+/*
+ * We cannot rely on the `current` macro, as we haven't yet switched back to
+ * the process being woken.
+ */
+void __delayacct_blkio_end(struct task_struct *p)
 {
-   if (current->delays->flags & DELAYACCT_PF_SWAPIN)
-   /* Swapin block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->swapin_delay,
-   &current->delays->swapin_count);
-   else    /* Other block I/O */
-   delayacct_end(&current->delays->blkio_start,
-   &current->delays->blkio_delay,
-   &current->delays->blkio_count);
+   struct task_delay_info * delays = p->delays;
+   u64 *total;
+   u32 *count;
+
+   if (p->delays->flags & DELAYACCT_PF_SWAPIN) {
+   total = &delays->swapin_delay;
+   count = &delays->swapin_count;
+   } else {
+   total = &delays->blkio_delay;
+   count = &delays->blkio_count;
+   }
+
+   delayacct_end(&delays->lock, &delays->blkio_start, total, count);
 }
 
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
@@ -153,8 +161,10 @@ void __delayacct_freepages_start(void)
 
 void __delayacct_freepages_end(void)
 {
-   delayacct_end(&current->delays->freepages_start,
-   &current->delays->freepages_delay,
-   &current->delays->freepages_count);
+   delayacct_end(
+   &current->delays->lock,
+   &current->delays->freepages_start,
+   &current->delays->freepages_delay,
+   &current->delays->freepage