Re: bug in tag handling in blk-mq?

2018-05-07 Thread Mike Galbraith
On Mon, 2018-05-07 at 20:02 +0200, Paolo Valente wrote:
> 
> 
> > Is there a reproducer?

Just building fat config kernels works for me.  It was highly non-
deterministic, but reproduced quickly twice in a row with Paolo's hack.
  
> Ok Mike, I guess it's your turn now, for at least a stack trace.

Sure.  I'm deadlined ATM, but will get to it.

-Mike


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Theodore Y. Ts'o
On Tue, May 08, 2018 at 09:28:17AM +0900, Tetsuo Handa wrote:
> The thread I mean is:
> 
>   general protection fault in lo_ioctl (2)
>   
> https://syzkaller.appspot.com/bug?id=f3cfe26e785d85f9ee259f385515291d21bd80a3
> 
> Are you sure that your patch solves this problem as well?

Well, I can't be sure, since there's not enough information in that
particular syzkaller report to definitively pin down the root cause.

And while I can't reproduce the crash using the syzkaller repro with
the patch, I can't reproduce the crash *without* the patch, either.

This is what Syzkaller has to say, but of course, in its own
documentation's words, "It's only a dumb bot".  :-)

That being said, triggering the problem it is so concerned about
requires root privileges, so I would not consider it high priority to
track down --- especially given that we don't have a reliable
reproducer for it.


- Ted

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger a
crash:

Reported-and-tested-by: syzbot+bf89c128e05dd6c62...@syzkaller.appspotmail.com

Tested on:

commit: 170785a9cc72 loop: add recursion validation to LOOP_CHANGE..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/loop-fix
kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
userspace arch: i386

Note: testing is done by a robot and is best-effort only.




Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread kbuild test robot
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: x86_64-randconfig-x012-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from kernel/livepatch/../sched/sched.h:1317:0,
from kernel/livepatch/transition.c:27:
>> kernel/livepatch/../sched/stats.h:126:1: error: expected identifier or '(' 
>> before '{' token
{
^

vim +126 kernel/livepatch/../sched/stats.h

57  
58  #ifdef CONFIG_PSI
59  /*
60   * PSI tracks state that persists across sleeps, such as iowaits and
61   * memory stalls. As a result, it has to distinguish between sleeps,
62   * where a task's runnable state changes, and requeues, where a task
63   * and its state are being moved between CPUs and runqueues.
64   */
65  static inline void psi_enqueue(struct task_struct *p, u64 now)
66  {
67  int clear = 0, set = TSK_RUNNING;
68  
69  if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70  if (p->flags & PF_MEMSTALL)
71  set |= TSK_MEMSTALL;
72  p->sched_psi_wake_requeue = 0;
73  } else {
74  if (p->in_iowait)
75  clear |= TSK_IOWAIT;
76  }
77  
78  psi_task_change(p, now, clear, set);
79  }
80  static inline void psi_dequeue(struct task_struct *p, u64 now)
81  {
82  int clear = TSK_RUNNING, set = 0;
83  
84  if (p->state == TASK_RUNNING) {
85  if (p->flags & PF_MEMSTALL)
86  clear |= TSK_MEMSTALL;
87  } else {
88  if (p->in_iowait)
89  set |= TSK_IOWAIT;
90  }
91  
92  psi_task_change(p, now, clear, set);
93  }
94  static inline void psi_ttwu_dequeue(struct task_struct *p)
95  {
96  /*
97   * Is the task being migrated during a wakeup? Make sure to
98   * deregister its sleep-persistent psi states from the old
99   * queue, and let psi_enqueue() know it has to requeue.
   100   */
   101  if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
   102  struct rq_flags rf;
   103  struct rq *rq;
   104  int clear = 0;
   105  
   106  if (p->in_iowait)
   107  clear |= TSK_IOWAIT;
   108  if (p->flags & PF_MEMSTALL)
   109  clear |= TSK_MEMSTALL;
   110  
   111  rq = __task_rq_lock(p, &rf);
   112  update_rq_clock(rq);
   113  psi_task_change(p, rq_clock(rq), clear, 0);
   114  p->sched_psi_wake_requeue = 1;
   115  __task_rq_unlock(rq, &rf);
   116  }
   117  }
   118  #else /* CONFIG_PSI */
   119  static inline void psi_enqueue(struct task_struct *p, u64 now)
   120  {
   121  }
   122  static inline void psi_dequeue(struct task_struct *p, u64 now)
   123  {
   124  }
   125  static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 > 126  {
   127  }
   128  #endif /* CONFIG_PSI */
   129  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread kbuild test robot
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4]
[cannot apply to next-20180507]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-workingset-don-t-drop-refault-information-prematurely/20180508-081214
config: i386-randconfig-x073-201818 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from kernel/sched/sched.h:1317:0,
from kernel/sched/core.c:8:
>> kernel/sched/stats.h:126:1: error: expected identifier or '(' before '{' 
>> token
{
^

vim +126 kernel/sched/stats.h

57  
58  #ifdef CONFIG_PSI
59  /*
60   * PSI tracks state that persists across sleeps, such as iowaits and
61   * memory stalls. As a result, it has to distinguish between sleeps,
62   * where a task's runnable state changes, and requeues, where a task
63   * and its state are being moved between CPUs and runqueues.
64   */
65  static inline void psi_enqueue(struct task_struct *p, u64 now)
66  {
67  int clear = 0, set = TSK_RUNNING;
68  
69  if (p->state == TASK_RUNNING || p->sched_psi_wake_requeue) {
70  if (p->flags & PF_MEMSTALL)
71  set |= TSK_MEMSTALL;
72  p->sched_psi_wake_requeue = 0;
73  } else {
74  if (p->in_iowait)
75  clear |= TSK_IOWAIT;
76  }
77  
78  psi_task_change(p, now, clear, set);
79  }
80  static inline void psi_dequeue(struct task_struct *p, u64 now)
81  {
82  int clear = TSK_RUNNING, set = 0;
83  
84  if (p->state == TASK_RUNNING) {
85  if (p->flags & PF_MEMSTALL)
86  clear |= TSK_MEMSTALL;
87  } else {
88  if (p->in_iowait)
89  set |= TSK_IOWAIT;
90  }
91  
92  psi_task_change(p, now, clear, set);
93  }
94  static inline void psi_ttwu_dequeue(struct task_struct *p)
95  {
96  /*
97   * Is the task being migrated during a wakeup? Make sure to
98   * deregister its sleep-persistent psi states from the old
99   * queue, and let psi_enqueue() know it has to requeue.
   100   */
   101  if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
   102  struct rq_flags rf;
   103  struct rq *rq;
   104  int clear = 0;
   105  
   106  if (p->in_iowait)
   107  clear |= TSK_IOWAIT;
   108  if (p->flags & PF_MEMSTALL)
   109  clear |= TSK_MEMSTALL;
   110  
   111  rq = __task_rq_lock(p, &rf);
   112  update_rq_clock(rq);
   113  psi_task_change(p, rq_clock(rq), clear, 0);
   114  p->sched_psi_wake_requeue = 1;
   115  __task_rq_unlock(rq, &rf);
   116  }
   117  }
   118  #else /* CONFIG_PSI */
   119  static inline void psi_enqueue(struct task_struct *p, u64 now)
   120  {
   121  }
   122  static inline void psi_dequeue(struct task_struct *p, u64 now)
   123  {
   124  }
   125  static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 > 126  {
   127  }
   128  #endif /* CONFIG_PSI */
   129  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH v4] block: add verifier for cmdline partition

2018-05-07 Thread Wang YanQing
I met a strange filesystem corruption issue recently; the cause was
overlapping partitions in the cmdline partition argument.

This patch adds a verifier for cmdline partitions: if partitions
overlap, cmdline_partition will log a warning. We don't treat
overlapping partitions as an error:
"
Caizhiyong  said:
Partition overlap was intentionally designed in this cmdline partition.
reference http://lists.infradead.org/pipermail/linux-mtd/2013-August/048092.html
"

Signed-off-by: Wang YanQing 
---
 Changes
 v3-v4:
 1:Fix grammar typo, reported by Randy Dunlap.
 2:Delete unnecessary type casting, reported by Caizhiyong.
 3:Record in the changelog the reason why we can't treat overlapping
   partitions as errors.

 v2-v3:
 1:Fix logging one pair of overlapping partitions twice in cmdline_parts_verifier.
 2:Fix out-of-bounds access in cmdline_parts_verifier.

 v1-v2:
 1:Don't treat overlapping partitions as an error, but log a warning.
 block/partitions/cmdline.c | 66 ++
 1 file changed, 66 insertions(+)

diff --git a/block/partitions/cmdline.c b/block/partitions/cmdline.c
index e333583..2fd1b99 100644
--- a/block/partitions/cmdline.c
+++ b/block/partitions/cmdline.c
@@ -58,6 +58,71 @@ static int __init cmdline_parts_setup(char *s)
 }
 __setup("blkdevparts=", cmdline_parts_setup);
 
+static bool has_overlaps(sector_t from, sector_t size,
+sector_t from2, sector_t size2)
+{
+   sector_t end = from + size;
+   sector_t end2 = from2 + size2;
+
+   if (from >= from2 && from < end2)
+   return true;
+
+   if (end > from2 && end <= end2)
+   return true;
+
+   if (from2 >= from && from2 < end)
+   return true;
+
+   if (end2 > from && end2 <= end)
+   return true;
+
+   return false;
+}
+
+static inline void overlaps_warns_header(void)
+{
+   pr_warn("\n");
+   pr_warn("Overlapping partitions are used in command line 
partitions.\n");
+   pr_warn("Don't use filesystems on overlapping partitions:\n");
+}
+
+static inline void overlaps_warns_tailer(void)
+{
+   pr_warn("\n");
+}
+
+static void cmdline_parts_verifier(int slot, struct parsed_partitions *state)
+{
+   int i;
+   bool header = true;
+
+   for (; slot < state->limit && state->parts[slot].has_info; slot++) {
+   for (i = slot+1; i < state->limit && state->parts[i].has_info;
+i++) {
+   if (has_overlaps(state->parts[slot].from,
+state->parts[slot].size,
+state->parts[i].from,
+state->parts[i].size)) {
+   if (header) {
+   header = false;
+   overlaps_warns_header();
+   }
+   pr_warn("%s[%llu,%llu] overlaps with "
+   "%s[%llu,%llu].\n",
+   state->parts[slot].info.volname,
+   (u64)state->parts[slot].from << 9,
+   (u64)state->parts[slot].size << 9,
+   state->parts[i].info.volname,
+   (u64)state->parts[i].from << 9,
+   (u64)state->parts[i].size << 9);
+   }
+   }
+   }
+
+   if (!header)
+   overlaps_warns_tailer();
+}
+
 /*
  * Purpose: allocate cmdline partitions.
  * Returns:
@@ -93,6 +158,7 @@ int cmdline_partition(struct parsed_partitions *state)
disk_size = get_capacity(state->bdev->bd_disk) << 9;
 
cmdline_parts_set(parts, disk_size, 1, add_part, (void *)state);
+   cmdline_parts_verifier(1, state);
 
strlcat(state->pp_buf, "\n", PAGE_SIZE);
 
-- 
1.8.5.6.2.g3d8a54e.dirty


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Randy Dunlap
On 05/07/2018 02:01 PM, Johannes Weiner wrote:
> 
> Signed-off-by: Johannes Weiner 
> ---
>  Documentation/accounting/psi.txt |  73 ++
>  include/linux/psi.h  |  27 ++
>  include/linux/psi_types.h|  84 ++
>  include/linux/sched.h|  10 +
>  include/linux/sched/stat.h   |  10 +-
>  init/Kconfig |  16 ++
>  kernel/fork.c|   4 +
>  kernel/sched/Makefile|   1 +
>  kernel/sched/core.c  |   3 +
>  kernel/sched/psi.c   | 424 +++
>  kernel/sched/sched.h | 166 ++--
>  kernel/sched/stats.h |  91 ++-
>  mm/compaction.c  |   5 +
>  mm/filemap.c |  15 +-
>  mm/page_alloc.c  |  10 +
>  mm/vmscan.c  |  13 +
>  16 files changed, 859 insertions(+), 93 deletions(-)
>  create mode 100644 Documentation/accounting/psi.txt
>  create mode 100644 include/linux/psi.h
>  create mode 100644 include/linux/psi_types.h
>  create mode 100644 kernel/sched/psi.c
> 
> diff --git a/Documentation/accounting/psi.txt 
> b/Documentation/accounting/psi.txt
> new file mode 100644
> index ..e051810d5127
> --- /dev/null
> +++ b/Documentation/accounting/psi.txt
> @@ -0,0 +1,73 @@

Looks good to me.


> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> new file mode 100644
> index ..052c529a053b
> --- /dev/null
> +++ b/kernel/sched/psi.c
> @@ -0,0 +1,424 @@
> +/*
> + * Measure workload productivity impact from overcommitting CPU, memory, IO
> + *
> + * Copyright (c) 2017 Facebook, Inc.
> + * Author: Johannes Weiner 
> + *
> + * Implementation
> + *
> + * Task states -- running, iowait, memstall -- are tracked through the
> + * scheduler and aggregated into a system-wide productivity state. The
> + * ratio between the times spent in productive states and delays tells
> + * us the overall productivity of the workload.
> + *
> + * The ratio is tracked in decaying time averages over 10s, 1m, 5m
> + * windows. Cumluative stall times are tracked and exported as well to

   Cumulative

> + * allow detection of latency spikes and custom time averaging.
> + *
> + * Multiple CPUs
> + *
> + * To avoid cache contention, times are tracked local to the CPUs. To
> + * get a comprehensive view of a system or cgroup, we have to consider
> + * the fact that CPUs could be unevenly loaded or even entirely idle
> + * if the workload doesn't have enough threads. To avoid artifacts
> + * caused by that, when adding up the global pressure ratio, the
> + * CPU-local ratios are weighed according to their non-idle time:
> + *
> + *   Time the CPU had stalled tasks     Time the CPU was non-idle
> + *   ------------------------------  *  ---------------------------
> + *            Walltime                  Time all CPUs were non-idle
> + */


> +
> +/**
> + * psi_memstall_leave - mark the end of an memory stall section

end of a memory

> + * @flags: flags to handle nested memdelay sections
> + *
> + * Marks the calling task as no longer stalled due to lack of memory.
> + */
> +void psi_memstall_leave(unsigned long *flags)
> +{



-- 
~Randy


Re: [PATCH 4/4] blk-wbt: throttle discards like background writes

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:35AM -0600, Jens Axboe wrote:
> Throttle discards like we would any background write. Discards should
> be background activity, so if they are impacting foreground IO, then
> we will throttle them down.
> 
> Signed-off-by: Jens Axboe 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  block/blk-stat.h |  6 +++---
>  block/blk-wbt.c  | 43 ++-
>  block/blk-wbt.h  |  4 +++-
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-stat.h b/block/blk-stat.h
> index 2dd36347252a..c22049a8125e 100644
> --- a/block/blk-stat.h
> +++ b/block/blk-stat.h
> @@ -10,11 +10,11 @@
>  
>  /*
>   * from upper:
> - * 3 bits: reserved for other usage
> + * 4 bits: reserved for other usage
>   * 12 bits: size
> - * 49 bits: time
> + * 48 bits: time
>   */
> -#define BLK_STAT_RES_BITS3
> +#define BLK_STAT_RES_BITS4
>  #define BLK_STAT_SIZE_BITS   12
>  #define BLK_STAT_RES_SHIFT   (64 - BLK_STAT_RES_BITS)
>  #define BLK_STAT_SIZE_SHIFT  (BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 25d202345965..a7a724580033 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -106,6 +106,8 @@ static inline struct rq_wait *get_rq_wait(struct rq_wb 
> *rwb,
>  {
>   if (wb_acct & WBT_KSWAPD)
>   return &rwb->rq_wait[WBT_RWQ_KSWAPD];
> + else if (wb_acct & WBT_DISCARD)
> + return &rwb->rq_wait[WBT_RWQ_DISCARD];
>  
>   return &rwb->rq_wait[WBT_RWQ_BG];
>  }
> @@ -143,10 +145,13 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags 
> wb_acct)
>   }
>  
>   /*
> -  * If the device does write back caching, drop further down
> -  * before we wake people up.
> +  * For discards, our limit is always the background. For writes, if
> +  * the device does write back caching, drop further down before we
> +  * wake people up.
>*/
> - if (rwb->wc && !wb_recent_wait(rwb))
> + if (wb_acct & WBT_DISCARD)
> + limit = rwb->wb_background;
> + else if (rwb->wc && !wb_recent_wait(rwb))
>   limit = 0;
>   else
>   limit = rwb->wb_normal;
> @@ -483,6 +488,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
> unsigned long rw)
>  {
>   unsigned int limit;
>  
> + if ((rw & REQ_OP_MASK) == REQ_OP_DISCARD)
> + return rwb->wb_background;
> +
>   /*
>* At this point we know it's a buffered write. If this is
>* kswapd trying to free memory, or REQ_SYNC is set, then
> @@ -564,21 +572,20 @@ static void __wbt_wait(struct rq_wb *rwb, enum 
> wbt_flags wb_acct,
>  
>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
>  {
> - const int op = bio_op(bio);
> -
> - /*
> -  * If not a WRITE, do nothing
> -  */
> - if (op != REQ_OP_WRITE)
> - return false;
> -
> - /*
> -  * Don't throttle WRITE_ODIRECT
> -  */
> - if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
> + switch (bio_op(bio)) {
> + case REQ_OP_WRITE:
> + /*
> +  * Don't throttle WRITE_ODIRECT
> +  */
> + if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
> + (REQ_SYNC | REQ_IDLE))
> + return false;
> + /* fallthrough */
> + case REQ_OP_DISCARD:
> + return true;
> + default:
>   return false;
> -
> - return true;
> + }
>  }
>  
>  /*
> @@ -605,6 +612,8 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio 
> *bio, spinlock_t *lock)
>  
>   if (current_is_kswapd())
>   ret |= WBT_KSWAPD;
> + if (bio_op(bio) == REQ_OP_DISCARD)
> + ret |= WBT_DISCARD;
>  
>   __wbt_wait(rwb, ret, bio->bi_opf, lock);
>  
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index 8038b4a0d4ef..d6a125e49db5 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -14,13 +14,15 @@ enum wbt_flags {
>   WBT_TRACKED = 1,/* write, tracked for throttling */
>   WBT_READ= 2,/* read */
>   WBT_KSWAPD  = 4,/* write, from kswapd */
> + WBT_DISCARD = 8,/* discard */
>  
> - WBT_NR_BITS = 3,/* number of bits */
> + WBT_NR_BITS = 4,/* number of bits */
>  };
>  
>  enum {
>   WBT_RWQ_BG  = 0,
>   WBT_RWQ_KSWAPD,
> + WBT_RWQ_DISCARD,
>   WBT_NUM_RWQ,
>  };
>  
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] blk-wbt: pass in enum wbt_flags to get_rq_wait()

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:34AM -0600, Jens Axboe wrote:
> This is in preparation for having more write queues, in which
> case we would have needed to pass in more information than just
> a simple 'is_kswapd' boolean.
> 
> Signed-off-by: Jens Axboe 

Looks ok (having not run any kind of testing on this yet),
Reviewed-by: Darrick J. Wong 

--D

> ---
>  block/blk-wbt.c | 25 +++--
>  block/blk-wbt.h |  4 +++-
>  2 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 3e34b41bcefc..25d202345965 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -101,9 +101,13 @@ static bool wb_recent_wait(struct rq_wb *rwb)
>   return time_before(jiffies, wb->dirty_sleep + HZ);
>  }
>  
> -static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
> +static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
> +   enum wbt_flags wb_acct)
>  {
> - return &rwb->rq_wait[is_kswapd];
> + if (wb_acct & WBT_KSWAPD)
> + return &rwb->rq_wait[WBT_RWQ_KSWAPD];
> +
> + return &rwb->rq_wait[WBT_RWQ_BG];
>  }
>  
>  static void rwb_wake_all(struct rq_wb *rwb)
> @@ -126,7 +130,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
>   if (!(wb_acct & WBT_TRACKED))
>   return;
>  
> - rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> + rqw = get_rq_wait(rwb, wb_acct);
>   inflight = atomic_dec_return(&rqw->inflight);
>  
>   /*
> @@ -529,11 +533,12 @@ static inline bool may_queue(struct rq_wb *rwb, struct 
> rq_wait *rqw,
>   * Block if we will exceed our limit, or if we are currently waiting for
>   * the timer to kick off queuing again.
>   */
> -static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
> +static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
> +unsigned long rw, spinlock_t *lock)
>   __releases(lock)
>   __acquires(lock)
>  {
> - struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
> + struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
>   DEFINE_WAIT(wait);
>  
>   if (may_queue(rwb, rqw, &wait, rw))
> @@ -584,7 +589,7 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, 
> struct bio *bio)
>   */
>  enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
>  {
> - unsigned int ret = 0;
> + enum wbt_flags ret = 0;
>  
>   if (!rwb_enabled(rwb))
>   return 0;
> @@ -598,14 +603,14 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio 
> *bio, spinlock_t *lock)
>   return ret;
>   }
>  
> - __wbt_wait(rwb, bio->bi_opf, lock);
> + if (current_is_kswapd())
> + ret |= WBT_KSWAPD;
> +
> + __wbt_wait(rwb, ret, bio->bi_opf, lock);
>  
>   if (!blk_stat_is_active(rwb->cb))
>   rwb_arm_timer(rwb);
>  
> - if (current_is_kswapd())
> - ret |= WBT_KSWAPD;
> -
>   return ret | WBT_TRACKED;
>  }
>  
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index a232c98fbf4d..8038b4a0d4ef 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -19,7 +19,9 @@ enum wbt_flags {
>  };
>  
>  enum {
> - WBT_NUM_RWQ = 2,
> + WBT_RWQ_BG  = 0,
> + WBT_RWQ_KSWAPD,
> + WBT_NUM_RWQ,
>  };
>  
>  /*
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] blk-wbt: account any writing command as a write

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:33AM -0600, Jens Axboe wrote:
> We currently special case WRITE and FLUSH, but we should really
> just include any command with the write bit set. This ensures
> that we account DISCARD.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  block/blk-wbt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index f92fc84b5e2c..3e34b41bcefc 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -701,7 +701,7 @@ static int wbt_data_dir(const struct request *rq)
>  
>   if (op == REQ_OP_READ)
>   return READ;
> - else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> + else if (op_is_write(op))
>   return WRITE;
>  
>   /* don't account */
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] block: break discard submissions into the user defined size

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:32AM -0600, Jens Axboe wrote:
> Don't build discards bigger than what the user asked for, if the
> user decided to limit the size by writing to 'discard_max_bytes'.
> 
> Signed-off-by: Jens Axboe 

Seems fine to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  block/blk-lib.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index a676084d4740..7417d617091b 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -62,10 +62,11 @@ int __blkdev_issue_discard(struct block_device *bdev, 
> sector_t sector,
>   unsigned int req_sects;
>   sector_t end_sect, tmp;
>  
> - /* Make sure bi_size doesn't overflow */
> - req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> + /* Issue in chunks of the user defined max discard setting */
> + req_sects = min_t(sector_t, nr_sects,
> + q->limits.max_discard_sectors);
>  
> - /**
> + /*
>* If splitting a request, and the next starting sector would be
>* misaligned, stop the discard at the previous aligned sector.
>*/
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Theodore Y. Ts'o
On Tue, May 08, 2018 at 05:45:21AM +0900, Tetsuo Handa wrote:
> > 
> > > By the way, are you aware that current "/* Avoid recursion */" loop is 
> > > not thread safe?
> > 
> > Actually, it is safe.  While the child loop device has an open file on
> > the parent, lo_refcnt is elevated, which prevents loop_clr_fd from
> > actually setting lo_state to Lo_rundown and clearing
> > lo_backing_file.
> 
> If you think it is safe, please explain that the crash referenced in a patch
> at https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ is
> no longer possible. syzbot is hitting crashes there.

Huh?  You were worried about a race where loop_change_fd could race
with loop_clr_fd causing a NULL dereference of lo_backing_file.

The mail thread you are referencing is a deadlock problem with
loop_reread_partitions() and lo_release().  This is unrelated to the
possible race you were concerned about in loop_change_fd().

- Ted



Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Logan Gunthorpe

> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
balance is NVMe stuff so we could look at merging it through that tree.
The block patch and IB patch are pretty small.

Thanks,

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci|  25 +
>  Documentation/PCI/index.rst|  14 +
>  Documentation/driver-api/index.rst |   2 +-
>  Documentation/driver-api/pci/index.rst |  20 +
>  Documentation/driver-api/pci/p2pdma.rst| 166 ++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst|   3 +-
>  block/blk-core.c   |   3 +
>  drivers/infiniband/core/rw.c   |  13 +-
>  drivers/nvme/host/core.c   |   4 +
>  drivers/nvme/host/nvme.h   |   8 +
>  drivers/nvme/host/pci.c| 118 +++--
>  drivers/nvme/target/configfs.c |  67 +++
>  drivers/nvme/target/core.c | 143 -
>  drivers/nvme/target/io-cmd.c   |   3 +
>  drivers/nvme/target/nvmet.h|  15 +
>  drivers/nvme/target/rdma.c |  22 +-
>  drivers/pci/Kconfig|  26 +
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 814 
> +
>  drivers/pci/pci.c  |   6 +
>  include/linux/blk_types.h  |  18 +-
>  include/linux/blkdev.h |   3 +
>  include/linux/memremap.h   |  19 +
>  include/linux/pci-p2pdma.h | 118 +
>  include/linux/pci.h|   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envison merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jonathan Corbet 
> ---
>  Documentation/PCI/index.rst |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 
> 
>  Documentation/index.rst |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index ..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==
> +Linux PCI Driver Developer's Guide
> +==
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   ===
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst 
> b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
> :maxdepth: 2
>  
> pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst 
> b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index ..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +
> +PCI Peer-to-Peer DMA Support
> +
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-07 Thread Bjorn Helgaas
[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration time, and
I don't know if there are other places where we would care.

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have designed the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  9 +
>  drivers/pci/p2pdma.c   | 45 ++---
>  drivers/pci/pci.c  |  6 ++
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transactions must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>  
> +   Enabling this option will also disable ACS on all ports behind
> +   any PCIe switch. This effectively puts all devices behind any
> +   switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +   individual devices behind any switch will not be able to be
> +   assigned to separate VMs because there is no isolation between
> +   them. Additionally, any malicious PCIe devices will be able to
> +   DMA to memory exposed by other EPs in the same domain as TLPs
> +   will not be checked by the IOMMU.
> +
> If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct 
> device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> - struct pci_dev *up1, *up2;
> + int pos;
> + u16 ctrl;
>  
> - if (!pdev)
> - return NULL;
> + if (!pci_is_bridge(pdev))
> + return 0;
>  
> - up1 = pci_dev_get(pci_upstream_bridge(pdev));
> - if (!up1)
> - return NULL;
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> - 

Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory

2018-05-07 Thread Logan Gunthorpe
Thanks for the review. I'll apply all of these for the changes for next
version of the set.
>> +/*
>> + * If a device is behind a switch, we try to find the upstream bridge
>> + * port of the switch. This requires two calls to pci_upstream_bridge():
>> + * one for the upstream port on the switch, one on the upstream port
>> + * for the next level in the hierarchy. Because of this, devices connected
>> + * to the root port will be rejected.
>> + */
>> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> 
> This function doesn't seem to be used anymore.  Thanks for all your hard
> work to get rid of it!

Oops, I thought I had gotten rid of it entirely, but I guess I messed it
up a bit and it gets removed in patch 4. I'll fix it for v5.

Logan


Re: [PATCH v4 03/14] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

2018-05-07 Thread Bjorn Helgaas
s/dma/DMA/ (in subject)

On Mon, Apr 23, 2018 at 05:30:35PM -0600, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> P2P memory or no P2P memory.
> 
> Signed-off-by: Logan Gunthorpe 


Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes

s/PCI/PCIe/

> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through

s/root port/PCI bridge/

> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  17 ++
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 694 
> +
>  include/linux/memremap.h   |  18 ++
>  include/linux/pci-p2pdma.h | 100 +++
>  include/linux/pci.h|   4 +
>  6 files changed, 834 insertions(+)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>  
> If unsure, say N.
>  
> +config PCI_P2PDMA
> + bool "PCI peer-to-peer transfer support"
> + depends on PCI && ZONE_DEVICE && EXPERT
> + select GENERIC_ALLOCATOR
> + help
> +   Enables drivers to do PCI peer-to-peer transactions to and from
> +   BARs that are exposed in other devices that are the part of
> +   the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +   specification to work (ie. anything below a single PCI bridge).
> +
> +   Many PCIe root complexes do not support P2P transactions and
> +   it's hard to tell which support it at all, so at this time, DMA
> +   transactions must be between devices behind the same root port.

s/DMA transactions/PCIe DMA transactions/

(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)

> +   (Typically behind a network of PCIe switches).

Not sure this last sentence adds useful information.

> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *

Nit: unnecessary blank line.

> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)

This function doesn't seem to be used anymore.  Thanks for all your hard
work to get rid of it!

> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + 

[PATCH 1/7] mm: workingset: don't drop refault information prematurely

2018-05-07 Thread Johannes Weiner
From: Johannes Weiner 

If we just keep enough refault information to match the CURRENT page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out
all the cache. Once cache comes back, we won't see those refaults.
They might not be actionable for LRU aging, but we want to know about
them for measuring memory pressure.
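
As a rough worked example of the new cap (assuming the common
RADIX_TREE_MAP_SHIFT of 6, so the shift below is 3):

  max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);  /* = pages / 8 */

  /* e.g. a node with 4G of RAM and 4K pages: pages = 1048576, so up to
   * 131072 shadow nodes are kept, regardless of how much of that memory
   * currently holds page cache.
   */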

Signed-off-by: Johannes Weiner 
---
 mm/workingset.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 {
unsigned long max_nodes;
unsigned long nodes;
-   unsigned long cache;
+   unsigned long pages;
 
/* list_lru lock nests inside the IRQ-safe i_pages lock */
local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker 
*shrinker,
 *
 * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
 */
-   if (sc->memcg) {
-   cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
-LRU_ALL_FILE);
-   } else {
-   cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
-   node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
-   }
-   max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+   if (sc->memcg)
+   pages = page_counter_read(&sc->memcg->memory);
+   else
+#endif
+   pages = node_present_pages(sc->nid);
+
+   max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
 
if (nodes <= max_nodes)
return 0;
-- 
2.17.0



[PATCH 2/7] mm: workingset: tell cache transitions from workingset thrashing

2018-05-07 Thread Johannes Weiner
Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a range
of applications, including measuring the impact of memory shortage on
the system performance, as well as the ability to smarter balance
pressure between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
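
As a toy model of how one bit in the shadow entry carries that
classification across the eviction/refault cycle (illustrative only; the
real shadow entries also pack the node and memcg of the eviction):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Toy shadow entry: bit 0 says "page was in the workingset when it was
   * evicted", the remaining bits hold the eviction timestamp.
   */
  static uintptr_t pack_shadow(uint64_t eviction, bool workingset)
  {
          return ((uintptr_t)eviction << 1) | workingset;
  }

  static void classify_refault(uintptr_t shadow)
  {
          bool workingset = shadow & 1;

          /* A refault of a formerly-active page counts as thrashing
           * (WORKINGSET_RESTORE, PG_workingset set on the new page);
           * otherwise it is an ordinary workingset transition.
           */
          printf("refault is %s\n", workingset ? "thrashing" : "a transition");
  }

  int main(void)
  {
          classify_refault(pack_shadow(12345, true));
          classify_refault(pack_shadow(12346, false));
          return 0;
  }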

How many page->flags does this leave us with on 32-bit?

20 bits are always page flags

21 if you have an MMU

23 with the zone bits for DMA, Normal, HighMem, Movable

29 with the sparsemem section bits

30 if PAE is enabled

31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
re-gain the 6 or 7 sparsemem section bits.

Signed-off-by: Johannes Weiner 
---
 include/linux/mmzone.h |  1 +
 include/linux/page-flags.h |  5 +-
 include/linux/swap.h   |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c   |  9 ++--
 mm/huge_memory.c   |  1 +
 mm/memcontrol.c|  2 +
 mm/migrate.c   |  2 +
 mm/swap_state.c|  1 +
 mm/vmscan.c|  1 +
 mm/vmstat.c|  1 +
 mm/workingset.c| 95 ++
 12 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+   WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..7af1c3c15d8e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
  */
 enum pageflags {
PG_locked,  /* Page is locked. Don't touch. */
-   PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
+   PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must 
be bit #7 and in the same byte as "PG_locked" */
+   PG_error,
PG_slab,
PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, 
PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+   TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)/* Used by some filesystems 
*/
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..d8c47dcdec6f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
 /* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
{1UL << PG_dirty,   "dirty" },  \
{1UL << PG_lru, "lru"   },  \
{1UL << PG_active,  "active"},  \
+   {1UL << PG_workingset,  "workingset"},  \
{1UL << PG_slab,"slab"  },  \
{1UL << PG_owner_priv_1,"owner_priv_1"  },  \
{1UL << PG_arch_1,  "arch_1"},  \
diff --git a/mm/filemap.c b/mm/filemap.c
index 0604cb02e6f3..bd36b7226cf4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct 

[PATCH 7/7] psi: cgroup support

2018-05-07 Thread Johannes Weiner
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for
only the tasks inside the cgroup.

Signed-off-by: Johannes Weiner 
---
 Documentation/cgroup-v2.txt | 18 +
 include/linux/cgroup-defs.h |  4 ++
 include/linux/cgroup.h  | 15 +++
 include/linux/psi.h | 25 
 init/Kconfig|  4 ++
 kernel/cgroup/cgroup.c  | 45 -
 kernel/sched/psi.c  | 79 -
 7 files changed, 186 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..a22879dba019 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -963,6 +963,12 @@ All time durations are in microseconds.
$PERIOD duration.  "max" for $MAX indicates no limit.  If only
one number is written, $MAX is updated.
 
+  cpu.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for CPU. See
+   Documentation/accounting/psi.txt for details.
+
 
 Memory
 --
@@ -1199,6 +1205,12 @@ PAGE_SIZE multiple when read back.
Swap usage hard limit.  If a cgroup's swap usage reaches this
limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for memory. See
+   Documentation/accounting/psi.txt for details.
+
 
 Usage Guidelines
 
@@ -1334,6 +1346,12 @@ IO Interface Files
 
  8:16 rbps=2097152 wbps=max riops=max wiops=max
 
+  io.pressure
+   A read-only nested-key file which exists on non-root cgroups.
+
+   Shows pressure stall information for IO. See
+   Documentation/accounting/psi.txt for details.
+
 
 Writeback
 ~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..280f18da956a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -424,6 +425,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to track pressure stalls */
+   struct psi_group psi;
+
/* used to store eBPF programs */
struct cgroup_bpf bpf;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..fd94c294c207 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+   return &cgrp->psi;
+}
+
 static inline void cgroup_init_kthreadd(void)
 {
/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id 
*cgroup_get_kernfs_id(struct cgroup *cgrp)
return NULL;
 }
 
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+   return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+   return NULL;
+}
+
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
   struct cgroup *ancestor)
 {
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 371af1479699..05c3dae3e9c5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
 #include 
 #include 
 
+struct seq_file;
+struct css_set;
+
 #ifdef CONFIG_PSI
 
 extern bool psi_disabled;
@@ -15,6 +18,14 @@ void psi_task_change(struct task_struct *task, u64 now, int 
clear, int set);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
@@ -22,6 +33,20 @@ static inline void psi_init(void) {}
 static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}
 
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+   return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+   

[PATCH 3/7] delayacct: track delays from thrashing cache pages

2018-05-07 Thread Johannes Weiner
Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on thrashing
page cache. This isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to read back into memory.

Also update tools/accounting/getdelays.c:

 [hannes@computer accounting]$ sudo ./getdelays -d -p 1
 print delayacct stats ON
 PID 1

 CPU         count     real total  virtual total    delay total  delay average
             50318          74500      847346785      400533713        0.008ms
 IO          count    delay total  delay average
               435      122601218            0ms
 SWAP        count    delay total  delay average
                 0              0            0ms
 RECLAIM     count    delay total  delay average
                 0              0            0ms
 THRASHING   count    delay total  delay average
                19       12621439            0ms
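
The accounting call pattern itself is simple; here is a minimal userspace
sketch of the start/end bracketing that the patch adds around the wait
for a thrashing page (variable names mirror the new task_delay_info
members, the timing plumbing is illustrative):

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static uint64_t now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
  }

  int main(void)
  {
          uint64_t thrashing_start, thrashing_delay = 0;
          uint32_t thrashing_count = 0;

          thrashing_start = now_ns();
          /* ... wait for the recently evicted page to be read back ... */
          thrashing_delay += now_ns() - thrashing_start;
          thrashing_count++;

          printf("THRASHING count=%u delay=%lluns\n",
                 thrashing_count, (unsigned long long)thrashing_delay);
          return 0;
  }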

Signed-off-by: Johannes Weiner 
---
 include/linux/delayacct.h  | 23 +++
 include/uapi/linux/taskstats.h |  6 +-
 kernel/delayacct.c | 15 +++
 mm/filemap.c   | 11 +++
 tools/accounting/getdelays.c   |  8 +++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 5e335b6203f4..d3e75b3ba487 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
 
u64 freepages_start;
u64 freepages_delay;/* wait for memory reclaim */
+
+   u64 thrashing_start;
+   u64 thrashing_delay;/* wait for thrashing page */
+
u32 freepages_count;/* total count of memory reclaim */
+   u32 thrashing_count;/* total count of thrash waits */
 };
 #endif
 
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct 
task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
 extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
 
 static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
 {
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
__delayacct_freepages_end();
 }
 
+static inline void delayacct_thrashing_start(void)
+{
+   if (current->delays)
+   __delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+   if (current->delays)
+   __delayacct_thrashing_end();
+}
+
 #else
 static inline void delayacct_set_flag(int flag)
 {}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
 {}
 static inline void delayacct_freepages_end(void)
 {}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
 
 #endif /* CONFIG_TASK_DELAY_ACCT */
 
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
  */
 
 
-#define TASKSTATS_VERSION  8
+#define TASKSTATS_VERSION  9
 #define TS_COMM_LEN32  /* should be >= TASK_COMM_LEN
 * in linux/sched.h */
 
@@ -164,6 +164,10 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64   freepages_count;
__u64   freepages_delay_total;
+
+   /* Delay waiting for thrashing page */
+   __u64   thrashing_count;
+   __u64   thrashing_delay_total;
 };
 
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index e2764d767f18..02ba745c448d 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -134,9 +134,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct 
task_struct *tsk)
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+   tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+   d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
d->freepages_count += tsk->delays->freepages_count;
+   d->thrashing_count += tsk->delays->thrashing_count;
	spin_unlock_irqrestore(&tsk->delays->lock, flags);
 
return 0;
@@ -168,3 +171,15 @@ void __delayacct_freepages_end(void)
			&current->delays->freepages_count);
 }
 
+void 

[PATCH 5/7] sched: loadavg: make calc_load_n() public

2018-05-07 Thread Johannes Weiner
It's going to be used in the following patch. Keep the churn separate.

Signed-off-by: Johannes Weiner 
---
 include/linux/sched/loadavg.h | 69 +++
 kernel/sched/loadavg.c| 69 ---
 2 files changed, 69 insertions(+), 69 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..0e4c24978751 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,75 @@ calc_load(unsigned long load, unsigned long exp, unsigned 
long active)
return newload / FIXED_1;
 }
 
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static inline unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+   unsigned long result = 1UL << frac_bits;
+
+   if (n) {
+   for (;;) {
+   if (n & 1) {
+   result *= x;
+   result += 1UL << (frac_bits - 1);
+   result >>= frac_bits;
+   }
+   n >>= 1;
+   if (!n)
+   break;
+   x *= x;
+   x += 1UL << (frac_bits - 1);
+   x >>= frac_bits;
+   }
+   }
+
+   return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ *= (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ *= a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ *= (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ *= a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ *  ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ *= a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ *= a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ *  n 1 - x^(n+1)
+ * S_n := \Sum x^i = -
+ * i=0  1 - x
+ */
+static inline unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+   unsigned long active, unsigned int n)
+{
+   return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..0736e349a54e 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -210,75 +210,6 @@ static long calc_load_nohz_fold(void)
return delta;
 }
 
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x: base of the power
- * @frac_bits: fractional bits of @x
- * @n: power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
-   unsigned long result = 1UL << frac_bits;
-
-   if (n) {
-   for (;;) {
-   if (n & 1) {
-   result *= x;
-   result += 1UL << (frac_bits - 1);
-   result >>= frac_bits;
-   }
-   n >>= 1;
-   if (!n)
-   break;
-   x *= x;
-   x += 1UL << (frac_bits - 1);
-   x >>= frac_bits;
-   }
-   }
-
-   return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- *= (a0 * e + a * (1 - e)) * e + a * (1 - e)
- *= a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- *= (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- *= a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- *  ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
- *= a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
- *= a0 * e^n + a * (1 - e^n)
- *
- * [1] 
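
As a worked example of the closed form being moved (a_n = a_0*e^n + a*(1 - e^n)):
with FIXED_1 = 2048 and the 1-minute decay factor EXP_1 = 1884, catching up
after twelve missed 5-second ticks computes

    e^12 = (1884/2048)^12 ~= 0.368 ~= 1/e

so a single calc_load_n(load, EXP_1, active, 12) call decays the old average
by about 1/e, which is exactly what twelve consecutive calc_load() calls would
have done. (EXP_1 = 1884 is the existing constant from loadavg.h, not part of
this patch.)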

[PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risk complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

   cpu: some tasks are runnable but not executing on a CPU
   memory: tasks are reclaiming, or waiting for swapin or thrashing cache
   io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).

Signed-off-by: Johannes Weiner 
---
 Documentation/accounting/psi.txt |  73 ++
 include/linux/psi.h  |  27 ++
 include/linux/psi_types.h|  84 ++
 include/linux/sched.h|  10 +
 include/linux/sched/stat.h   |  10 +-
 init/Kconfig |  16 ++
 kernel/fork.c|   4 +
 kernel/sched/Makefile|   1 +
 kernel/sched/core.c  |   3 +
 kernel/sched/psi.c   | 424 +++
 kernel/sched/sched.h | 166 ++--
 kernel/sched/stats.h |  91 ++-
 mm/compaction.c  |   5 +
 mm/filemap.c |  15 +-
 mm/page_alloc.c  |  10 +
 mm/vmscan.c  |  13 +
 16 files changed, 859 insertions(+), 93 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index ..e051810d5127
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,73 @@
+
+PSI - Pressure Stall Information
+
+
+:Date: April, 2018
+:Author: Johannes Weiner 
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from 

[PATCH 0/7] psi: pressure stall information for CPU, memory, and IO

2018-05-07 Thread Johannes Weiner
Hi,

I previously submitted a version of this patch set called "memdelay",
which translated delays from reclaim, swap-in, thrashing page cache
into a pressure percentage of lost walltime. I've since extended this
code to aggregate all delay states tracked by delayacct in order to
have generalized pressure/overcommit levels for CPU, memory, and IO.

There was feedback from Peter on the previous version that I have
incorporated as much as possible and as it still applies to this code:

- got rid of the extra lock in the sched callbacks; all task
  state changes we care about serialize through rq->lock

- got rid of ktime_get() inside the sched callbacks and
  switched time measuring to rq_clock()

- got rid of all divisions inside the sched callbacks,
  tracking everything natively in ns now

I also moved this stuff into existing sched/stat.h callbacks, so it
doesn't get in the way in sched/core.c, and of course moved the whole
thing behind CONFIG_PSI since not everyone is going to want it.

Real-world applications

Since the last posting, we've begun using the data collected by this
code quite extensively at Facebook, and with several success stories.

First we used it on systems that frequently locked up in low memory
situations. The reason this happens is that the OOM killer is
triggered by reclaim not being able to make forward progress, but with
fast flash devices there is *always* some clean and uptodate cache to
reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of
the time faulting executables. There is no situation where this ever
makes sense in practice. We wrote a <100 line POC python script to
monitor memory pressure and kill stuff manually, way before such
pathological thrashing sets in.
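
As an illustration of that kind of monitor, a stripped-down userspace sketch
follows. It is not the POC script or oomd, just an example of consuming
/proc/pressure/memory (described further down in this mail); the 10% threshold
and the 2-second poll interval are arbitrary, and the only "action" taken is
logging.

/* Minimal sketch of a pressure monitor: sample the "full" line of
 * /proc/pressure/memory and report when the 10s average crosses a
 * threshold.  What to do then (shed load, kill, page someone) is up
 * to the user of the data and is left out.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/pressure/memory", "r");
		char line[256];

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			double avg10;

			if (sscanf(line, "full avg10=%lf", &avg10) == 1 &&
			    avg10 > 10.0)
				fprintf(stderr,
					"memory pressure high: full avg10=%.2f\n",
					avg10);
		}
		fclose(f);
		sleep(2);
	}
}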

We've since extended the python script into a more generic oomd that
we use all over the place, not just to avoid livelocks but also to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.

We also use the memory pressure info for loadshedding. Our batch job
infrastructure used to refuse new requests on heuristics based on RSS
and other existing VM metrics in an attempt to avoid OOM kills and
maximize utilization. Since it was still plagued by frequent OOM
kills, we switched it to shed load on psi memory pressure, which has
turned out to be a much better bellwether, and we managed to reduce
OOM kills drastically. Reducing the rate of OOM outages from the
worker pool raised its aggregate productivity, and we were able to
switch that service to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each others' toes. We were not able to do this properly without the
pressure metrics; we would see latency or bandwidth drops, but it
would often be hard or impossible to root-cause them post-mortem. We now
log and graph the pressure metrics for all containers in our fleet and
can trivially link service drops to resource pressure after the fact.

How do you use this?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
calculate pressure at the cgroup level instead of system-wide.

The cpu file contains one line:

some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which some tasks are
delayed on the runqueue while another task has the CPU. They're recent
averages over 10s, 1m, 5m windows, so you can tell short term trends
from long term ones, similarly to the load average.

What to make of this number? If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even when 50%
of the threads were to go idle (and most workloads do vary). From the
perspective of the individual job it's not great, however, and they
might do better with more resources. Depending on what your priority
is, an elevated "some" number may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu: the time in which at least one
task is stalled on the resource.

The full line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks could be waiting
on thrashing cache simultaneously. It can also happen when a single
reclaimer occupies the CPU, 

[PATCH 4/7] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD

2018-05-07 Thread Johannes Weiner
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.

Signed-off-by: Johannes Weiner 
---
 .../platforms/cell/cpufreq_spudemand.c|  2 +-
 arch/powerpc/platforms/cell/spufs/sched.c |  9 +++-
 arch/s390/appldata/appldata_os.c  |  4 
 drivers/cpuidle/governors/menu.c  |  4 
 fs/proc/loadavg.c |  3 ---
 include/linux/sched/loadavg.h | 21 +++
 kernel/debug/kdb/kdb_main.c   |  7 +--
 kernel/sched/loadavg.c| 15 -
 8 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c 
b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
cpu = info->policy->cpu;
	busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
 
-   CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+   info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
cpu, busy_spus, info->busy_spus);
 
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c 
b/arch/powerpc/platforms/cell/spufs/sched.c
index ccc421503363..70101510b19d 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
unsigned long active_tasks; /* fixed-point */
 
active_tasks = count_active_contexts() * FIXED_1;
-   CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
-   CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
-   CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+   spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+   spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+   spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
 }
 
 static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1bfe03ceb236..3738b670df7a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -133,10 +133,6 @@ struct menu_device {
int interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index b572cc865b92..8bee50a97c0f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
 #include 
 #include 
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long 
offset, int shift);
 #define EXP_5  2014/* 1/exp(5sec/5min) */
 #define EXP_15 2037/* 1/exp(5sec/15min) */
 
-#define CALC_LOAD(load,exp,n) \
-   load *= exp; \
-   load += n*(FIXED_1-exp); \
-   load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+   unsigned long newload;
+
+   newload = load * exp + active * (FIXED_1 - exp);
+   if (active >= load)
+   newload += FIXED_1-1;
+
+   return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
 extern void calc_global_load(unsigned long ticks);
 
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index e405677ee08d..a8f5aca5eb5e 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ 

Re: general protection fault in lo_ioctl (2)

2018-05-07 Thread Tetsuo Handa
On 2018/05/02 20:23, Dmitry Vyukov wrote:
> #syz dup: INFO: rcu detected stall in blkdev_ioctl

The cause of stall turned out to be ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd).
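
Spelled out as a minimal C sketch (not syzkaller's actual reproducer), the
offending request is simply the device's own fd passed back to it:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
	int loop_fd = open("/dev/loop0", O_RDWR);

	if (loop_fd < 0)
		return 1;
	/* Ask the loop device to use itself as backing store. */
	return ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd);
}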

But we haven't explained the cause of NULL pointer dereference which can
occur when raced with ioctl(LOOP_CLR_FD). Therefore,

#syz undup


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Theodore Y. Ts'o wrote:
> On Mon, May 07, 2018 at 10:21:04PM +0900, Tetsuo Handa wrote:
> > > I don't understand your concern; where are we going to out_putf when
> > > error == 0?
> 
> Ah, now I see it, thanks.  I'll send a revised patch.
> 
> > By the way, are you aware that current "/* Avoid recursion */" loop is not 
> > thread safe?
> 
> Actually, it is safe.  While the child loop device has an open file on
> the parent, lo_refcnt is elevated, which prevents loop_clr_fd from
> actually set setting lo_state to Lo_rundown and clearing
> lo_backing_file

If you think it is safe, please explain that the crash referenced in a patch
at https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ is
no longer possible. syzbot is hitting crashes there.


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-07 Thread Paolo Valente


> On 07 May 2018, at 12:01, Mike Galbraith wrote:
> 
> On Mon, 2018-05-07 at 11:27 +0200, Paolo Valente wrote:
>> 
>> 
>> Where is the bug?
> 
> Hm, seems potent pain-killers and C don't mix all that well.
> 

I'll try to keep it in mind, and I do wish you get well soon.

Thanks,
Paolo

Re: bug in tag handling in blk-mq?

2018-05-07 Thread Paolo Valente


> On 07 May 2018, at 18:39, Jens Axboe wrote:
> 
> On 5/7/18 8:03 AM, Paolo Valente wrote:
>> Hi Jens, Christoph, all,
>> Mike Galbraith has been experiencing hangs, on blk_mq_get_tag, only
>> with bfq [1].  Symptoms seem to clearly point to a problem in I/O-tag
>> handling, triggered by bfq because it limits the number of tags for
>> async and sync write requests (in bfq_limit_depth).
>> 
>> Fortunately, I just happened to find a way to apparently confirm it.
>> With the following one-liner for block/bfq-iosched.c:
>> 
>> @@ -554,8 +554,7 @@ static void bfq_limit_depth(unsigned int op, struct 
>> blk_mq_alloc_data *data)
>>if (unlikely(bfqd->sb_shift != bt->sb.shift))
>>bfq_update_depths(bfqd, bt);
>> 
>> -   data->shallow_depth =
>> -   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
>> +   data->shallow_depth = 1;
>> 
>>bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
>>__func__, bfqd->wr_busy_queues, op_is_sync(op),
>> 
>> Mike's machine now crashes soon and systematically, while nothing bad
>> happens on my machines, even with heavy workloads (apart from an
>> expected throughput drop).
>> 
>> This change simply reduces to 1 the maximum possible value for the sum
>> of the number of async requests and of sync write requests.
>> 
>> This email is basically a request for help to knowledgeable people.  To
>> start, here are my first doubts/questions:
>> 1) Just to be certain, I guess it is not normal that blk-mq hangs if
>> async requests and sync write requests can be at most one, right?
>> 2) Do you have any hint to where I could look for, to chase this bug?
>> Of course, the bug may be in bfq, i.e, it may be a somehow unrelated
>> bfq bug that causes this hang in blk-mq, indirectly.  But it is hard
>> for me to understand how.
> 
> CC Omar, since he implemented the shallow part. But we'll need some
> traces to show where we are hung, probably also the value of the
> /sys/debug/kernel/block// directory. For the crash mentioned, a
> trace as well. Otherwise we'll be wasting a lot of time on this.
> 
> Is there a reproducer?
> 

Ok Mike, I guess it's your turn now, for at least a stack trace.

Thanks,
Paolo

> -- 
> Jens Axboe



Re: bug in tag handling in blk-mq?

2018-05-07 Thread Jens Axboe
On 5/7/18 8:03 AM, Paolo Valente wrote:
> Hi Jens, Christoph, all,
> Mike Galbraith has been experiencing hangs, on blk_mq_get_tag, only
> with bfq [1].  Symptoms seem to clearly point to a problem in I/O-tag
> handling, triggered by bfq because it limits the number of tags for
> async and sync write requests (in bfq_limit_depth).
> 
> Fortunately, I just happened to find a way to apparently confirm it.
> With the following one-liner for block/bfq-iosched.c:
> 
> @@ -554,8 +554,7 @@ static void bfq_limit_depth(unsigned int op, struct 
> blk_mq_alloc_data *data)
> if (unlikely(bfqd->sb_shift != bt->sb.shift))
> bfq_update_depths(bfqd, bt);
>  
> -   data->shallow_depth =
> -   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
> +   data->shallow_depth = 1;
>  
> bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
> __func__, bfqd->wr_busy_queues, op_is_sync(op),
> 
> Mike's machine now crashes soon and systematically, while nothing bad
> happens on my machines, even with heavy workloads (apart from an
> expected throughput drop).
> 
> This change simply reduces to 1 the maximum possible value for the sum
> of the number of async requests and of sync write requests.
> 
> This email is basically a request for help to knowledgeable people.  To
> start, here are my first doubts/questions:
> 1) Just to be certain, I guess it is not normal that blk-mq hangs if
> async requests and sync write requests can be at most one, right?
> 2) Do you have any hint to where I could look for, to chase this bug?
> Of course, the bug may be in bfq, i.e, it may be a somehow unrelated
> bfq bug that causes this hang in blk-mq, indirectly.  But it is hard
> for me to understand how.

CC Omar, since he implemented the shallow part. But we'll need some
traces to show where we are hung, probably also the value of the
/sys/debug/kernel/block// directory. For the crash mentioned, a
trace as well. Otherwise we'll be wasting a lot of time on this.

Is there a reproducer?

-- 
Jens Axboe



[PATCH 1/4] block: break discard submissions into the user defined size

2018-05-07 Thread Jens Axboe
Don't build discards bigger than what the user asked for, if the
user decided to limit the size by writing to 'discard_max_bytes'.

Signed-off-by: Jens Axboe 
---
 block/blk-lib.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index a676084d4740..7417d617091b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -62,10 +62,11 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
unsigned int req_sects;
sector_t end_sect, tmp;
 
-   /* Make sure bi_size doesn't overflow */
-   req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
+   /* Issue in chunks of the user defined max discard setting */
+   req_sects = min_t(sector_t, nr_sects,
+   q->limits.max_discard_sectors);
 
-   /**
+   /*
 * If splitting a request, and the next starting sector would be
 * misaligned, stop the discard at the previous aligned sector.
 */
-- 
2.7.4



[PATCHSET v2 0/4] Add throttling for discards

2018-05-07 Thread Jens Axboe
I implemented support for discards in blk-wbt, so we treat them like
background writes. If we have competing foreground IO, then we may
throttle discards. Otherwise they should run at full speed.

Ran a quick local test case on a NVMe device. The test case here is
doing reads from a file, while other files are being delete. File
system is XFS. Three separate runs:

1) 'justreader' is just running the reader workload by itself.
2) 'nothrottle' is running the reader with the unlinks on the stock
   kernel.
3) 'throttled' is running the reader with the unlinks with the patches
   applied.

The fio run logs latencies in windows of 50msec, logging only the
worst latency seen in that window:

[read]
filename=file0
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=8
write_lat_log=read_lat
log_avg_msec=50
log_max_value=1

Graph of the latencies here: http://kernel.dk/wbt-discard.jpg

As you can see, with the throttling, the latencies and performance is
the same as not having the unlinks running. They finish within 100msec
of each other, after having read 10G of data. Without the throttling
and unlinks running, we see spikes that are almost an order of magnitude
worse. The reader also takes 1 second longer to complete. 322MB/sec vs
313MB/sec. It's also visible in the higher Pxx. Here's the throttled
end of the percentiles:

 | 99.00th=[  145], 99.50th=[  153], 99.90th=[  180], 99.95th=[  198],
 | 99.99th=[  889]

and here's the unthrottled equivalent:

 | 99.00th=[  145], 99.50th=[  155], 99.90th=[  188], 99.95th=[  322],
 | 99.99th=[ 8586]


Changes since v1:

- Pass in enum wbt_flags for get_rq_wait()
- Turn wbt_should_throttle() into a switch statement

 blk-lib.c  |7 +++---
 blk-stat.h |6 ++---
 blk-wbt.c  |   70 -
 blk-wbt.h  |8 +-
 4 files changed, 55 insertions(+), 36 deletions(-)



[PATCH 4/4] blk-wbt: throttle discards like background writes

2018-05-07 Thread Jens Axboe
Throttle discards like we would any background write. Discards should
be background activity, so if they are impacting foreground IO, then
we will throttle them down.

Signed-off-by: Jens Axboe 
---
 block/blk-stat.h |  6 +++---
 block/blk-wbt.c  | 43 ++-
 block/blk-wbt.h  |  4 +++-
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/block/blk-stat.h b/block/blk-stat.h
index 2dd36347252a..c22049a8125e 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -10,11 +10,11 @@
 
 /*
  * from upper:
- * 3 bits: reserved for other usage
+ * 4 bits: reserved for other usage
  * 12 bits: size
- * 49 bits: time
+ * 48 bits: time
  */
-#define BLK_STAT_RES_BITS  3
+#define BLK_STAT_RES_BITS  4
 #define BLK_STAT_SIZE_BITS 12
 #define BLK_STAT_RES_SHIFT (64 - BLK_STAT_RES_BITS)
 #define BLK_STAT_SIZE_SHIFT(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 25d202345965..a7a724580033 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -106,6 +106,8 @@ static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
 {
if (wb_acct & WBT_KSWAPD)
		return &rwb->rq_wait[WBT_RWQ_KSWAPD];
+   else if (wb_acct & WBT_DISCARD)
+   	return &rwb->rq_wait[WBT_RWQ_DISCARD];
 
	return &rwb->rq_wait[WBT_RWQ_BG];
 }
@@ -143,10 +145,13 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
}
 
/*
-* If the device does write back caching, drop further down
-* before we wake people up.
+* For discards, our limit is always the background. For writes, if
+* the device does write back caching, drop further down before we
+* wake people up.
 */
-   if (rwb->wc && !wb_recent_wait(rwb))
+   if (wb_acct & WBT_DISCARD)
+   limit = rwb->wb_background;
+   else if (rwb->wc && !wb_recent_wait(rwb))
limit = 0;
else
limit = rwb->wb_normal;
@@ -483,6 +488,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
unsigned long rw)
 {
unsigned int limit;
 
+   if ((rw & REQ_OP_MASK) == REQ_OP_DISCARD)
+   return rwb->wb_background;
+
/*
 * At this point we know it's a buffered write. If this is
 * kswapd trying to free memory, or REQ_SYNC is set, then
@@ -564,21 +572,20 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags 
wb_acct,
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
 {
-   const int op = bio_op(bio);
-
-   /*
-* If not a WRITE, do nothing
-*/
-   if (op != REQ_OP_WRITE)
-   return false;
-
-   /*
-* Don't throttle WRITE_ODIRECT
-*/
-   if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
+   switch (bio_op(bio)) {
+   case REQ_OP_WRITE:
+   /*
+* Don't throttle WRITE_ODIRECT
+*/
+   if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
+   (REQ_SYNC | REQ_IDLE))
+   return false;
+   /* fallthrough */
+   case REQ_OP_DISCARD:
+   return true;
+   default:
return false;
-
-   return true;
+   }
 }
 
 /*
@@ -605,6 +612,8 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, 
spinlock_t *lock)
 
if (current_is_kswapd())
ret |= WBT_KSWAPD;
+   if (bio_op(bio) == REQ_OP_DISCARD)
+   ret |= WBT_DISCARD;
 
__wbt_wait(rwb, ret, bio->bi_opf, lock);
 
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 8038b4a0d4ef..d6a125e49db5 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -14,13 +14,15 @@ enum wbt_flags {
WBT_TRACKED = 1,/* write, tracked for throttling */
WBT_READ= 2,/* read */
WBT_KSWAPD  = 4,/* write, from kswapd */
+   WBT_DISCARD = 8,/* discard */
 
-   WBT_NR_BITS = 3,/* number of bits */
+   WBT_NR_BITS = 4,/* number of bits */
 };
 
 enum {
WBT_RWQ_BG  = 0,
WBT_RWQ_KSWAPD,
+   WBT_RWQ_DISCARD,
WBT_NUM_RWQ,
 };
 
-- 
2.7.4



[PATCH 3/4] blk-wbt: pass in enum wbt_flags to get_rq_wait()

2018-05-07 Thread Jens Axboe
This is in preparation for having more write queues, in which
case we would have needed to pass in more information than just
a simple 'is_kswapd' boolean.

Signed-off-by: Jens Axboe 
---
 block/blk-wbt.c | 25 +++--
 block/blk-wbt.h |  4 +++-
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 3e34b41bcefc..25d202345965 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -101,9 +101,13 @@ static bool wb_recent_wait(struct rq_wb *rwb)
return time_before(jiffies, wb->dirty_sleep + HZ);
 }
 
-static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
+static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
+ enum wbt_flags wb_acct)
 {
-   return &rwb->rq_wait[is_kswapd];
+   if (wb_acct & WBT_KSWAPD)
+   	return &rwb->rq_wait[WBT_RWQ_KSWAPD];
+
+   return &rwb->rq_wait[WBT_RWQ_BG];
 }
 
 static void rwb_wake_all(struct rq_wb *rwb)
@@ -126,7 +130,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
if (!(wb_acct & WBT_TRACKED))
return;
 
-   rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
+   rqw = get_rq_wait(rwb, wb_acct);
	inflight = atomic_dec_return(&rqw->inflight);
 
/*
@@ -529,11 +533,12 @@ static inline bool may_queue(struct rq_wb *rwb, struct 
rq_wait *rqw,
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
  */
-static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
+  unsigned long rw, spinlock_t *lock)
__releases(lock)
__acquires(lock)
 {
-   struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
+   struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
DEFINE_WAIT(wait);
 
	if (may_queue(rwb, rqw, &wait, rw))
@@ -584,7 +589,7 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, 
struct bio *bio)
  */
 enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
 {
-   unsigned int ret = 0;
+   enum wbt_flags ret = 0;
 
if (!rwb_enabled(rwb))
return 0;
@@ -598,14 +603,14 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio 
*bio, spinlock_t *lock)
return ret;
}
 
-   __wbt_wait(rwb, bio->bi_opf, lock);
+   if (current_is_kswapd())
+   ret |= WBT_KSWAPD;
+
+   __wbt_wait(rwb, ret, bio->bi_opf, lock);
 
if (!blk_stat_is_active(rwb->cb))
rwb_arm_timer(rwb);
 
-   if (current_is_kswapd())
-   ret |= WBT_KSWAPD;
-
return ret | WBT_TRACKED;
 }
 
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index a232c98fbf4d..8038b4a0d4ef 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -19,7 +19,9 @@ enum wbt_flags {
 };
 
 enum {
-   WBT_NUM_RWQ = 2,
+   WBT_RWQ_BG  = 0,
+   WBT_RWQ_KSWAPD,
+   WBT_NUM_RWQ,
 };
 
 /*
-- 
2.7.4



[PATCH 2/4] blk-wbt: account any writing command as a write

2018-05-07 Thread Jens Axboe
We currently special case WRITE and FLUSH, but we should really
just include any command with the write bit set. This ensures
that we account DISCARD.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f92fc84b5e2c..3e34b41bcefc 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -701,7 +701,7 @@ static int wbt_data_dir(const struct request *rq)
 
if (op == REQ_OP_READ)
return READ;
-   else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
+   else if (op_is_write(op))
return WRITE;
 
/* don't account */
-- 
2.7.4



Re: [PATCH 3/3] blk-wbt: throttle discards like background writes

2018-05-07 Thread Jens Axboe
On 5/7/18 3:57 AM, Christoph Hellwig wrote:
>> -static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
>> +static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_trim,
>> +  bool is_kswapd)
>>  {
>> -return &rwb->rq_wait[is_kswapd];
>> +if (is_trim)
>> +return &rwb->rq_wait[WBT_REQ_DISCARD];
>> +else if (is_kswapd)
>> +return &rwb->rq_wait[WBT_REQ_KSWAPD];
>> +else
>> +return &rwb->rq_wait[WBT_REQ_BG];
>>  }
> 
> Wouldn't it be more useful to pass a enum wbt_flag here?
> 
> Or just have a wbt_flag_to_wait_idx helper and do the array indexing
> in the callers?

It would be cleaner, but we don't have wbt_flag everywhere we need it.
Though I guess we could swap the masking in wbt_wait() and do it
before the __wbt_wait() call, and just use that. Since we only do
the indexing in that one spot, I don't think we should add a helper.

> 
>>  {
>>  const int op = bio_op(bio);
>>  
>> -/*
>> - * If not a WRITE, do nothing
>> - */
>> -if (op != REQ_OP_WRITE)
>> -return false;
>> +if (op == REQ_OP_WRITE) {
>> +/*
>> + * Don't throttle WRITE_ODIRECT
>> + */
>> +if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
>> +(REQ_SYNC | REQ_IDLE))
>> +return false;
>>  
>> -/*
>> - * Don't throttle WRITE_ODIRECT
>> - */
>> -if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
>> -return false;
>> +return true;
>> +} else if (op == REQ_OP_DISCARD)
>> +return true;
> 
> what about:
> 
>   switch (bio_op(bio)) {
>   case REQ_OP_WRITE:
>   /*
>* Don't throttle WRITE_ODIRECT
>*/
>   if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
>   (REQ_SYNC | REQ_IDLE))
>   return false;
>   /*FALLTHROUGH*/
>   case REQ_OP_DISCARD:
>   return true;
>   default:
>   return false;

Sure, I can do that. I'll spin a v2.

-- 
Jens Axboe



Re: [PATCH 1/3] block: break discard submissions into the user defined size

2018-05-07 Thread Jens Axboe
On 5/7/18 3:51 AM, Christoph Hellwig wrote:
> On Thu, May 03, 2018 at 09:20:41AM -0600, Jens Axboe wrote:
>> Don't build discards bigger than what the user asked for, if the
>> user decided to limit the size by writing to 'discard_max_bytes'.
>>
>> Signed-off-by: Jens Axboe 
> 
> Why do this here when blk_bio_discard_split already takes care of it?

Functionally it should be the same, just seems pointless to defer
the chopping if we can build them the right size to begin with.

-- 
Jens Axboe



[PATCH -v2] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Theodore Ts'o
Refactor the validation code used in LOOP_SET_FD so it is also used in
LOOP_CHANGE_FD.  Otherwise it is possible to construct a set of loop
devices that all refer to each other.  This can lead to a infinite
loop in starting with "while (is_loop_device(f)) .." in loop_set_fd().

Fix this by refactoring out the validation code and using it for
LOOP_CHANGE_FD as well as LOOP_SET_FD.

Reported-by: 
syzbot+4349872271ece473a7c91190b68b4bac7c5db...@syzkaller.appspotmail.com
Reported-by: syzbot+40bd32c4d9a3cc12a...@syzkaller.appspotmail.com
Reported-by: syzbot+769c54e66f994b041...@syzkaller.appspotmail.com
Reported-by: syzbot+0a89a9ce473936c57...@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o 
---
 drivers/block/loop.c | 68 +---
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..0373d64e174c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -644,6 +644,36 @@ static void loop_reread_partitions(struct loop_device *lo,
__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
+static inline int is_loop_device(struct file *file)
+{
+   struct inode *i = file->f_mapping->host;
+
+   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
+}
+
+static int loop_validate_file(struct file *file, struct block_device *bdev)
+{
+   struct inode*inode = file->f_mapping->host;
+   struct file *f = file;
+
+   /* Avoid recursion */
+   while (is_loop_device(f)) {
+   struct loop_device *l;
+
+   if (f->f_mapping->host->i_bdev == bdev)
+   return -EBADF;
+
+   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
+   if (l->lo_state == Lo_unbound) {
+   return -EINVAL;
+   }
+   f = l->lo_backing_file;
+   }
+   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
+   return -EINVAL;
+   return 0;
+}
+
 /*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
@@ -673,14 +703,15 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
if (!file)
goto out;
 
+   error = loop_validate_file(file, bdev);
+   if (error)
+   goto out_putf;
+
inode = file->f_mapping->host;
old_file = lo->lo_backing_file;
 
error = -EINVAL;
 
-   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
-   goto out_putf;
-
/* size of the new backing store needs to be the same */
if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
goto out_putf;
@@ -706,13 +737,6 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
return error;
 }
 
-static inline int is_loop_device(struct file *file)
-{
-   struct inode *i = file->f_mapping->host;
-
-   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
-}
-
 /* loop sysfs attributes */
 
 static ssize_t loop_attr_show(struct device *dev, char *page,
@@ -877,7 +901,7 @@ static int loop_prepare_queue(struct loop_device *lo)
 static int loop_set_fd(struct loop_device *lo, fmode_t mode,
   struct block_device *bdev, unsigned int arg)
 {
-   struct file *file, *f;
+   struct file *file;
struct inode*inode;
struct address_space *mapping;
int lo_flags = 0;
@@ -896,29 +920,13 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
mode,
if (lo->lo_state != Lo_unbound)
goto out_putf;
 
-   /* Avoid recursion */
-   f = file;
-   while (is_loop_device(f)) {
-   struct loop_device *l;
-
-   if (f->f_mapping->host->i_bdev == bdev)
-   goto out_putf;
-
-   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
-   if (l->lo_state == Lo_unbound) {
-   error = -EINVAL;
-   goto out_putf;
-   }
-   f = l->lo_backing_file;
-   }
+   error = loop_validate_file(file, bdev);
+   if (error)
+   goto out_putf;
 
mapping = file->f_mapping;
inode = mapping->host;
 
-   error = -EINVAL;
-   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
-   goto out_putf;
-
if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) ||
!file->f_op->write_iter)
lo_flags |= LO_FLAGS_READ_ONLY;
-- 
2.16.1.72.g5be1f00a9a



Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Theodore Y. Ts'o
On Mon, May 07, 2018 at 10:21:04PM +0900, Tetsuo Handa wrote:
> > I don't understand your concern; where are we going to out_putf when
> > error == 0?

Ah, now I see it, thanks.  I'll send a revised patch.

> By the way, are you aware that current "/* Avoid recursion */" loop is not 
> thread safe?

Actually, it is safe.  While the child loop device has an open file on
the parent, lo_refcnt is elevated, which prevents loop_clr_fd from
actually setting lo_state to Lo_rundown and clearing
lo_backing_file.

- Ted


Re: [PATCH V4 6/7] nvme: pci: prepare for supporting error recovery from resetting context

2018-05-07 Thread James Smart



On 5/5/2018 6:59 AM, Ming Lei wrote:

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,14 +2365,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
	nvme_put_ctrl(&dev->ctrl);
  }
  
-static void nvme_reset_work(struct work_struct *work)

+static void nvme_reset_dev(struct nvme_dev *dev)
  {
-   struct nvme_dev *dev =
-   container_of(work, struct nvme_dev, ctrl.reset_work);
bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
int result = -ENODEV;
enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
  
+	mutex_lock(&dev->ctrl.reset_lock);

+
if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
goto out;
  


I believe the reset_lock is unnecessary (patch 5) as it should be 
covered by the transition of the state to RESETTING which is done under 
lock.


Thus the error is:
instead of:
 if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
     goto out;

it should be:
 if (dev->ctrl.state != NVME_CTRL_RESETTING))
     return;


-- james



Re: lockdep splats with blk-mq drivers

2018-05-07 Thread Sebastian Ott
Hi,

On Thu, 19 Apr 2018, Sebastian Ott wrote:
> Since commit 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a
> RCU and generation based scheme") I observe lockdep complaints (full
> message attached):
> 
> [   21.763369]CPU0CPU1
> [   21.763370]
> [   21.763371]   lock(>gstate_seq);
> [   21.763374]local_irq_disable();
> [   21.763375]lock(&(>lock)->rlock);
> [   21.763377]lock(>gstate_seq);
> [   21.763379]   <Interrupt>
> [   21.763380] lock(&(>lock)->rlock);
> [   21.763382] 
> *** DEADLOCK ***
> 
> 
> This only happens in combination with DASD (s390 specific) and another
> blk-mq driver (scsi, null_blk). The distinctive behavior of DASD is that
> it calls blk_mq_start_request with irqs disabled.
> 
> This is a false positive since gstate_seq on CPU0 is from a different
> request queue / block driver than gstate_seq on CPU1.
> 
> Possible fixes are to use local_irq_save/restore in blk_mq_start_request.
> Or, since it's a false positive, teach lockdep that these are different
> locks - like below:

Is there something we can do here? I've sent 2 proposals to address this. Do they
make sense?

Regards,
Sebastian
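
One way to implement the "teach lockdep that these are different locks" option
mentioned above is to give every request_queue its own lock class key for
gstate_seq. The sketch below is not one of the two posted proposals (those are
not quoted here); the helper name and the per-queue gstate_seq_key field are
hypothetical.

/* Sketch: give each request_queue its own lockdep class for the
 * per-request gstate_seq, so seqcounts belonging to different block
 * drivers are no longer collapsed into one class.
 * "q->gstate_seq_key" would be a new struct lock_class_key field.
 */
static void blk_mq_init_rq_gstate(struct request_queue *q, struct request *rq)
{
	__seqcount_init(&rq->gstate_seq, "&rq->gstate_seq", &q->gstate_seq_key);
	u64_stats_init(&rq->aborted_gstate_sync);
}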



bug in tag handling in blk-mq?

2018-05-07 Thread Paolo Valente
Hi Jens, Christoph, all,
Mike Galbraith has been experiencing hangs, on blk_mq_get_tag, only
with bfq [1].  Symptoms seem to clearly point to a problem in I/O-tag
handling, triggered by bfq because it limits the number of tags for
async and sync write requests (in bfq_limit_depth).

Fortunately, I just happened to find a way to apparently confirm it.
With the following one-liner for block/bfq-iosched.c:

@@ -554,8 +554,7 @@ static void bfq_limit_depth(unsigned int op, struct 
blk_mq_alloc_data *data)
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
 
-   data->shallow_depth =
-   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+   data->shallow_depth = 1;
 
bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
__func__, bfqd->wr_busy_queues, op_is_sync(op),

Mike's machine now crashes soon and systematically, while nothing bad
happens on my machines, even with heavy workloads (apart from an
expected throughput drop).

This change simply reduces to 1 the maximum possible value for the sum
of the number of async requests and of sync write requests.

This email is basically a request for help to knowledgeable people.  To
start, here are my first doubts/questions:
1) Just to be certain, I guess it is not normal that blk-mq hangs if
async requests and sync write requests can be at most one, right?
2) Do you have any hint to where I could look for, to chase this bug?
Of course, the bug may be in bfq, i.e, it may be a somehow unrelated
bfq bug that causes this hang in blk-mq, indirectly.  But it is hard
for me to understand how.

Looking forward to some help.

Thanks,
Paolo

[1] https://www.spinics.net/lists/stable/msg215036.html

Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Theodore Y. Ts'o wrote:
> On Mon, May 07, 2018 at 08:16:58PM +0900, Tetsuo Handa wrote:
> > Oh, your message did not arrive at news.gmane.org and I didn't notice that 
> > you
> > already wrote this patch. But please update yours or review mine shown 
> > below.
> > 
> > > @@ -673,14 +703,13 @@ static int loop_change_fd(struct loop_device *lo, 
> > > struct block_device *bdev,
> > >   if (!file)
> > >   goto out;
> > >  
> > > + error = loop_validate_file(file, bdev);
> > > + if (error)
> > > + goto out_putf;
> > > +
> > >   inode = file->f_mapping->host;
> > >   old_file = lo->lo_backing_file;
> > >  
> > > - error = -EINVAL;
> > > -
> > > - if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
> > > - goto out_putf;
> > > -
> > >   /* size of the new backing store needs to be the same */
> > >   if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
> > >   goto out_putf;
> > 
> > error == 0 upon "goto out_putf" is wrong.
> 
> I don't understand your concern; where are we going to out_putf when
> error == 0?

-   error = -EINVAL;  /* <= You are trying to remove this assignment. */
-
-   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
-   goto out_putf;
/* size of the new backing store needs to be the same */
if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
goto out_putf; /* <= Hence error == 0 at this point. */

By the way, are you aware that current "/* Avoid recursion */" loop is not 
thread safe?


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Theodore Y. Ts'o
On Mon, May 07, 2018 at 08:16:58PM +0900, Tetsuo Handa wrote:
> Oh, your message did not arrive at news.gmane.org and I didn't notice that you
> already wrote this patch. But please update yours or review mine shown below.
> 
> > @@ -673,14 +703,13 @@ static int loop_change_fd(struct loop_device *lo, 
> > struct block_device *bdev,
> > if (!file)
> > goto out;
> >  
> > +   error = loop_validate_file(file, bdev);
> > +   if (error)
> > +   goto out_putf;
> > +
> > inode = file->f_mapping->host;
> > old_file = lo->lo_backing_file;
> >  
> > -   error = -EINVAL;
> > -
> > -   if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
> > -   goto out_putf;
> > -
> > /* size of the new backing store needs to be the same */
> > if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
> > goto out_putf;
> 
> error == 0 upon "goto out_putf" is wrong.

I don't understand your concern; where are we going to out_putf when
error == 0?

The relevant code that was added is:

error = loop_validate_file(file, bdev);
if (error)
goto out_putf;


- Ted


Re: [PATCH v2 0/3] Rework write error handling in pblk

2018-05-07 Thread Matias Bjørling

On 04/24/2018 07:45 AM, Hans Holmberg wrote:

From: Hans Holmberg 

This patch series fixes the(currently incomplete) write error handling
in pblk by:

  * queuing and re-submitting failed writes in the write buffer
  * evacuating valid data data in lines with write failures, so the
chunk(s) with write failures can be reset to a known state by the fw

Lines with failures in smeta are put back on the free list.
Failed chunks will be reset on the next use.

If a write failes in emeta, the lba list is cached so the line can be
garbage collected without scanning the out-of-band area.

Changes in V2:
- Added the recov_writes counter increase to the new path
- Moved lba list emeta reading during gc to a separate function
- Allocating the saved lba list with pblk_malloc instead of kmalloc
- Fixed formatting issues
- Removed dead code

Hans Holmberg (3):
   lightnvm: pblk: rework write error recovery path
   lightnvm: pblk: garbage collect lines with failed writes
   lightnvm: pblk: fix smeta write error path

  drivers/lightnvm/pblk-core.c |  52 +++-
  drivers/lightnvm/pblk-gc.c   | 102 +--
  drivers/lightnvm/pblk-init.c |  47 ---
  drivers/lightnvm/pblk-rb.c   |  39 --
  drivers/lightnvm/pblk-recovery.c |  91 -
  drivers/lightnvm/pblk-rl.c   |  29 -
  drivers/lightnvm/pblk-sysfs.c|  15 ++-
  drivers/lightnvm/pblk-write.c| 269 ++-
  drivers/lightnvm/pblk.h  |  36 --
  9 files changed, 384 insertions(+), 296 deletions(-)



Applied for 4.18.


Re: [V2 PATCH 00/11] lightnvm: pblk: small fixes

2018-05-07 Thread Matias Bjørling

On 04/30/2018 11:09 AM, Javier González wrote:

Changes since V1 (from Matias):
  - Improve commit messages
  - Fix error code on read path refactoring

Changes can be found at. This includes this series, the 2 extra patches I sent
for pblk and Hans' write recovery on top.
https://github.com/OpenChannelSSD/linux/tree/for-4.18/pblk

A bunch of small fixes and extra checks for pblk. Non is critical, though
("lightnvm: pblk: check for chunk size before allocating it") might be nice to
get into 4.17 as it is a fix for the 2.0 pblk patches.

Javier

Javier González (11):
   lightnvm: pblk: fail gracefully on line alloc. failure
   lightnvm: pblk: recheck for bad lines at runtime
   lightnvm: pblk: check read lba on gc path
   lightnvn: pblk: improve error msg on corrupted LBAs
   lightnvm: pblk: warn in case of corrupted write buffer
   lightnvm: pblk: return NVM_ error on failed submission
   lightnvm: pblk: remove unnecessary indirection
   lightnvm: pblk: remove unnecessary argument
   lightnvm: pblk: check for chunk size before allocating it
   lightnvn: pass flag on graceful teardown to targets
   lightnvm: pblk: remove dead function

  drivers/lightnvm/core.c  | 10 +++---
  drivers/lightnvm/pblk-core.c | 86 ++--
  drivers/lightnvm/pblk-gc.c   | 10 +++---
  drivers/lightnvm/pblk-init.c | 38 
  drivers/lightnvm/pblk-map.c  | 33 -
  drivers/lightnvm/pblk-rb.c   |  5 ++-
  drivers/lightnvm/pblk-read.c | 75 --
  drivers/lightnvm/pblk.h  |  7 ++--
  include/linux/lightnvm.h |  2 +-
  9 files changed, 174 insertions(+), 92 deletions(-)



Applied for 4.18.


Re: [PATCH] loop: add recursion validation to LOOP_CHANGE_FD

2018-05-07 Thread Tetsuo Handa
Oh, your message did not arrive at news.gmane.org and I didn't notice that you
already wrote this patch. But please update yours or review mine shown below.

> @@ -673,14 +703,13 @@ static int loop_change_fd(struct loop_device *lo, 
> struct block_device *bdev,
>   if (!file)
>   goto out;
>  
> + error = loop_validate_file(file, bdev);
> + if (error)
> + goto out_putf;
> +
>   inode = file->f_mapping->host;
>   old_file = lo->lo_backing_file;
>  
> - error = -EINVAL;
> -
> - if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
> - goto out_putf;
> -
>   /* size of the new backing store needs to be the same */
>   if (get_loop_size(lo, file) != get_loop_size(lo, old_file))
>   goto out_putf;

error == 0 upon "goto out_putf" is wrong.



>From eed54c6ae475860a9c63b37b58f34735e792eef7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Sat, 5 May 2018 12:59:12 +0900
Subject: [PATCH] block/loop: Add recursion check for LOOP_CHANGE_FD request.

syzbot is reporting hung tasks at blk_freeze_queue() [1]. This is
due to ioctl(loop_fd, LOOP_CHANGE_FD, loop_fd) request which should be
rejected. Fix this by adding same recursion check which is used by
LOOP_SET_FD request.

[1] 
https://syzkaller.appspot.com/bug?id=078805a692853552e08154b1bcd2bc2c729eda88

Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Cc: Jens Axboe 
---
 drivers/block/loop.c | 59 
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e316..cee3c01 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -644,6 +644,34 @@ static void loop_reread_partitions(struct loop_device *lo,
__func__, lo->lo_number, lo->lo_file_name, rc);
 }
 
+static inline int is_loop_device(struct file *file)
+{
+   struct inode *i = file->f_mapping->host;
+
+   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
+}
+
+static int check_loop_recursion(struct file *f, struct block_device *bdev)
+{
+   /*
+* FIXME: Traversing on other loop devices without corresponding
+* lo_ctl_mutex is not safe. l->lo_state can become Lo_rundown and
+* l->lo_backing_file can become NULL when raced with LOOP_CLR_FD.
+*/
+   while (is_loop_device(f)) {
+   struct loop_device *l;
+
+   if (f->f_mapping->host->i_bdev == bdev)
+   return -EBUSY;
+
+   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
+   if (l->lo_state == Lo_unbound)
+   return -EINVAL;
+   f = l->lo_backing_file;
+   }
+   return 0;
+}
+
 /*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
@@ -673,6 +701,11 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
if (!file)
goto out;
 
+   /* Avoid recursion */
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
+
inode = file->f_mapping->host;
old_file = lo->lo_backing_file;
 
@@ -706,13 +739,6 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
return error;
 }
 
-static inline int is_loop_device(struct file *file)
-{
-   struct inode *i = file->f_mapping->host;
-
-   return i && S_ISBLK(i->i_mode) && MAJOR(i->i_rdev) == LOOP_MAJOR;
-}
-
 /* loop sysfs attributes */
 
 static ssize_t loop_attr_show(struct device *dev, char *page,
@@ -877,7 +903,7 @@ static int loop_prepare_queue(struct loop_device *lo)
 static int loop_set_fd(struct loop_device *lo, fmode_t mode,
   struct block_device *bdev, unsigned int arg)
 {
-   struct file *file, *f;
+   struct file *file;
struct inode *inode;
struct address_space *mapping;
int lo_flags = 0;
@@ -897,20 +923,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
goto out_putf;
 
/* Avoid recursion */
-   f = file;
-   while (is_loop_device(f)) {
-   struct loop_device *l;
-
-   if (f->f_mapping->host->i_bdev == bdev)
-   goto out_putf;
-
-   l = f->f_mapping->host->i_bdev->bd_disk->private_data;
-   if (l->lo_state == Lo_unbound) {
-   error = -EINVAL;
-   goto out_putf;
-   }
-   f = l->lo_backing_file;
-   }
+   error = check_loop_recursion(file, bdev);
+   if (error)
+   goto out_putf;
 
mapping = file->f_mapping;
inode = mapping->host;
-- 
1.8.3.1
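
For context, the case being rejected is simply a loop device offered as its
own new backing file. A minimal illustrative reproducer (hypothetical, not
the syzkaller one; LOOP_CHANGE_FD additionally requires the target device to
already be bound and read-only, so those preconditions must be set up first):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	int main(void)
	{
		int fd = open("/dev/loop0", O_RDWR);

		if (fd < 0)
			return 1;
		/*
		 * Offer the loop device as its own backing file.  With the
		 * recursion check above this is refused (-EBUSY) instead of
		 * being accepted.
		 */
		return ioctl(fd, LOOP_CHANGE_FD, fd) == 0;
	}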


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-07 Thread Mike Galbraith
On Mon, 2018-05-07 at 11:27 +0200, Paolo Valente wrote:
> 
> 
> Where is the bug?

Hm, seems potent pain-killers and C don't mix all that well.



Re: [PATCH 3/3] blk-wbt: throttle discards like background writes

2018-05-07 Thread Christoph Hellwig
> -static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
> +static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_trim,
> +   bool is_kswapd)
>  {
> - return &rwb->rq_wait[is_kswapd];
> + if (is_trim)
> + return &rwb->rq_wait[WBT_REQ_DISCARD];
> + else if (is_kswapd)
> + return &rwb->rq_wait[WBT_REQ_KSWAPD];
> + else
> + return &rwb->rq_wait[WBT_REQ_BG];
>  }

Wouldn't it be more useful to pass an enum wbt_flag here?

Or just have a wbt_flag_to_wait_idx helper and do the array indexing
in the callers?
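
A helper along those lines might look roughly like this (sketch only; it
assumes this series adds a WBT_DISCARD bit to the wbt flags, and callers
would then index rwb->rq_wait[] themselves):

	static inline int wbt_flags_to_wait_idx(enum wbt_flags flags)
	{
		if (flags & WBT_DISCARD)
			return WBT_REQ_DISCARD;
		if (flags & WBT_KSWAPD)
			return WBT_REQ_KSWAPD;
		return WBT_REQ_BG;
	}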

>  {
>   const int op = bio_op(bio);
>  
> - /*
> -  * If not a WRITE, do nothing
> -  */
> - if (op != REQ_OP_WRITE)
> - return false;
> + if (op == REQ_OP_WRITE) {
> + /*
> +  * Don't throttle WRITE_ODIRECT
> +  */
> + if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
> + (REQ_SYNC | REQ_IDLE))
> + return false;
>  
> - /*
> -  * Don't throttle WRITE_ODIRECT
> -  */
> - if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
> - return false;
> + return true;
> + } else if (op == REQ_OP_DISCARD)
> + return true;

what about:

switch (bio_op(bio)) {
case REQ_OP_WRITE:
	/*
	 * Don't throttle WRITE_ODIRECT
	 */
	if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
	    (REQ_SYNC | REQ_IDLE))
		return false;
	/*FALLTHROUGH*/
case REQ_OP_DISCARD:
	return true;
default:
	return false;


Re: [PATCH 2/3] blk-wbt: account any writing command as a write

2018-05-07 Thread Christoph Hellwig
On Thu, May 03, 2018 at 09:20:42AM -0600, Jens Axboe wrote:
> We currently special case WRITE and FLUSH, but we should really
> just include any command with the write bit set. This ensures
> that we account DISCARD.

Looks fine,

Reviewed-by: Christoph Hellwig 

> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index f92fc84b5e2c..3e34b41bcefc 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -701,7 +701,7 @@ static int wbt_data_dir(const struct request *rq)
>  
>   if (op == REQ_OP_READ)
>   return READ;
> - else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> + else if (op_is_write(op))
>   return WRITE;
>  
>   /* don't account */

But this whole function smells: the return value is an array
index, so instead of abusing read/write it should return 0 or 1
and be called _idx or similar.  But I guess that is a question for the
higher-level interface.
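
Something like the following would make the array-index nature explicit
(names invented here for illustration; the stats callers would need the
matching rename):

	enum {
		WBT_STAT_READ	= 0,
		WBT_STAT_WRITE	= 1,
	};

	static int wbt_stat_idx(const struct request *rq)
	{
		const int op = req_op(rq);

		if (op == REQ_OP_READ)
			return WBT_STAT_READ;
		if (op_is_write(op))
			return WBT_STAT_WRITE;

		/* don't account */
		return -1;
	}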


Re: [PATCH 1/3] block: break discard submissions into the user defined size

2018-05-07 Thread Christoph Hellwig
On Thu, May 03, 2018 at 09:20:41AM -0600, Jens Axboe wrote:
> Don't build discards bigger than what the user asked for, if the
> user decided to limit the size by writing to 'discard_max_bytes'.
> 
> Signed-off-by: Jens Axboe 

Why do this here when blk_bio_discard_split already takes care of it?
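
For reference, blk_bio_discard_split() already clamps each discard bio to
the queue limit before it becomes a request, roughly along these lines
(paraphrased, not a verbatim quote):

	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
	max_discard_sectors -= max_discard_sectors % granularity;

and q->limits.max_discard_sectors is what writing to 'discard_max_bytes'
updates, which is why splitting again at submission time looks redundant.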


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-07 Thread Paolo Valente


> On 7 May 2018, at 05:23, Mike Galbraith wrote:
> 
> On Mon, 2018-05-07 at 04:43 +0200, Mike Galbraith wrote:
>> On Sun, 2018-05-06 at 09:42 +0200, Paolo Valente wrote:
>>> 
>>> I've attached a compressed patch (to avoid possible corruption from my
>>> mailer).  I'm not very confident, but no pain, no gain, right?
>>> 
>>> If possible, apply this patch on top of the fix I proposed in this
>>> thread, just to eliminate possible further noise. Finally, the
>>> patch content follows.
>>> 
>>> Hoping for a stroke of luck,
>> 
>> FWIW, box didn't survive the first full build of the morning.
> 
> Nor the second.
> 

Great, finally the first good news!

Help from blk-mq experts would be essential here.  To increase the
chances of getting it, I'm going to open a new thread on this.  In that
thread I'll ask you to provide an OOPS or something similar; I hope it is
now easier for you to get one.

Thanks,
Paolo

>   -Mike



Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-07 Thread Paolo Valente


> On 7 May 2018, at 07:56, Mike Galbraith wrote:
> 
> On Sun, 2018-05-06 at 09:42 +0200, Paolo Valente wrote:
>> 
>> diff --git a/block/bfq-mq-iosched.c b/block/bfq-mq-iosched.c
>> index 118f319af7c0..6662efe29b69 100644
>> --- a/block/bfq-mq-iosched.c
>> +++ b/block/bfq-mq-iosched.c
>> @@ -525,8 +525,13 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
>>if (unlikely(bfqd->sb_shift != bt->sb.shift))
>>bfq_update_depths(bfqd, bt);
>> 
>> +#if 0
>>data->shallow_depth =
>>bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
>^
> 
> Q: why doesn't the top of this function look like so?
> 
> ---
> block/bfq-iosched.c |2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -539,7 +539,7 @@ static void bfq_limit_depth(unsigned int
>   struct bfq_data *bfqd = data->q->elevator->elevator_data;
>   struct sbitmap_queue *bt;
> 
> - if (op_is_sync(op) && !op_is_write(op))
> + if (!op_is_write(op))
>   return;
> 
>   if (data->flags & BLK_MQ_REQ_RESERVED) {
> 
> It looks a bit odd that these elements exist...
> 
> +   /*
> +* no more than 75% of tags for sync writes (25% extra tags
> +* w.r.t. async I/O, to prevent async I/O from starving sync
> +* writes)
> +*/
> +   bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
> 
> +   /* no more than ~37% of tags for sync writes (~20% extra tags) */
> +   bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
> 
> ...yet we index via and log a guaranteed zero.
> 

I'm not sure I got your point, so, to help you help me quickly, I'll
repeat what I expect the code you highlight to do:

- sync reads must have no limitation, and the lines
if (op_is_sync(op) && !op_is_write(op))
return;
make sure they don't

- sync writes must be limited, and the code you pasted above computes
those limits

- for sync writes, for which op_is_sync(op) is true (but the condition
"op_is_sync(op) && !op_is_write(op)" is false), the line:
bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
becomes
bfqd->word_depths[!!bfqd->wr_busy_queues][1];
and yields the right limit for sync writes, depending on bfqd->wr_busy_queues.
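
For clarity, this is how the lookup resolves (the async-I/O figures come
from the companion comments in bfq_update_depths()):

	/*
	 * bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)]
	 *
	 * [0][0]  no wr_busy_queues, async I/O       -> at most 50% of tags
	 * [0][1]  no wr_busy_queues, sync writes     -> at most 75% of tags
	 * [1][0]  weight-raising active, async I/O   -> at most ~18% of tags
	 * [1][1]  weight-raising active, sync writes -> at most ~37% of tags
	 *
	 * Sync reads never reach the lookup because of the early return.
	 */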

Where is the bug?

Thanks,
Paolo


>   -Mike
> 
>