[GIT PULL] MD update for 4.20

2018-10-26 Thread Shaohua Li
Hi,
Please pull the MD update for 4.20. This pull mainly improves clustered
raid10 support and fixes some bugs.
- raid10 cluster improvements from Guoqing
- Memory leak fixes from Jack and Xiao
- raid10 hang fix from Alex
- raid5 block faulty device fix from Mariusz
- metadata update fix from Neil
- Invalid disk role fix from me
- Other cleanups

Thanks,
Shaohua

The following changes since commit f151f57bfd97fb8c76bbef9e181ecba5dd750f2a:

  Merge tag 'drm-fixes-2018-09-28' of git://anongit.freedesktop.org/drm/drm 
(2018-09-28 18:55:17 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to af9b926de9c5986ab009e64917de87c9758bab10:

  MD: Memory leak when flush bio size is zero (2018-10-22 09:15:26 -0700)


Alex Wu (1):
  md/raid10: Fix raid10 replace hang when new added disk faulty

Colin Ian King (1):
  md: remove redundant code that is no longer reachable

Guoqing Jiang (8):
  md-cluster/raid10: resize all the bitmaps before start reshape
  md-cluster/raid10: support add disk under grow mode
  md-cluster: introduce resync_info_get interface for sanity check
  md-cluster/raid10: call update_size in md_reap_sync_thread
  md-cluster/raid10: don't call remove_and_add_spares during reshaping stage
  md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during 
reshaping stage
  md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted
  md-cluster: remove suspend_info

Jack Wang (2):
  md/bitmap: use mddev_suspend/resume instead of ->quiesce()
  md: fix memleak for mempool

Mariusz Tkaczyk (1):
  raid5: block failing device if raid will be failed

NeilBrown (1):
  md: allow metadata updates while suspending an array - fix

Shaohua Li (2):
  MD: fix invalid stored role for a disk
  MD: fix invalid stored role for a disk - try2

Xiao Ni (1):
  MD: Memory leak when flush bio size is zero

 drivers/md/md-bitmap.c   |   9 +-
 drivers/md/md-cluster.c  | 234 ---
 drivers/md/md-cluster.h  |   2 +
 drivers/md/md.c  | 113 +--
 drivers/md/md.h  |   1 +
 drivers/md/raid1.c   |   1 +
 drivers/md/raid10.c  | 109 ++
 drivers/md/raid5-cache.c |   2 -
 drivers/md/raid5.c   |  12 +++
 9 files changed, 356 insertions(+), 127 deletions(-)


Re: [LKP] [MD] d595567dc4: mdadm-selftests.02lineargrow.fail

2018-10-14 Thread Shaohua Li
On Fri, Oct 12, 2018 at 04:58:44PM +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: d595567dc4f0c1d90685ec1e2e296e2cad2643ac ("MD: fix invalid stored 
> role for a disk")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> in testcase: mdadm-selftests
> with following parameters:
> 
>   disk: 1HDD
>   test_prefix: 02
> 
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 4G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> 
> 
> 
> Welcome to fdisk (util-linux 2.29.2).
> Changes will remain in memory only, until you decide to write them.
> Be careful before using the write command.
> 
> Device does not contain a recognized partition table.
> Created a new DOS disklabel with disk identifier 0xd50ae9f0.
> 
> Command (m for help): Partition type
>p   primary (0 primary, 0 extended, 4 free)
>e   extended (container for logical partitions)
> Select (default p): Partition number (1-4, default 1): First sector 
> (2048-536870911, default 2048): Last sector, +sectors or +size{K,M,G,T,P} 
> (2048-536870911, default 536870911): 
> Created a new partition 1 of type 'Linux' and of size 5 GiB.
> 
> Command (m for help): The partition table has been altered.
> Calling ioctl() to re-read partition table.
> Syncing disks.
> 
> 2018-10-12 11:05:55 mkdir -p /var/tmp
> 2018-10-12 11:05:55 mke2fs -t ext3 -b 1024 -J size=1 -q /dev/vda1
> 2018-10-12 11:06:00 mount -t ext3 /dev/vda1 /var/tmp
> sed -e 's/{DEFAULT_METADATA}/1.2/g' \
> -e 's,{MAP_PATH},/run/mdadm/map,g'  mdadm.8.in > mdadm.8
> /usr/bin/install -D -m 644 mdadm.8 /usr/share/man/man8/mdadm.8
> /usr/bin/install -D -m 644 mdmon.8 /usr/share/man/man8/mdmon.8
> /usr/bin/install -D -m 644 md.4 /usr/share/man/man4/md.4
> /usr/bin/install -D -m 644 mdadm.conf.5 /usr/share/man/man5/mdadm.conf.5
> /usr/bin/install -D -m 644 udev-md-raid-arrays.rules 
> /lib/udev/rules.d/63-md-raid-arrays.rules
> /usr/bin/install -D -m 644 udev-md-raid-assembly.rules 
> /lib/udev/rules.d/64-md-raid-assembly.rules
> /usr/bin/install -D  -m 755 mdadm /sbin/mdadm
> /usr/bin/install -D  -m 755 mdmon /sbin/mdmon
> Testing on linux-4.19.0-rc5-00182-gd595567 kernel
> tests/02lineargrow... FAILED - see /var/tmp/log for details
> Testing on linux-4.19.0-rc5-00182-gd595567 kernel
> tests/02r1add... succeeded
> Testing on linux-4.19.0-rc5-00182-gd595567 kernel
> tests/02r1grow... succeeded
> Testing on linux-4.19.0-rc5-00182-gd595567 kernel
> tests/02r5grow... succeeded
> Testing on linux-4.19.0-rc5-00182-gd595567 kernel
> tests/02r6grow... succeeded
> 
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
>   bin/lkp qemu -k  -m modules.cgz job-script # job-script is 
> attached in this email

Thanks for reporting, I fixed it in md-next tree.

Thanks,
Shaohua


Re: [PATCH] md/bitmap: use mddev_suspend/resume instead of ->quiesce()

2018-09-28 Thread Shaohua Li
On Thu, Sep 27, 2018 at 10:07:57AM +0200, Jack Wang wrote:
> From: Jack Wang 
> 
> After 9e1cc0a54556 ("md: use mddev_suspend/resume instead of ->quiesce()")
> we still have similar calls left in the bitmap functions.
> 
> Replace quiesce() with mddev_suspend/resume.
> 
> Also move md_bitmap_create out of the suspended section, and move
> mddev_resume after md_bitmap_destroy, as we did in set_bitmap_file.
> 
> Signed-off-by: Jack Wang 
> Reviewed-by: Gioh Kim 
> 
> ---
> v2->v1: add reviewed-by.
> ---
>  drivers/md/md-bitmap.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 2fc8c113977f..c369f1753ea6 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -2123,7 +2123,7 @@ int md_bitmap_resize(struct bitmap *bitmap, sector_t 
> blocks,
>   }
>  
>   if (!init)
> - bitmap->mddev->pers->quiesce(bitmap->mddev, 1);
> + mddev_suspend(bitmap->mddev);


mddev_suspend is supposed to be called with reconfig_mutex held. In at least
one place that isn't true with this change; for example, raid_preresume
doesn't call md_bitmap_resize with the lock held. Could you please double
check the lock usage?
 
>   store.file = bitmap->storage.file;
>   bitmap->storage.file = NULL;
> @@ -2239,7 +2239,7 @@ int md_bitmap_resize(struct bitmap *bitmap, sector_t 
> blocks,
>  
>   if (!init) {
>   md_bitmap_unplug(bitmap);
> - bitmap->mddev->pers->quiesce(bitmap->mddev, 0);
> + mddev_resume(bitmap->mddev);
>   }
>   ret = 0;
>  err:
> @@ -2288,9 +2288,9 @@ location_store(struct mddev *mddev, const char *buf, 
> size_t len)
>   goto out;
>   }
>   if (mddev->pers) {
> - mddev->pers->quiesce(mddev, 1);
> + mddev_suspend(mddev);
>   md_bitmap_destroy(mddev);
> - mddev->pers->quiesce(mddev, 0);
> + mddev_resume(mddev);
>   }
>   mddev->bitmap_info.offset = 0;
>   if (mddev->bitmap_info.file) {
> @@ -2327,8 +2327,8 @@ location_store(struct mddev *mddev, const char *buf, 
> size_t len)
>   mddev->bitmap_info.offset = offset;
>   if (mddev->pers) {
>   struct bitmap *bitmap;
> - mddev->pers->quiesce(mddev, 1);
>   bitmap = md_bitmap_create(mddev, -1);
> + mddev_suspend(mddev);
>   if (IS_ERR(bitmap))
>   rv = PTR_ERR(bitmap);
>   else {
> @@ -2337,11 +2337,12 @@ location_store(struct mddev *mddev, const char *buf, 
> size_t len)
>   if (rv)
>   mddev->bitmap_info.offset = 0;
>   }
> - mddev->pers->quiesce(mddev, 0);
>   if (rv) {
>   md_bitmap_destroy(mddev);
> + mddev_resume(mddev);
>   goto out;
>   }
> + mddev_resume(mddev);
>   }
>   }
>   }
> -- 
> 2.7.4
> 


[GIT PULL] MD update for 4.19-rc2

2018-09-07 Thread Shaohua Li
Hi,
Please pull MD fixes for 4.19-rc2:
- Fix a locking issue for md-cluster from Guoqing
- Fix a sync crash for raid10 from Ni
- Fix a reshape bug with raid5 cache enabled from me

Thanks,
Shaohua

The following changes since commit 420f51f4ab6bce6e580390729fadb89c31123636:

  Merge tag 'arm64-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux (2018-08-31 09:20:30 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.19-rc2

for you to fetch changes up to 41a95041126522a921fb73df22cbdd520dfdebad:

  md-cluster: release RESYNC lock after the last resync message (2018-08-31 
17:38:10 -0700)


Guoqing Jiang (1):
  md-cluster: release RESYNC lock after the last resync message

Shaohua Li (1):
  md/raid5-cache: disable reshape completely

Xiao Ni (1):
  RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0

 drivers/md/md-cluster.c | 10 +-
 drivers/md/raid10.c |  5 -
 drivers/md/raid5-log.h  |  5 +
 drivers/md/raid5.c  |  6 +++---
 4 files changed, 17 insertions(+), 9 deletions(-)


[GIT PULL] MD update for 4.19-rc1

2018-08-13 Thread Shaohua Li
Hi,
A few MD fixes for 4.19-rc1:
- Several md-cluster fixes from Guoqing
- A data corruption fix from BingJing
- Other cleanups
Please pull!

Thanks,
Shaohua

The following changes since commit 06c85639897cf3ea6a11c5cb6929fb0d9d7efea5:

  Merge tag 'acpi-4.18-rc4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm (2018-07-05 
09:52:30 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to d63e2fc804c46e50eee825c5d3a7228e07048b47:

  md/raid5: fix data corruption of replacements after originals dropped 
(2018-08-02 11:22:06 -0700)


Anna-Maria Gleixner (2):
  drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()
  drivers/md/raid5: Do not disable irq on release_inactive_stripe_list() 
call

BingJing Chang (1):
  md/raid5: fix data corruption of replacements after originals dropped

Colin Ian King (1):
  md/r5cache: remove redundant pointer bio

Guoqing Jiang (3):
  md-cluster: clear another node's suspend_area after the copy is finished
  md-cluster: show array's status more accurate
  md-cluster: don't send msg if array is closing

 drivers/md/md-cluster.c  | 47 +++
 drivers/md/md.c  | 17 +
 drivers/md/md.h  |  1 +
 drivers/md/raid5-cache.c |  2 --
 drivers/md/raid5.c   | 12 
 5 files changed, 61 insertions(+), 18 deletions(-)


Re: [PATCH 1/6] drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()

2018-07-18 Thread Shaohua Li
On Wed, Jul 18, 2018 at 12:57:21PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-07-16 17:37:27 [-0700], Shaohua Li wrote:
> > On Mon, Jul 16, 2018 at 02:27:40PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2018-07-03 22:01:36 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > > From: Anna-Maria Gleixner 
> > > > 
> > > > The irqsave variant of atomic_dec_and_lock handles irqsave/restore when
> > > > taking/releasing the spin lock. With this variant the call of
> > > > local_irq_save is no longer required.
> > > 
> > > Shaohua, are you with this?
> > 
> > Acked-by: Shaohua Li  
> 
> Thank you. 
> Is there a reason why you did not apply it or are you too busy now and
> do it later (and just signal the ack that you are fine with it)?

Since you sent a series, I suppose you want someone else to take it. But I can
take it for sure, will do soon.

Thanks,
Shaohua


Re: [PATCH 1/6] drivers/md/raid5: Use irqsave variant of atomic_dec_and_lock()

2018-07-16 Thread Shaohua Li
On Mon, Jul 16, 2018 at 02:27:40PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-07-03 22:01:36 [+0200], To linux-kernel@vger.kernel.org wrote:
> > From: Anna-Maria Gleixner 
> > 
> > The irqsave variant of atomic_dec_and_lock handles irqsave/restore when
> > taking/releasing the spin lock. With this variant the call of
> > local_irq_save is no longer required.
> 
> Shaohua, are you with this?

Acked-by: Shaohua Li  
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Acked-by: Peter Zijlstra (Intel) 
> > Signed-off-by: Anna-Maria Gleixner 
> > Signed-off-by: Sebastian Andrzej Siewior 
> > ---
> >  drivers/md/raid5.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 2031506a0ecd..e933bff9459e 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -409,16 +409,15 @@ void raid5_release_stripe(struct stripe_head *sh)
> > md_wakeup_thread(conf->mddev->thread);
> > return;
> >  slow_path:
> > -   local_irq_save(flags);
> > /* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
> > -   if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
> > +   if (atomic_dec_and_lock_irqsave(&sh->count, &conf->device_lock, flags)) {
> > INIT_LIST_HEAD(&list);
> > hash = sh->hash_lock_index;
> > do_release_stripe(conf, sh, &list);
> > spin_unlock(&conf->device_lock);
> > release_inactive_stripe_list(conf, &list, hash);
> > +   local_irq_restore(flags);
> > }
> > -   local_irq_restore(flags);
> >  }
> >  
> >  static inline void remove_hash(struct stripe_head *sh)
> > -- 
> > 2.18.0
> > 
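
(Aside, since the commit message above is terse: the contract of the
irqsave variant is roughly the sketch below -- interrupts are disabled,
and flags written, only on the slow path where the count actually drops
to zero and the lock is taken, which is why local_irq_restore() moves
inside the if-branch. Simplified illustration, not the exact kernel
implementation:)

        /* sketch of atomic_dec_and_lock_irqsave(cnt, lock, flags) */
        static int sketch_dec_and_lock_irqsave(atomic_t *cnt, spinlock_t *lock,
                                               unsigned long *flags)
        {
                /* fast path: decrement unless that would hit zero */
                if (atomic_add_unless(cnt, -1, 1))
                        return 0;       /* count stayed > 0; no lock, irqs untouched */
                /* slow path: take the lock with irqs off, then re-check */
                spin_lock_irqsave(lock, *flags);
                if (atomic_dec_and_test(cnt))
                        return 1;       /* caller holds lock, irqs disabled */
                spin_unlock_irqrestore(lock, *flags);
                return 0;
        }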


[GIT PULL] MD update for 4.18-rc3

2018-07-02 Thread Shaohua Li
Hi,
Two small fixes for MD:
- An error handling fix from me
- A recovery bug fix for raid10 from BingJing
please pull!

Thanks,
Shaohua


The following changes since commit 9ffc59d57228d74809700be6f7ecb1db10292f05:

  Merge tag '4.18-rc1-more-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 
(2018-06-18 14:28:19 +0900)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to bda3153998f3eb2cafa4a6311971143628eacdbc:

  md/raid10: fix that replacement cannot complete recovery after reassemble 
(2018-06-28 13:04:49 -0700)


BingJing Chang (1):
  md/raid10: fix that replacement cannot complete recovery after reassemble

Shaohua Li (1):
  MD: cleanup resources in failure

 drivers/md/md.c | 8 +---
 drivers/md/raid10.c | 7 +++
 2 files changed, 12 insertions(+), 3 deletions(-)


[GIT PULL] MD update for 4.18-rc

2018-06-09 Thread Shaohua Li
Hi,
A few MD fixes for this merge window, mostly bug fixes:
- raid5 stripe batch fix from Amy
- Read error handling for raid1 FailFast device from Gioh
- raid10 recovery NULL pointer dereference fix from Guoqing
- Support write hint for raid5 stripe cache from Mariusz
- Fixes for device hot add/remove from Neil and Yufen
- Improve flush bio scalability from Xiao

There is a merge conflict, I attached the fix below. Please pull!

Thanks,
Shaohua

The following changes since commit 83beed7b2b26f232d782127792dd0cd4362fdc41:

  Merge branch 'fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal 
(2018-04-20 10:56:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to 5a409b4f56d50b212334f338cb8465d65550cd85:

  MD: fix lock contention for flush bios (2018-05-21 09:30:26 -0700)


Amy Chiang (1):
  md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap 
of a stripe

Gioh Kim (1):
  md/raid1: add error handling of read error from FailFast device

Guoqing Jiang (1):
  raid10: check bio in r10buf_pool_free to void NULL pointer dereference

Mariusz Dabrowski (1):
  raid5: copy write hint from origin bio to stripe

NeilBrown (1):
  md: fix two problems with setting the "re-add" device state.

Xiao Ni (1):
  MD: fix lock contention for flush bios

Yufen Yu (2):
  md: fix an error code format and remove unsed bio_sector
  md: fix NULL dereference of mddev->pers in remove_and_add_spares()

 drivers/md/md.c | 165 +++-
 drivers/md/md.h |  22 ---
 drivers/md/raid1.c  |   4 +-
 drivers/md/raid10.c |  10 ++--
 drivers/md/raid5.c  |  12 +++-
 drivers/md/raid5.h  |   1 +
 6 files changed, 144 insertions(+), 70 deletions(-)


diff --cc drivers/md/md.c
index 22203eba1e6e,6b4e2f29fe4e..
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@@ -5489,16 -5518,34 +5510,32 @@@ int md_run(struct mddev *mddev
sysfs_notify_dirent_safe(rdev->sysfs_state);
}
  
 -  if (mddev->bio_set == NULL) {
 -          mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 -          if (!mddev->bio_set)
 -                  return -ENOMEM;
 +  if (!bioset_initialized(&mddev->bio_set)) {
 +          err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 +          if (err)
 +                  return err;
        }
 -  if (mddev->sync_set == NULL) {
 -          mddev->sync_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 -          if (!mddev->sync_set) {
 -                  err = -ENOMEM;
 -                  goto abort;
 -          }
 +  if (!bioset_initialized(&mddev->sync_set)) {
 +          err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 +          if (err)
 +                  return err;
        }
+   if (mddev->flush_pool == NULL) {
+   mddev->flush_pool = mempool_create(NR_FLUSH_INFOS, 
flush_info_alloc,
+   flush_info_free, mddev);
+   if (!mddev->flush_pool) {
+   err = -ENOMEM;
+   goto abort;
+   }
+   }
+   if (mddev->flush_bio_pool == NULL) {
+   mddev->flush_bio_pool = mempool_create(NR_FLUSH_BIOS, 
flush_bio_alloc,
+   flush_bio_free, mddev);
+   if (!mddev->flush_bio_pool) {
+   err = -ENOMEM;
+   goto abort;
+   }
+   }
  
	spin_lock(&pers_lock);
pers = find_pers(mddev->level, mddev->clevel);
@@@ -5654,6 -5703,26 +5693,17 @@@
sysfs_notify_dirent_safe(mddev->sysfs_action);
	sysfs_notify(&mddev->kobj, NULL, "degraded");
return 0;
 -
+ abort:
+   if (mddev->flush_bio_pool) {
+   mempool_destroy(mddev->flush_bio_pool);
+   mddev->flush_bio_pool = NULL;
+   }
+   if (mddev->flush_pool){
+   mempool_destroy(mddev->flush_pool);
+   mddev->flush_pool = NULL;
+   }
 -  if (mddev->bio_set) {
 -  bioset_free(mddev->bio_set);
 -  mddev->bio_set = NULL;
 -  }
 -  if (mddev->sync_set) {
 -  bioset_free(mddev->sync_set);
 -  mddev->sync_set = NULL;
 -  }
+ 
+   return err;
  }
  EXPORT_SYMBOL_GPL(md_run);
  
@@@ -5864,8 -5933,22 +5914,16 @@@ void md_stop(struct mddev *mddev
 * This is called from dm-raid
 */
__md_stop(mddev);
+   if (mddev->flush_bio_pool) {
+   mempool_destroy(mddev->flush_bio_pool);
+   mddev->flush_bio_pool = NULL;
+   }
+   if (mddev->flush_pool) {
+   mempool_destroy(mddev->flush_pool);
+   


Re: [PATCH 3/8] md: raid5: use refcount_t for reference counting instead atomic_t

2018-05-23 Thread Shaohua Li
On Wed, May 23, 2018 at 07:49:04PM +0200, Peter Zijlstra wrote:
> On Wed, May 23, 2018 at 06:21:19AM -0700, Matthew Wilcox wrote:
> > On Wed, May 09, 2018 at 09:36:40PM +0200, Sebastian Andrzej Siewior wrote:
> > > refcount_t type and corresponding API should be used instead of atomic_t 
> > > when
> > > the variable is used as a reference counter. This allows to avoid 
> > > accidental
> > > refcounter overflows that might lead to use-after-free situations.
> > > 
> > > Most changes are 1:1 replacements except for
> > >   BUG_ON(atomic_inc_return(&sh->count) != 1);
> > > 
> > > which has been turned into
> > > refcount_inc(&sh->count);
> > > BUG_ON(refcount_read(&sh->count) != 1);
> > 
> > @@ -5387,7 +5387,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > sh->group = NULL;
> > }
> > list_del_init(&sh->lru);
> > -   BUG_ON(atomic_inc_return(&sh->count) != 1);
> > +   refcount_inc(&sh->count);
> > +   BUG_ON(refcount_read(&sh->count) != 1);
> > return sh;
> >  }
> > 
> > 
> > That's the only problematic usage.  And I think what it's really saying is:
> > 
> > BUG_ON(refcount_read(&sh->count) != 0);
> > refcount_set(&sh->count, 1);
> > 
> > With that, this looks like a reasonable use of refcount_t to me.
> 
> I'm not so sure, look at:
> 
>   r5c_do_reclaim():
> 
>   if (!list_empty(&sh->lru) &&
>   !test_bit(STRIPE_HANDLE, &sh->state) &&
>   atomic_read(&sh->count) == 0) {
> r5c_flush_stripe(conf, sh)
> 
> Which does:
> 
>   r5c_flush_stripe():
> 
>   atomic_inc(&sh->count);
> 
> Which is another inc-from-zero. Also, having sh's with count==0 in a
> list is counter to the concept of refcounts and smells like usage-counts
> to me. For refcount 0 really means deads and gone.
> 
> If this really is supposed to be a refcount, someone more familiar with
> the raid5 should do the patch and write a comprehensive changelog on it.

I don't know what has changed in refcount_t, but such a raid5 conversion has
been attempted before and didn't work. 0 for the stripe count is a valid
usage, and we do inc-from-zero in several places.
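
(Background on why a 1:1 conversion trips here: refcount_inc() refuses to
increment from zero -- it WARNs about a suspected use-after-free and skips
the increment -- whereas atomic_inc() happily goes 0 -> 1, which is exactly
the pattern raid5 relies on. A minimal illustration, not kernel code:)

        atomic_t   a = ATOMIC_INIT(0);
        refcount_t r = REFCOUNT_INIT(0);

        atomic_inc(&a);         /* fine: a goes 0 -> 1 */
        refcount_inc(&r);       /* WARNs about increment-on-zero
                                 * (treated as use-after-free) and
                                 * leaves r at 0 -- no increment */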

Thanks,
Shaohua


Re: [PATCH] md: zero the original position of sb for 0.90 and 1.0

2018-05-17 Thread Shaohua Li
On Wed, May 16, 2018 at 05:18:39PM +0800, Jianchao Wang wrote:
> For sb versions 0.90 and 1.0, which are located after the data, when we
> increase the spindle volume size and grow the raid array size, the older
> sb, which differs between spindles, will be left there. Due to this
> leftover sb, the spindle volume cannot be grown with a zero-filled part
> using --assume-clean; applications have to take workarounds such as
> sync_min/max to avoid a resync.
> 
> It is easy and convenient for md to zero the older sb position after
> writing the sb to its new position, so do it here.

From the description, this looks easy to do in userspace, e.g. you can zero
the old super before increasing the spindle volume. Convenience isn't a
reason we should put this code into kernel space.
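
(A minimal sketch of that userspace alternative -- overwrite the old
superblock location before growing the device. The offset calculation
mirrors the v1.0 placement computed in the quoted patch, i.e. 8 sectors
from the end of the old device size, aligned down to an 8-sector
boundary; treat the exact offset and the 4 KiB wipe size as assumptions
to verify for your metadata version:)

        /* zero_old_sb.c -- illustration only */
        #include <fcntl.h>
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>

        int zero_old_sb(const char *dev, uint64_t old_size_bytes)
        {
                char zeros[4096];
                /* v1.0: sb_start = (size >> 9) - 8*2, aligned down to 8 sectors */
                uint64_t sb_sector = ((old_size_bytes >> 9) - 8 * 2) &
                                     ~(uint64_t)(4 * 2 - 1);
                int fd = open(dev, O_WRONLY);

                if (fd < 0)
                        return -1;
                memset(zeros, 0, sizeof(zeros));
                /* wipe the stale superblock area and flush it to the media */
                if (pwrite(fd, zeros, sizeof(zeros), sb_sector << 9) !=
                    (ssize_t)sizeof(zeros) || fsync(fd) != 0) {
                        close(fd);
                        return -1;
                }
                return close(fd);
        }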

Thanks,
Shaohua 
> Cc: robert.but...@oracle.com
> Signed-off-by: Jianchao Wang 
> ---
>  drivers/md/md.c | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c208c01..a042add 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1020,6 +1020,23 @@ int md_check_no_bitmap(struct mddev *mddev)
>  }
>  EXPORT_SYMBOL(md_check_no_bitmap);
>  
> +static void md_zero_original_sb_position(struct md_rdev *rdev, sector_t 
> old_pos)
> +{
> + struct page *zero_pg;
> +
> + zero_pg = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
> + if (!zero_pg) {
> + pr_warn("%s: failed to get page for zero original sb position",
> + mdname(rdev->mddev));
> + return;
> + }
> + md_super_write(rdev->mddev, rdev, old_pos, rdev->sb_size,
> +zero_pg);
> + wait_event(rdev->mddev->sb_wait,
> + (atomic_read(&rdev->mddev->pending_writes) == 0));
> + __free_pages(zero_pg, 0);
> +}
> +
>  /*
>   * load_super for 0.90.0
>   */
> @@ -1394,6 +1411,8 @@ static void super_90_sync(struct mddev *mddev, struct 
> md_rdev *rdev)
>  static unsigned long long
>  super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
>  {
> + sector_t old_pos = rdev->sb_start;
> +
>   if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
>   return 0; /* component must fit device */
>   if (rdev->mddev->bitmap_info.offset)
> @@ -1411,6 +1430,8 @@ super_90_rdev_size_change(struct md_rdev *rdev, 
> sector_t num_sectors)
>   md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
>  rdev->sb_page);
>   } while (md_super_wait(rdev->mddev) < 0);
> +
> + md_zero_original_sb_position(rdev, old_pos);
>   return num_sectors;
>  }
>  
> @@ -1953,6 +1974,8 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t 
> num_sectors)
>  {
>   struct mdp_superblock_1 *sb;
>   sector_t max_sectors;
> + sector_t old_pos = 0;
> +
>   if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
>   return 0; /* component must fit device */
>   if (rdev->data_offset != rdev->new_data_offset)
> @@ -1969,6 +1992,7 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t 
> num_sectors)
>   } else {
>   /* minor version 0; superblock after data */
>   sector_t sb_start;
> + old_pos = rdev->sb_start;
>   sb_start = (i_size_read(rdev->bdev->bd_inode) >> 9) - 8*2;
>   sb_start &= ~(sector_t)(4*2 - 1);
>   max_sectors = rdev->sectors + sb_start - rdev->sb_start;
> @@ -1984,6 +2008,9 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t 
> num_sectors)
>   md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
>  rdev->sb_page);
>   } while (md_super_wait(rdev->mddev) < 0);
> +
> + if (old_pos)
> + md_zero_original_sb_position(rdev, old_pos);
>   return num_sectors;
>  
>  }
> -- 
> 2.7.4
> 


Re: [PATCH] md/raid1: add error handling of read error from FailFast device

2018-05-14 Thread Shaohua Li
On Wed, May 02, 2018 at 01:08:11PM +0200, Gioh Kim wrote:
> The current handle_read_error() calls fix_read_error()
> only if the md device is RW and the rdev does not have the FailFast
> flag set. It does not handle a read error from a RW device that has
> the FailFast flag.
> 
> I am not sure whether this is intended, but I found that a write IO
> error sets the rdev faulty. The md module should handle read IO errors
> and write IO errors equally, so I think a read IO error should set the
> rdev faulty as well.
> 
> Signed-off-by: Gioh Kim 
> ---
>  drivers/md/raid1.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index e9e3308cb0a7..4445179aa4c8 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2474,6 +2474,8 @@ static void handle_read_error(struct r1conf *conf, 
> struct r1bio *r1_bio)
>   fix_read_error(conf, r1_bio->read_disk,
>  r1_bio->sector, r1_bio->sectors);
>   unfreeze_array(conf);
> + } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
> + md_error(mddev, rdev);
>   } else {
>   r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>   }
> -- 
> 2.14.1

Looks reasonable, applied! 


[GIT PULL] MD update for 4.17-rc1

2018-04-19 Thread Shaohua Li
Hi,
3 small fixes for MD:
- md-cluster fix for faulty device from Guoqing
- writehint fix for writebehind IO for raid1 from Mariusz
- a livelock fix for interrupted recovery from Yufen
Please pull!


The following changes since commit f8cf2f16a7c95acce497bfafa90e7c6d8397d653:

  Merge branch 'next-integrity' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 
(2018-04-07 16:53:59 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.17-rc1

for you to fetch changes up to dba40d46ebf41e3f7ac9480609529bb6037a918d:

  raid1: copy write hint from master bio to behind bio (2018-04-09 08:54:34 
-0700)


Guoqing Jiang (1):
  md-cluster: don't update recovery_offset for faulty device

Mariusz Dabrowski (1):
  raid1: copy write hint from master bio to behind bio

Yufen Yu (1):
  md/raid1: exit sync request if MD_RECOVERY_INTR is set

 drivers/md/md.c|  6 --
 drivers/md/raid1.c | 25 -
 2 files changed, 24 insertions(+), 7 deletions(-)


[GIT PULL] MD update for 4.16-rc3

2018-03-01 Thread Shaohua Li
Hi,
A few bug fixes for MD, please pull:
- Fix raid5-ppl flush request handling hang from Artur
- Fix a potential deadlock in raid5/10 reshape from BingJing
- Fix a deadlock for dm-raid from Heinz
- Fix two md-cluster of raid10 from Lidong and Guoqing
- Fix a NULL dereference problem in device removal from Neil
- Fix a NULL dereference problem in raid1/raid10 under a specific condition from Yufen
- Other cleanups and fixes

Thanks,
Shaohua

The following changes since commit c786427f57b6dc4f56f9a84da52b41216e94f125:

  Merge tag 'for-linus-20180217' of git://git.kernel.dk/linux-block (2018-02-17 
10:20:47 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to 3de59bb9d551428cbdc76a9ea57883f82e350b4d:

  md/raid1: fix NULL pointer dereference (2018-02-25 10:44:39 -0800)


Aliaksei Karaliou (1):
  md/raid5: simplify uninitialization of shrinker

Arnd Bergmann (1):
  md: raid5: avoid string overflow warning

Artur Paszkiewicz (1):
  raid5-ppl: fix handling flush requests

BingJing Chang (1):
  md: fix a potential deadlock of raid5/raid10 reshape

Guoqing Jiang (1):
  raid10: change the size of resync window for clustered raid

Heinz Mauelshagen (1):
  md: fix md_write_start() deadlock w/o metadata devices

Lidong Zhong (1):
  md-cluster: choose correct label when clustered layout is not supported

Luis de Bethencourt (1):
  md/raid1: Fix trailing semicolon

Markus Elfring (1):
  md-multipath: Use seq_putc() in multipath_status()

NeilBrown (2):
  md: document lifetime of internal rdev pointer.
  md: only allow remove_and_add_spares when no sync_thread running.

Xiao Ni (1):
  MD: Free bioset when md_run fails

Yufen Yu (2):
  md raid10: fix NULL deference in handle_write_completed()
  md/raid1: fix NULL pointer dereference

 drivers/md/md-multipath.c |  2 +-
 drivers/md/md.c   | 53 ++-
 drivers/md/md.h   |  2 ++
 drivers/md/raid1.c| 13 +++-
 drivers/md/raid1.h| 12 +++
 drivers/md/raid10.c   | 18 +++-
 drivers/md/raid10.h   | 13 
 drivers/md/raid5-log.h|  3 ++-
 drivers/md/raid5-ppl.c| 10 +
 drivers/md/raid5.c| 19 ++---
 drivers/md/raid5.h| 12 +++
 11 files changed, 125 insertions(+), 32 deletions(-)


Re: [PATCH] md: raid5: avoid string overflow warning

2018-02-21 Thread Shaohua Li
On Tue, Feb 20, 2018 at 02:09:11PM +0100, Arnd Bergmann wrote:
> gcc warns about a possible overflow of the kmem_cache string, when adding
> four characters to a string of the same length:
> 
> drivers/md/raid5.c: In function 'setup_conf':
> drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a 
> region of size between 1 and 32 [-Werror=format-overflow=]
>   sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
>   ^~~~
> drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into 
> a destination of size 32
>   sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
>   ^~~
> 
> If I'm counting correctly, we need 11 characters for the fixed part
> of the string and 18 characters for a 64-bit pointer (when no gendisk
> is used), so that leaves three characters for conf->level, which should
> always be sufficient.
> 
> This makes the code use snprintf() with the correct length, to
> make the code more robust against changes, and to get the compiler
> to shut up.
> 
> In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
> kmem_cache when mddev has no gendisk") from 2010, Neil said that
> the pointer could be removed "shortly" once devices without gendisk
> are disallowed. I have no idea if that happened, but if it did, that
> should probably be changed as well.
> 
> Signed-off-by: Arnd Bergmann 

Applied, thanks!
> ---
>  drivers/md/raid5.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 50d01144b805..7ef368061424 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2196,15 +2196,16 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t 
> gfp)
>  static int grow_stripes(struct r5conf *conf, int num)
>  {
>   struct kmem_cache *sc;
> + size_t namelen = sizeof(conf->cache_name[0]);
>   int devs = max(conf->raid_disks, conf->previous_raid_disks);
>  
>   if (conf->mddev->gendisk)
> - sprintf(conf->cache_name[0],
> + snprintf(conf->cache_name[0], namelen,
>   "raid%d-%s", conf->level, mdname(conf->mddev));
>   else
> - sprintf(conf->cache_name[0],
> + snprintf(conf->cache_name[0], namelen,
>   "raid%d-%p", conf->level, conf->mddev);
> - sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
> + snprintf(conf->cache_name[1], namelen, "%.27s-alt", 
> conf->cache_name[0]);
>  
>   conf->active_name = 0;
>   sc = kmem_cache_create(conf->cache_name[conf->active_name],
> -- 
> 2.9.0
> 
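
The truncation arithmetic can be checked in user space. An illustrative
stand-alone program (not kernel code; the 456 stands in for conf->level,
and the destination size matches the 32-byte cache_name in the patch):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[2][32];

	/* worst case: no gendisk, so a 64-bit pointer is printed with %p */
	snprintf(name[0], sizeof(name[0]), "raid%d-%p", 456, (void *)name);
	/* "%.27s" copies at most 27 bytes, so 27 + strlen("-alt") + 1 NUL
	 * byte always fits the 32-byte destination */
	snprintf(name[1], sizeof(name[1]), "%.27s-alt", name[0]);
	printf("%zu: %s\n", strlen(name[1]), name[1]);	/* length <= 31 */
	return 0;
}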


Re: [PATCH] md: fix md_write_start() deadlock w/o metadata devices

2018-02-18 Thread Shaohua Li
On Fri, Feb 02, 2018 at 11:13:19PM +0100, Heinz Mauelshagen wrote:
> If no metadata devices are configured on raid1/4/5/6/10
> (e.g. via dm-raid), md_write_start() unconditionally waits
> for superblocks to be written thus deadlocking.
> 
> Fix introduces mddev->has_superblocks bool, defines it in md_run()
> and checks for it in md_write_start() to conditionally avoid waiting.
> 
> Once on it, check for non-existing superblocks in md_super_write().
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
> Fixes: cc27b0c78c796 ("md: fix deadlock between mddev_suspend() and 
> md_write_start()")

Applied, thanks! 
> Signed-off-by: Heinz Mauelshagen 
> ---
>  drivers/md/md.c | 10 ++
>  drivers/md/md.h |  2 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0081ace39a64..8a7e7034962c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -801,6 +801,9 @@ void md_super_write(struct mddev *mddev, struct md_rdev 
> *rdev,
>   struct bio *bio;
>   int ff = 0;
>  
> + if (!page)
> + return;
> +
>   if (test_bit(Faulty, &rdev->flags))
>   return;
>  
> @@ -5452,6 +5455,7 @@ int md_run(struct mddev *mddev)
>* the only valid external interface is through the md
>* device.
>*/
> + mddev->has_superblocks = false;
>   rdev_for_each(rdev, mddev) {
>   if (test_bit(Faulty, &rdev->flags))
>   continue;
> @@ -5465,6 +5469,9 @@ int md_run(struct mddev *mddev)
>   set_disk_ro(mddev->gendisk, 1);
>   }
>  
> + if (rdev->sb_page)
> + mddev->has_superblocks = true;
> +
>   /* perform some consistency tests on the device.
>* We don't want the data to overlap the metadata,
>* Internal Bitmap issues have been handled elsewhere.
> @@ -8049,6 +8056,7 @@ EXPORT_SYMBOL(md_done_sync);
>  bool md_write_start(struct mddev *mddev, struct bio *bi)
>  {
>   int did_change = 0;
> +
>   if (bio_data_dir(bi) != WRITE)
>   return true;
>  
> @@ -8081,6 +8089,8 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
>   rcu_read_unlock();
>   if (did_change)
>   sysfs_notify_dirent_safe(mddev->sysfs_state);
> + if (!mddev->has_superblocks)
> + return true;
>   wait_event(mddev->sb_wait,
>  !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
>  mddev->suspended);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 58cd20a5e85e..fbc925cce810 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -468,6 +468,8 @@ struct mddev {
>   void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
>   struct md_cluster_info  *cluster_info;
>   unsigned intgood_device_nr; /* good device num 
> within cluster raid */
> +
> + boolhas_superblocks:1;
>  };
>  
>  enum recovery_flags {
> -- 
> 2.14.3
> 
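
The failure mode in miniature: the first write marks the array dirty and
then sleeps on sb_wait until a superblock write clears the pending flag;
with no metadata devices that write never happens. A stand-alone
user-space model of the control flow (names are simplified stand-ins for
the kernel ones, not kernel code):

#include <stdbool.h>
#include <stdio.h>

struct mddev_model {
	bool sb_change_pending;		/* MD_SB_CHANGE_PENDING */
	bool has_superblocks;		/* flag added by this patch */
};

static void md_write_start_model(struct mddev_model *m)
{
	m->sb_change_pending = true;	/* first write marks array dirty */
	if (!m->has_superblocks)
		return;			/* the fix: nothing will ever clear it */
	while (m->sb_change_pending)
		;			/* without the fix: waits forever */
}

int main(void)
{
	struct mddev_model m = { .has_superblocks = false };

	md_write_start_model(&m);	/* returns only because of the fix */
	printf("write proceeds\n");
	return 0;
}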


Re: [PATCH v2] md-multipath: Use seq_putc() in multipath_status()

2018-02-17 Thread Shaohua Li
On Sat, Jan 13, 2018 at 09:55:08AM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sat, 13 Jan 2018 09:49:03 +0100
> 
> A single character (closing square bracket) should be put into a sequence.
> Thus use the corresponding function "seq_putc".
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
applied, thanks
> ---
>  drivers/md/md-multipath.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md-multipath.c b/drivers/md/md-multipath.c
> index e40065bdbfc8..0a7e99d62c69 100644
> --- a/drivers/md/md-multipath.c
> +++ b/drivers/md/md-multipath.c
> @@ -157,7 +157,7 @@ static void multipath_status(struct seq_file *seq, struct 
> mddev *mddev)
>   seq_printf (seq, "%s", rdev && test_bit(In_sync, &rdev->flags) 
> ? "U" : "_");
>   }
>   rcu_read_unlock();
> - seq_printf (seq, "]");
> + seq_putc(seq, ']');
>  }
>  
>  static int multipath_congested(struct mddev *mddev, int bits)
> -- 
> 2.15.1
> 


Re: [PATCH] md/raid1: Fix trailing semicolon

2018-02-17 Thread Shaohua Li
On Wed, Jan 17, 2018 at 01:38:02PM +, Luis de Bethencourt wrote:
> The trailing semicolon is an empty statement that does no operation.
> Removing it since it doesn't do anything.
> 
> Signed-off-by: Luis de Bethencourt 
> ---
> 
> Hi,
> 
> After fixing the same thing in drivers/staging/rtl8723bs/, Joe Perches
> suggested I fix it treewide [0].

applied, thanks!
> 
> Best regards,
> Luis
> 
> 
> [0] 
> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-January/115410.html
> [1] 
> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-January/115390.html
> 
>  drivers/md/raid1.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index b2eae332e1a2..f978eddc7a21 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1108,7 +1108,7 @@ static void alloc_behind_master_bio(struct r1bio 
> *r1_bio,
>  
>   bio_copy_data(behind_bio, bio);
>  skip_copy:
> - r1_bio->behind_master_bio = behind_bio;;
> + r1_bio->behind_master_bio = behind_bio;
>   set_bit(R1BIO_BehindIO, &r1_bio->state);
>  
>   return;
> -- 
> 2.15.1
> 


[GIT PULL] MD update for 4.16-rc1

2018-01-30 Thread Shaohua Li
Hi,
Some small fixes for MD 4.16:
- Fix potential raid5-cache problems if the raid5 cache isn't fully recovered
- Fix a wait-within-wait warning in raid1/10
- Make raid5-PPL support disks with writeback cache enabled
Please pull!

Thanks,
Shaohua

The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:

  Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to 1532d9e87e8b2377f12929f9e40724d5fbe6ecc5:

  raid5-ppl: PPL support for disks with write-back cache enabled (2018-01-15 
14:29:42 -0800)


NeilBrown (1):
  md/raid1,raid10: silence warning about wait-within-wait

Song Liu (2):
  md: introduce new personality funciton start()
  md/r5cache: print more info of log recovery

Tomasz Majchrzak (1):
  raid5-ppl: PPL support for disks with write-back cache enabled

 Documentation/md/raid5-ppl.txt |   7 +-
 drivers/md/dm-raid.c   |   9 +++
 drivers/md/md.c|  31 ++--
 drivers/md/md.h|   9 +++
 drivers/md/raid1.c |  11 +++
 drivers/md/raid10.c|  12 +++
 drivers/md/raid5-cache.c   |  31 +---
 drivers/md/raid5-log.h |  30 
 drivers/md/raid5-ppl.c | 167 ++---
 drivers/md/raid5.c |  16 +++-
 10 files changed, 285 insertions(+), 38 deletions(-)


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-19 Thread Shaohua Li
On Tue, Dec 19, 2017 at 10:17:43AM -0600, Bruno Wolff III wrote:
> On Sun, Dec 17, 2017 at 21:43:50 +0800,
>  weiping zhang  wrote:
> > Hi, thanks for testing, I think you first reproduce this issue(got WARNING
> > at device_add_disk) by your own build, then add my debug patch.
> 
> The problem is still in rc4. Reverting the commit still fixes the problem. I
> tested that warning level messages should appear using lkdtm. While there
> could be something weird relating to the WARN_ON macro, more likely there is
> something different about the boots with the kernels I build (the exact way
> initramfs is built is probably different) and probably that (WARN_ON) code
> is not getting executed.

Not sure if this is MD related, but could you please check if this debug patch
changes anything?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e4dee0..c365179 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -518,7 +518,6 @@ static void mddev_put(struct mddev *mddev)
mddev->ctime == 0 && !mddev->hold_active) {
/* Array is not configured at all, and not held active,
 * so destroy it */
-   list_del_init(&mddev->all_mddevs);
bs = mddev->bio_set;
sync_bs = mddev->sync_set;
mddev->bio_set = NULL;
@@ -5210,6 +5209,10 @@ static void md_free(struct kobject *ko)
}
percpu_ref_exit(&mddev->writes_pending);
 
+   spin_lock(&all_mddevs_lock);
+   list_del_init(&mddev->all_mddevs);
+   spin_unlock(&all_mddevs_lock);
+
kfree(mddev);
 }
 


[GIT PULL] MD update for 4.15-rc2

2017-12-07 Thread Shaohua Li
Hi,

Please pull some MD fixes. The notable one is a raid5-cache deadlock bug with
dm-raid; the others are not significant.

Thanks,
Shaohua

The following changes since commit a0651c7fa2c088a605f63792279859608ed7f2c8:

  Merge tag 'powerpc-4.15-3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux (2017-12-01 
08:40:17 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.15-rc2

for you to fetch changes up to 18022a1bd3709b74ca31ef0b28fccd52bcd6c504:

  md/raid1/10: add missed blk plug (2017-12-01 12:19:48 -0800)


Nate Dailey (1):
  md: limit mdstat resync progress to max_sectors

Shaohua Li (1):
  md/raid1/10: add missed blk plug

Song Liu (1):
  md/r5cache: move mddev_lock() out of r5c_journal_mode_set()

bingjingc (1):
  md/raid5: correct degraded calculation in raid5_error

 drivers/md/md.c  |  4 +++-
 drivers/md/raid1.c   |  4 
 drivers/md/raid10.c  |  4 
 drivers/md/raid5-cache.c | 22 +-
 drivers/md/raid5.c   |  2 +-
 5 files changed, 21 insertions(+), 15 deletions(-)


Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-11-17 Thread Shaohua Li
On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li <s...@kernel.org> wrote:
> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> >> Allows configuration additional bytes or ios before a throttle is
> >> triggered.
> >>
> >> This allows implementation of a bucket style rate-limit/throttle on a
> >> block device. Previously, bursting to a device was limited to allowance
> >> granted in a single throtl_slice (similar to a bucket with limit N and
> >> refill rate N/slice).
> >>
> >> Additional parameters bytes/io_burst_conf defined for tg, which define a
> >> number of bytes/ios that must be depleted before throttling happens. A
> >> tg that does not deplete this allowance functions as though it has no
> >> configured limits. tgs earn additional allowance at rate defined by
> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> >> kicks in. If a tg is idle for a while, it will again have some burst
> >> allowance before it gets throttled again.
> >>
> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> >> when all "used" burst allowance would be earned back. trim_slice still
> >> does progress slice_start as before and decrements *_disp as before, and
> >> tgs continue to get bytes/ios in throtl_slice intervals.
> >
> > Can you describe why we need this? It would be great if you can describe the
> > usage model and an example. Does this work for io.low/io.max or both?
> >
> > Thanks,
> > Shaohua
> >
> 
> Use case that brought this up was configuring limits for a remote
> shared device. Bursting beyond io.max is desired but only for so much
> before the limit kicks in, afterwards with sustained usage throughput
> is capped. (This proactively avoids remote-side limits). In that case
> one would configure in a root container io.max + io.burst, and
> configure low/other limits on descendants sharing the resource on the
> same node.
> 
> With this patch, so long as tg has not dispatched more than the burst,
> no limit is applied at all by that tg, including limit imposed by
> io.low in tg_iops_limit, etc.

I'd appreciate it if you could give more details about the 'why'; 'configuring
limits for a remote shared device' doesn't justify the change on its own.


Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-11-16 Thread Shaohua Li
On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> Allows configuration additional bytes or ios before a throttle is
> triggered.
> 
> This allows implementation of a bucket style rate-limit/throttle on a
> block device. Previously, bursting to a device was limited to allowance
> granted in a single throtl_slice (similar to a bucket with limit N and
> refill rate N/slice).
> 
> Additional parameters bytes/io_burst_conf defined for tg, which define a
> number of bytes/ios that must be depleted before throttling happens. A
> tg that does not deplete this allowance functions as though it has no
> configured limits. tgs earn additional allowance at rate defined by
> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> kicks in. If a tg is idle for a while, it will again have some burst
> allowance before it gets throttled again.
> 
> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> when all "used" burst allowance would be earned back. trim_slice still
> does progress slice_start as before and decrements *_disp as before, and
> tgs continue to get bytes/ios in throtl_slice intervals.

Can you describe why we need this? It would be great if you can describe the
usage model and an example. Does this work for io.low/io.max or both?

Thanks,
Shaohua
 
> Signed-off-by: Khazhismel Kumykov 
> ---
>  block/Kconfig|  11 +++
>  block/blk-throttle.c | 192 
> +++
>  2 files changed, 189 insertions(+), 14 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 28ec55752b68..fbd05b419f93 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -128,6 +128,17 @@ config BLK_DEV_THROTTLING_LOW
>  
>   Note, this is an experimental interface and could be changed someday.
>  
> +config BLK_DEV_THROTTLING_BURST
> +bool "Block throttling .burst allowance interface"
> +depends on BLK_DEV_THROTTLING
> +default n
> +---help---
> +Add .burst allowance for block throttling. Burst allowance allows for
> +additional unthrottled usage, while still limiting speed for sustained
> +usage.
> +
> +If in doubt, say N.
> +
>  config BLK_CMDLINE_PARSER
>   bool "Block device command line partition parser"
>   default n
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 96ad32623427..27c084312772 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -157,6 +157,11 @@ struct throtl_grp {
>   /* Number of bio's dispatched in current slice */
>   unsigned int io_disp[2];
>  
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + uint64_t bytes_burst_conf[2];
> + unsigned int io_burst_conf[2];
> +#endif
> +
>   unsigned long last_low_overflow_time[2];
>  
>   uint64_t last_bytes_disp[2];
> @@ -507,6 +512,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t 
> gfp, int node)
>   tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX;
>   tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX;
>   tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + tg->bytes_burst_conf[READ] = 0;
> + tg->bytes_burst_conf[WRITE] = 0;
> + tg->io_burst_conf[READ] = 0;
> + tg->io_burst_conf[WRITE] = 0;
> +#endif
>   /* LIMIT_LOW will have default value 0 */
>  
>   tg->latency_target = DFL_LATENCY_TARGET;
> @@ -800,6 +811,26 @@ static inline void throtl_start_new_slice(struct 
> throtl_grp *tg, bool rw)
>  tg->slice_end[rw], jiffies);
>  }
>  
> +/*
> + * When current slice should end.
> + *
> + * With CONFIG_BLK_DEV_THROTTLING_BURST, we will wait longer than min_wait
> + * for slice to recover used burst allowance. (*_disp -> 0). Setting 
> slice_end
> + * before this would result in tg receiving additional burst allowance.
> + */
> +static inline unsigned long throtl_slice_wait(struct throtl_grp *tg, bool rw,
> + unsigned long min_wait)
> +{
> + unsigned long bytes_wait = 0, io_wait = 0;
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + if (tg->bytes_burst_conf[rw])
> + bytes_wait = (tg->bytes_disp[rw] * HZ) / tg_bps_limit(tg, rw);
> + if (tg->io_burst_conf[rw])
> + io_wait = (tg->io_disp[rw] * HZ) / tg_iops_limit(tg, rw);
> +#endif
> + return max(min_wait, max(bytes_wait, io_wait));
> +}
> +
>  static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
>   unsigned long jiffy_end)
>  {
> @@ -849,7 +880,8 @@ static inline void throtl_trim_slice(struct throtl_grp 
> *tg, bool rw)
>* is bad because it does not allow new slice to start.
>*/
>  
> - throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
> + throtl_set_slice_end(tg, rw,
> + jiffies + throtl_slice_wait(tg, rw, tg->td->throtl_slice));
>  
>   time_elapsed = jiffies - tg->slice_start[rw];
>  
> @@ -889,7 +921,7 @@ static bool 

Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-11-16 Thread Shaohua Li
On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> Allows configuration additional bytes or ios before a throttle is
> triggered.
> 
> This allows implementation of a bucket style rate-limit/throttle on a
> block device. Previously, bursting to a device was limited to allowance
> granted in a single throtl_slice (similar to a bucket with limit N and
> refill rate N/slice).
> 
> Additional parameters bytes/io_burst_conf defined for tg, which define a
> number of bytes/ios that must be depleted before throttling happens. A
> tg that does not deplete this allowance functions as though it has no
> configured limits. tgs earn additional allowance at rate defined by
> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> kicks in. If a tg is idle for a while, it will again have some burst
> allowance before it gets throttled again.
> 
> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> when all "used" burst allowance would be earned back. trim_slice still
> does progress slice_start as before and decrements *_disp as before, and
> tgs continue to get bytes/ios in throtl_slice intervals.

Can you describe why we need this? It would be great if you can describe the
usage model and an example. Does this work for io.low/io.max or both?

Thanks,
Shaohua
 
> Signed-off-by: Khazhismel Kumykov 
> ---
>  block/Kconfig|  11 +++
>  block/blk-throttle.c | 192 
> +++
>  2 files changed, 189 insertions(+), 14 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 28ec55752b68..fbd05b419f93 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -128,6 +128,17 @@ config BLK_DEV_THROTTLING_LOW
>  
>   Note, this is an experimental interface and could be changed someday.
>  
> +config BLK_DEV_THROTTLING_BURST
> +bool "Block throttling .burst allowance interface"
> +depends on BLK_DEV_THROTTLING
> +default n
> +---help---
> +Add .burst allowance for block throttling. Burst allowance allows for
> +additional unthrottled usage, while still limiting speed for sustained
> +usage.
> +
> +If in doubt, say N.
> +
>  config BLK_CMDLINE_PARSER
>   bool "Block device command line partition parser"
>   default n
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 96ad32623427..27c084312772 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -157,6 +157,11 @@ struct throtl_grp {
>   /* Number of bio's dispatched in current slice */
>   unsigned int io_disp[2];
>  
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + uint64_t bytes_burst_conf[2];
> + unsigned int io_burst_conf[2];
> +#endif
> +
>   unsigned long last_low_overflow_time[2];
>  
>   uint64_t last_bytes_disp[2];
> @@ -507,6 +512,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t 
> gfp, int node)
>   tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX;
>   tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX;
>   tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + tg->bytes_burst_conf[READ] = 0;
> + tg->bytes_burst_conf[WRITE] = 0;
> + tg->io_burst_conf[READ] = 0;
> + tg->io_burst_conf[WRITE] = 0;
> +#endif
>   /* LIMIT_LOW will have default value 0 */
>  
>   tg->latency_target = DFL_LATENCY_TARGET;
> @@ -800,6 +811,26 @@ static inline void throtl_start_new_slice(struct 
> throtl_grp *tg, bool rw)
>  tg->slice_end[rw], jiffies);
>  }
>  
> +/*
> + * When current slice should end.
> + *
> + * With CONFIG_BLK_DEV_THROTTLING_BURST, we will wait longer than min_wait
> + * for slice to recover used burst allowance. (*_disp -> 0). Setting 
> slice_end
> + * before this would result in tg receiving additional burst allowance.
> + */
> +static inline unsigned long throtl_slice_wait(struct throtl_grp *tg, bool rw,
> + unsigned long min_wait)
> +{
> + unsigned long bytes_wait = 0, io_wait = 0;
> +#ifdef CONFIG_BLK_DEV_THROTTLING_BURST
> + if (tg->bytes_burst_conf[rw])
> + bytes_wait = (tg->bytes_disp[rw] * HZ) / tg_bps_limit(tg, rw);
> + if (tg->io_burst_conf[rw])
> + io_wait = (tg->io_disp[rw] * HZ) / tg_iops_limit(tg, rw);
> +#endif
> + return max(min_wait, max(bytes_wait, io_wait));
> +}
> +
>  static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
>   unsigned long jiffy_end)
>  {
> @@ -849,7 +880,8 @@ static inline void throtl_trim_slice(struct throtl_grp 
> *tg, bool rw)
>* is bad because it does not allow new slice to start.
>*/
>  
> - throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
> + throtl_set_slice_end(tg, rw,
> + jiffies + throtl_slice_wait(tg, rw, tg->td->throtl_slice));
>  
>   time_elapsed = jiffies - tg->slice_start[rw];
>  
> @@ -889,7 +921,7 @@ static bool tg_with_in_iops_limit(struct 

Re: [PATCH 6/7] blkcg: account requests instead of bios for request based request_queues

2017-11-14 Thread Shaohua Li
On Sun, Nov 12, 2017 at 02:26:12PM -0800, Tejun Heo wrote:
> blkcg accounting is currently bio based, which is silly for request
> based request_queues.  This is silly as the number of bios doesn't
> have much to do with the actual number of IOs issued to the underlying
> device (can be significantly higher or lower) and may change depending
> on the implementation details on how the bios are issued (e.g. from
> the recent split-bios-while-issuing change).
> 
> request based request_queues have QUEUE_FLAG_IO_STAT set by default
> which controls the gendisk accounting.  Do cgroup accounting for those
> request_queues together with gendisk accounting on request completion.
> 
> This makes cgroup accounting consistent with gendisk accounting and
> what's happening on the system.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>
> ---
>  block/blk-core.c   |  3 +++
>  include/linux/blk-cgroup.h | 18 +-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 048be4a..ad23b96 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct request *req, 
> unsigned int bytes)
>   cpu = part_stat_lock();
>   part = req->part;
>   part_stat_add(cpu, part, sectors[rw], bytes >> 9);
> + blkcg_account_io_completion(req, bytes);
>   part_stat_unlock();
>   }
>  }
> @@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request *req)
>   part_round_stats(req->q, cpu, part);
>   part_dec_in_flight(req->q, part, rw);
>  
> + blkcg_account_io_done(req);
> +
>   hd_struct_put(part);
>   part_stat_unlock();
>   }
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 96eed0f..f2f9691 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check(struct 
> request_queue *q,
>  
>   throtl = blk_throtl_bio(q, blkg, bio);
>  
> - if (!throtl) {
> + /* if @q does io stat, blkcg stats are updated together with them */
> + if (!blk_queue_io_stat(q) && !throtl) {

Reviewed-by: Shaohua Li <s...@kernel.org>

One nitpick: can we use q->request_fn to determine whether a queue is request
based? I think that would be a more reliable and more usual way to do this.

Thanks,
Shaohua
>   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
>   bio->bi_iter.bi_size);
>   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
> @@ -764,6 +765,17 @@ static inline void blk_rq_disassociate_blkg(struct 
> request *rq)
>   rq->blkg = NULL;
>  }
>  
> +static inline void blkcg_account_io_completion(struct request *rq,
> +unsigned int bytes)
> +{
> + blkg_rwstat_add(&rq->blkg->stat_bytes, rq_data_dir(rq), bytes);
> +}
> +
> +static inline void blkcg_account_io_done(struct request *rq)
> +{
> + blkg_rwstat_add(&rq->blkg->stat_ios, rq_data_dir(rq), 1);
> +}
> +
>  #else/* CONFIG_BLK_CGROUP */
>  
>  struct blkcg {
> @@ -823,6 +835,10 @@ static inline bool blkcg_bio_issue_check(struct 
> request_queue *q,
>  static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg 
> *blkcg) { }
>  static inline void blk_rq_disassociate_blkg(struct request *rq) { }
>  
> +static inline void blkcg_account_io_completion(struct request *rq,
> +unsigned int bytes) { }
> +static inline void blkcg_account_io_done(struct request *rq) { }
> +
>  #define blk_queue_for_each_rl(rl, q) \
>   for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
>  
> -- 
> 2.9.5
> 
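
The nitpick in code form would look roughly like this (a sketch only;
blk_queue_request_based is a made-up helper name, not an existing API):

#include <linux/blkdev.h>

static inline bool blk_queue_request_based(struct request_queue *q)
{
	/* only legacy request-based drivers set ->request_fn;
	 * bio-based drivers such as md leave it NULL */
	return q->request_fn != NULL;
}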


Re: [PATCH v2 3/7] blkcg: associate a request with its blkcg_gq instead of request_list

2017-11-14 Thread Shaohua Li
On Mon, Nov 13, 2017 at 12:15:23PM -0800, Tejun Heo wrote:
> From c856a199ec70e4022e997609f1b17b9106408777 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Mon, 13 Nov 2017 12:11:57 -0800
> 
> On the legacy request_queue, each blkcg_gq has a dedicated
> request_list that requests for the cgroup are allocated from.  Each
> request points to the originating request_list and cgroup membership
> can be determined by examining which cgroup the associated
> request_list belongs to.
> 
> This doesn't work for blk-mq which doesn't use request_list.  Let's
> associate each request with the blkcg_gq directly.  Except for minor
> wrapper changes, this doesn't affect the legacy request_queue and will
> allow blk-mq to track each request's cgroup association.
> 
> v2: Build fix for !CONFIG_BLK_CGROUP.
> 
> Signed-off-by: Tejun Heo 
> ---
>  block/blk-mq.c | 2 +-
>  include/linux/blk-cgroup.h | 7 +--
>  include/linux/blkdev.h | 2 +-
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 98a1860..af958c4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -306,7 +306,7 @@ static struct request *blk_mq_rq_ctx_init(struct 
> blk_mq_alloc_data *data,
>   rq->part = NULL;
>   rq->start_time = jiffies;
>  #ifdef CONFIG_BLK_CGROUP
> - rq->rl = NULL;
> + rq->blkg = NULL;
>   set_start_time_ns(rq);
>   rq->io_start_time_ns = 0;
>  #endif
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index c0d4736..47db75a 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -463,7 +463,7 @@ static inline void blk_put_rl(struct request_list *rl)
>   */
>  static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
>  {
> - rq->rl = rl;
> + rq->blkg = rl->blkg;
>  }

After patch 5, we don't really need the blk_put_rl, blk_get_rl, blk_rq_set_rl
functions. I'd like to delete them and only keep blk_rq_rl. The name
blk_rq_set_rl is confusing too.

Thanks,
Shaohua
>  /**
> @@ -474,7 +474,10 @@ static inline void blk_rq_set_rl(struct request *rq, 
> struct request_list *rl)
>   */
>  static inline struct request_list *blk_rq_rl(struct request *rq)
>  {
> - return rq->rl;
> + if (!rq->blkg->parent)
> + return &rq->q->root_rl;
> + else
> + return &rq->blkg->rl;
>  }
>  
>  struct request_list *__blk_queue_next_rl(struct request_list *rl,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8da6637..6ec1067 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -206,7 +206,7 @@ struct request {
>   unsigned long start_time;
>   struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
> - struct request_list *rl;/* rl this rq is alloced from */
> + struct blkcg_gq *blkg;  /* blkg of this rq */
>   unsigned long long start_time_ns;
>   unsigned long long io_start_time_ns;/* when passed to hardware */
>  #endif
> -- 
> 2.9.5
> 
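
Concretely, dropping the wrappers would leave a direct assignment plus the
one lookup helper, something like this (a sketch of the suggestion, not a
posted patch; the helper body is taken from the patch above):

/* where blk_rq_set_rl(rq, rl) used to be called: */
rq->blkg = rl->blkg;

/* and the only remaining helper: */
static inline struct request_list *blk_rq_rl(struct request *rq)
{
	if (!rq->blkg->parent)
		return &rq->q->root_rl;
	return &rq->blkg->rl;
}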


Re: [PATCH 4/7] blkcg: refactor blkcg_gq lookup and creation in blkcg_bio_issue_check()

2017-11-14 Thread Shaohua Li
On Sun, Nov 12, 2017 at 02:26:10PM -0800, Tejun Heo wrote:
> blkcg_bio_issue_check() open codes blkcg_gq lookup and creation using
> blkg_lookup() and blkg_lookup_create().  Refactor the code so that
> 
> * blkg_lookup_create() is renamed to blkcg_lookup_create_locked().
> 
> * blkg_lookup_create() is now a new function which encapsulates the
>   RCU protected lookup and queue_locked protected creation.
> 
> * blkg_lookup_create() is guaranteed to return a non-NULL blkcg_gq.
>   The NULL checks are removed from the users.
> 
> This is pure refactoring and doesn't introduce any functional changes.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>

Reviewed-by: Shaohua Li <s...@kernel.org>
> ---
>  block/blk-cgroup.c |  6 +++---
>  block/blk-throttle.c   |  2 +-
>  include/linux/blk-cgroup.h | 32 +---
>  3 files changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 60a4486..490b0a6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -290,7 +290,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  }
>  
>  /**
> - * blkg_lookup_create - lookup blkg, try to create one if not there
> + * blkg_lookup_create_locked - lookup blkg, try to create one if not there
>   * @blkcg: blkcg of interest
>   * @q: request_queue of interest
>   *
> @@ -303,8 +303,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>   * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
>   * dead and bypassing, returns ERR_PTR(-EBUSY).
>   */
> -struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> - struct request_queue *q)
> +struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
> +struct request_queue *q)
>  {
>   struct blkcg_gq *blkg;
>  
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8631763..1e6916b 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2123,7 +2123,7 @@ bool blk_throtl_bio(struct request_queue *q, struct 
> blkcg_gq *blkg,
>   struct bio *bio)
>  {
>   struct throtl_qnode *qn = NULL;
> - struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
> + struct throtl_grp *tg = blkg_to_tg(blkg);
>   struct throtl_service_queue *sq;
>   bool rw = bio_data_dir(bio);
>   bool throttled = false;
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index b5721d32..1de5158 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -172,8 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
>  
>  struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
> struct request_queue *q, bool 
> update_hint);
> -struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> - struct request_queue *q);
> +struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
> +struct request_queue *q);
>  int blkcg_init_queue(struct request_queue *q);
>  void blkcg_drain_queue(struct request_queue *q);
>  void blkcg_exit_queue(struct request_queue *q);
> @@ -680,6 +680,24 @@ static inline bool blk_throtl_bio(struct request_queue 
> *q, struct blkcg_gq *blkg
> struct bio *bio) { return false; }
>  #endif
>  
> +static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> +   struct request_queue *q)
> +{
> + struct blkcg_gq *blkg;
> +
> + blkg = blkg_lookup(blkcg, q);
> + if (likely(blkg))
> + return blkg;
> +
> + spin_lock_irq(q->queue_lock);
> + blkg = blkg_lookup_create_locked(blkcg, q);
> + spin_unlock_irq(q->queue_lock);
> + if (likely(!IS_ERR(blkg)))
> + return blkg;
> + else
> + return q->root_blkg;
> +}
> +
>  static inline bool blkcg_bio_issue_check(struct request_queue *q,
>struct bio *bio)
>  {
> @@ -693,19 +711,11 @@ static inline bool blkcg_bio_issue_check(struct 
> request_queue *q,
>   /* associate blkcg if bio hasn't attached one */
>   bio_associate_blkcg(bio, &blkcg->css);
>  
> - blkg = blkg_lookup(blkcg, q);
> - if (unlikely(!blkg)) {
> - spin_lock_irq(q->queue_lock);
> - blkg = blkg_lookup_create(blkcg, q);
> - if (IS_ERR(blkg))
> - blkg = NULL;
> - spin_unlock_irq(q->queue_lock);
> - }
> + blkg = blkg_lookup_create(blkcg, q);
>  
>   throtl = blk_throtl_bio(q, blkg, bio);
>  
>   if (!throtl) {
> - blkg = blkg ?: q->root_blkg;
>   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
>   bio->bi_iter.bi_size);
>   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
> -- 
> 2.9.5
> 
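
The refactored helper follows the classic fast-path/slow-path split: an
unlocked (RCU-protected) lookup first, and the queue lock is taken only
when the blkg must be created. The same shape as a stand-alone user-space
analogue (a mutex stands in for q->queue_lock, a plain array for the blkg
lookup; the unlocked read models the RCU fast path):

#include <pthread.h>
#include <stdio.h>

#define NKEYS 16
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int present[NKEYS];		/* 0 = absent, 1 = created */

static int lookup_create(int key)
{
	if (present[key])		/* fast path: no lock taken */
		return key;

	pthread_mutex_lock(&table_lock);
	if (!present[key])		/* re-check under the lock */
		present[key] = 1;
	pthread_mutex_unlock(&table_lock);
	return key;
}

int main(void)
{
	printf("created %d\n", lookup_create(3));
	return 0;
}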


Re: [PATCH 2/7] blkcg: use percpu_ref for blkcg_gq->refcnt

2017-11-14 Thread Shaohua Li
On Sun, Nov 12, 2017 at 02:26:08PM -0800, Tejun Heo wrote:
> blkcg_gq->refcnt is an atomic_t.  This was due to the following two
> reasons.
> 
> * When blkcg_gq was added, the percpu memory allocator didn't support
>   allocations from !GFP_KERNEL contexts.  Because blkcg_gq's are
>   created from IO issue paths, it couldn't use GFP_KERNEL allocations.
> 
> * A blkcg_gq represents the connection between a cgroup and a
>   request_queue.  Because an in-flight bio already pins both, blkcg_gq
>   didn't usually need explicit pinning, making the use of atomic_t
>   acceptable albeit restrictive and fragile.
> 
> Now that the percpu allocator supports !GFP_KERNEL allocations,
> there's no reason to keep using atomic_t refcnt.  This will allow
> clean separation between bio and request layers helping blkcg support
> in blk-mq.
> 
> Signed-off-by: Tejun Heo 

Reviewed-by: Shaohua Li 
> ---
>  block/blk-cgroup.c | 21 -
>  include/linux/blk-cgroup.h | 13 -
>  2 files changed, 20 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 6482be5..60a4486 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -78,6 +78,7 @@ static void blkg_free(struct blkcg_gq *blkg)
>  
>   blkg_rwstat_exit(&blkg->stat_ios);
>   blkg_rwstat_exit(&blkg->stat_bytes);
> + percpu_ref_exit(&blkg->refcnt);
>   kfree(blkg);
>  }
>  
> @@ -89,7 +90,7 @@ static void blkg_free(struct blkcg_gq *blkg)
>   * Having a reference to blkg under an rcu allows accesses to only values
>   * local to groups like group stats and group rate limits.
>   */
> -void __blkg_release_rcu(struct rcu_head *rcu_head)
> +static void blkg_release_rcu(struct rcu_head *rcu_head)
>  {
>   struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, 
> rcu_head);
>  
> @@ -102,7 +103,13 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
>  
>   blkg_free(blkg);
>  }
> -EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> +
> +static void blkg_release(struct percpu_ref *refcnt)
> +{
> + struct blkcg_gq *blkg = container_of(refcnt, struct blkcg_gq, refcnt);
> +
> + call_rcu(&blkg->rcu_head, blkg_release_rcu);
> +}
>  
>  /**
>   * blkg_alloc - allocate a blkg
> @@ -123,6 +130,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
> struct request_queue *q,
>   if (!blkg)
>   return NULL;
>  
> + if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) {
> + kfree(blkg);
> + return NULL;
> + }
> +
>   if (blkg_rwstat_init(&blkg->stat_bytes, gfp_mask) ||
>   blkg_rwstat_init(&blkg->stat_ios, gfp_mask))
>   goto err_free;
> @@ -130,7 +142,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
> struct request_queue *q,
>   blkg->q = q;
>   INIT_LIST_HEAD(&blkg->q_node);
>   blkg->blkcg = blkcg;
> - atomic_set(&blkg->refcnt, 1);
>  
>   /* root blkg uses @q->root_rl, init rl only for !root blkgs */
>   if (blkcg != &blkcg_root) {
> @@ -266,7 +277,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>   return blkg;
>  
>   /* @blkg failed fully initialized, use the usual release path */
> - blkg_put(blkg);
> + percpu_ref_kill(&blkg->refcnt);
>   return ERR_PTR(ret);
>  
>  err_put_congested:
> @@ -373,7 +384,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>* Put the reference taken at the time of creation so that when all
>* queues are gone, group can be destroyed.
>*/
> - blkg_put(blkg);
> + percpu_ref_kill(&blkg->refcnt);
>  }
>  
>  /**
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 8bbc371..c0d4736 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -19,7 +19,7 @@
>  #include 
>  #include 
>  #include 
> -#include <linux/atomic.h>
> +#include <linux/percpu-refcount.h>
>  
>  /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
>  #define BLKG_STAT_CPU_BATCH  (INT_MAX / 2)
> @@ -123,7 +123,7 @@ struct blkcg_gq {
>   struct request_list rl;
>  
>   /* reference count */
> - atomic_t refcnt;
> + struct percpu_ref   refcnt;
>  
>   /* is this blkg online? protected by both blkcg and q locks */
>   boolonline;
> @@ -355,21 +355,16 @@ static inline int blkg_path(struct blkcg_gq *blkg, char 
> *buf, int buflen)
>   */
>  static inline void blkg_get(struct blkcg_gq *blkg)
>  {
> - WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> - atomic_inc(&blkg->refcnt);
> + percpu_ref_get(&blkg->refcnt);
>  }
>  
> -void __blkg_release_rcu(struct rcu_head *rcu);
> -
>  /**
>   * blkg_put - put a blkg reference
>   * @blkg: blkg to put
>   */
>  static inline void blkg_put(struct blkcg_gq *blkg)
>  {
> - WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> - if (atomic_dec_and_test(&blkg->refcnt))
> - call_rcu(&blkg->rcu_head, __blkg_release_rcu);
> + percpu_ref_put(&blkg->refcnt);
>  }
>  
>  /**
> -- 
> 2.9.5
> 
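
A minimal userspace model of the refcount lifecycle the patch above switches
to. The names (pref_init, pref_get, pref_put, pref_kill) are invented
stand-ins for the percpu_ref API and mimic only its semantics (init takes a
base reference, kill drops that base reference, the release callback fires
when the last reference is gone), not the per-CPU fast path:

#include <stdatomic.h>
#include <stdio.h>

struct pref {
    atomic_int count;                /* base reference taken at init */
    void (*release)(struct pref *);  /* runs when the last ref drops */
};

static void pref_init(struct pref *r, void (*release)(struct pref *))
{
    atomic_init(&r->count, 1);       /* init itself holds one reference */
    r->release = release;
}

static void pref_get(struct pref *r)
{
    atomic_fetch_add(&r->count, 1);
}

static void pref_put(struct pref *r)
{
    if (atomic_fetch_sub(&r->count, 1) == 1)
        r->release(r);               /* we just dropped the last reference */
}

/* Like percpu_ref_kill(): drop the base reference taken by pref_init(). */
static void pref_kill(struct pref *r)
{
    pref_put(r);
}

static void release_cb(struct pref *r)
{
    (void)r;
    printf("released\n");
}

int main(void)
{
    struct pref r;

    pref_init(&r, release_cb);
    pref_get(&r);   /* an in-flight user, e.g. a bio holding the blkg */
    pref_kill(&r);  /* like blkg_destroy(): the object is going away */
    pref_put(&r);   /* the in-flight user finishes and "released" prints */
    return 0;
}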


Re: [PATCH 1/7] blkcg: relocate __blkg_release_rcu()

2017-11-14 Thread Shaohua Li
On Sun, Nov 12, 2017 at 02:26:07PM -0800, Tejun Heo wrote:
> Move __blkg_release_rcu() above blkg_alloc().  This is a pure code
> reorganization to prepare for the switch to percpu_ref.
> 
> Signed-off-by: Tejun Heo <t...@kernel.org>

Reviewed-by: Shaohua Li <s...@kernel.org>
> ---
>  block/blk-cgroup.c | 46 +++---
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d3f56ba..6482be5 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -81,6 +81,29 @@ static void blkg_free(struct blkcg_gq *blkg)
>   kfree(blkg);
>  }
>  
> +/*
> + * A group is RCU protected, but having an rcu lock does not mean that one
> + * can access all the fields of blkg and assume these are valid.  For
> + * example, don't try to follow throtl_data and request queue links.
> + *
> + * Having a reference to blkg under an rcu allows accesses to only values
> + * local to groups like group stats and group rate limits.
> + */
> +void __blkg_release_rcu(struct rcu_head *rcu_head)
> +{
> + struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, 
> rcu_head);
> +
> + /* release the blkcg and parent blkg refs this blkg has been holding */
> + css_put(&blkg->blkcg->css);
> + if (blkg->parent)
> + blkg_put(blkg->parent);
> +
> + wb_congested_put(blkg->wb_congested);
> +
> + blkg_free(blkg);
> +}
> +EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> +
>  /**
>   * blkg_alloc - allocate a blkg
>   * @blkcg: block cgroup the new blkg is associated with
> @@ -378,29 +401,6 @@ static void blkg_destroy_all(struct request_queue *q)
>  }
>  
>  /*
> - * A group is RCU protected, but having an rcu lock does not mean that one
> - * can access all the fields of blkg and assume these are valid.  For
> - * example, don't try to follow throtl_data and request queue links.
> - *
> - * Having a reference to blkg under an rcu allows accesses to only values
> - * local to groups like group stats and group rate limits.
> - */
> -void __blkg_release_rcu(struct rcu_head *rcu_head)
> -{
> - struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, 
> rcu_head);
> -
> - /* release the blkcg and parent blkg refs this blkg has been holding */
> - css_put(&blkg->blkcg->css);
> - if (blkg->parent)
> - blkg_put(blkg->parent);
> -
> - wb_congested_put(blkg->wb_congested);
> -
> - blkg_free(blkg);
> -}
> -EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> -
> -/*
>   * The next function used by blk_queue_for_each_rl().  It's a bit tricky
>   * because the root blkg uses @q->root_rl instead of its own rl.
>   */
> -- 
> 2.9.5
> 


Re: [PATCH 2/2] blk-throtl: add relative percentage support to latency=

2017-11-14 Thread Shaohua Li
On Thu, Nov 09, 2017 at 02:20:00PM -0800, Tejun Heo wrote:
> This patch updates latency= handling so that the latency target can
> also be specified as a percentage.  This allows, in addition to the
> default absolute latency target, to specify the latency target as a
> percentage of the baseline (say, 120% of the expected latency).
> 
> A given blkg can only have either absolute or percentage latency
> target.  The propagation is updated so that we always consider both
> targets and follow whatever is the least protecting on the path to the
> root.
> 
> The percentage latency target is specified and presented with the '%'
> suffix.
> 
>  $ echo 8:16 rbps=$((100<<20)) riops=100 wbps=$((100<<20)) wiops=100 \
>   idle=$((1000*1000)) latency=120% > io.low
>  $ cat io.low
>  8:16 rbps=104857600 wbps=104857600 riops=100 wiops=100 idle=1000000 
> latency=120%
> 
> Signed-off-by: Tejun Heo 
> Cc: Shaohua Li 

Reviewed-by: Shaohua Li 
> ---
>  block/blk-throttle.c |   66 
> +++
>  1 file changed, 51 insertions(+), 15 deletions(-)
> 
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -27,6 +27,7 @@ static int throtl_quantum = 32;
>  #define MIN_THROTL_BPS (320 * 1024)
>  #define MIN_THROTL_IOPS (10)
>  #define DFL_LATENCY_TARGET (-1L)
> +#define DFL_LATENCY_TARGET_PCT (-1L)
>  #define DFL_IDLE_THRESHOLD (0)
>  #define DFL_HD_BASELINE_LATENCY (4000L) /* 4ms */
>  #define LATENCY_FILTERED_SSD (0)
> @@ -164,8 +165,11 @@ struct throtl_grp {
>  
>   unsigned long last_check_time;
>  
> + /* Either both target and target_pct are DFL or neither is */
>   unsigned long latency_target; /* us */
>   unsigned long latency_target_conf; /* us */
> + unsigned long latency_target_pct; /* % */
> + unsigned long latency_target_pct_conf; /* % */
>   /* When did we start a new slice */
>   unsigned long slice_start[2];
>   unsigned long slice_end[2];
> @@ -511,6 +515,8 @@ static struct blkg_policy_data *throtl_p
>  
>   tg->latency_target = DFL_LATENCY_TARGET;
>   tg->latency_target_conf = DFL_LATENCY_TARGET;
> + tg->latency_target_pct = DFL_LATENCY_TARGET_PCT;
> + tg->latency_target_pct_conf = DFL_LATENCY_TARGET_PCT;
>   tg->idletime_threshold = DFL_IDLE_THRESHOLD;
>   tg->idletime_threshold_conf = DFL_IDLE_THRESHOLD;
>  
> @@ -1417,6 +1423,8 @@ static void tg_conf_updated(struct throt
>   parent_tg->idletime_threshold);
>   this_tg->latency_target = max(this_tg->latency_target,
>   parent_tg->latency_target);
> + this_tg->latency_target_pct = max(this_tg->latency_target_pct,
> + parent_tg->latency_target_pct);
>   }
>  
>   /*
> @@ -1528,7 +1536,7 @@ static u64 tg_prfill_limit(struct seq_fi
>   u64 bps_dft;
>   unsigned int iops_dft;
>   char idle_time[26] = "";
> - char latency_time[26] = "";
> + char latency_time[27] = ""; /* +1 for the optional '%' */
>  
>   if (!dname)
>   return 0;
> @@ -1569,8 +1577,11 @@ static u64 tg_prfill_limit(struct seq_fi
>   snprintf(idle_time, sizeof(idle_time), " idle=%lu",
>   tg->idletime_threshold_conf);
>  
> - if (tg->latency_target_conf == ULONG_MAX)
> + if (tg->latency_target_conf == DFL_LATENCY_TARGET)
>   strcpy(latency_time, " latency=max");
> + else if (tg->latency_target_pct_conf)
> + snprintf(latency_time, sizeof(latency_time),
> + " latency=%lu%%", tg->latency_target_pct_conf);
>   else
>   snprintf(latency_time, sizeof(latency_time),
>   " latency=%lu", tg->latency_target_conf);
> @@ -1597,7 +1608,7 @@ static ssize_t tg_set_limit(struct kernf
>   struct throtl_grp *tg;
>   u64 v[4];
>   unsigned long idle_time;
> - unsigned long latency_time;
> + unsigned long latency_time, latency_pct;
>   int ret;
>   int index = of_cft(of)->private;
>  
> @@ -1614,8 +1625,10 @@ static ssize_t tg_set_limit(struct kernf
>  
>   idle_time = tg->idletime_threshold_conf;
>   latency_time = tg->latency_target_conf;
> + latency_pct = tg->latency_target_pct_conf;
>   while (true) {
>   char tok[27];   /* wiops=18446744073709551616 */
> + char is_pct = 0;
>   char *p;
> 
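
As a rough standalone sketch of the is_pct parsing idea introduced above (an
invented helper, not the kernel code), the trailing '%' on the value is what
selects the percentage target:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "latency=120%" vs "latency=4000"; hypothetical helper only. */
static int parse_latency(const char *tok, unsigned long *val, int *is_pct)
{
    char *end;

    if (strncmp(tok, "latency=", 8) != 0)
        return -1;
    *val = strtoul(tok + 8, &end, 10);
    *is_pct = (*end == '%');          /* '%' suffix -> relative target */
    if (*end && strcmp(end, "%") != 0)
        return -1;                    /* trailing garbage */
    return 0;
}

int main(void)
{
    unsigned long v;
    int pct;

    if (parse_latency("latency=120%", &v, &pct) == 0)
        printf("value=%lu is_pct=%d\n", v, pct);   /* value=120 is_pct=1 */
    if (parse_latency("latency=4000", &v, &pct) == 0)
        printf("value=%lu is_pct=%d\n", v, pct);   /* value=4000 is_pct=0 */
    return 0;
}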

[GIT PULL] MD update for 4.15-rc1

2017-11-14 Thread Shaohua Li
Hi,

Please pull MD patches for 4.15. This update mostly includes bug fixes:
- md-cluster now supports raid10 from Guoqing
- raid5 PPL fixes from Artur
- badblock regression fix from Bo
- suspend hang related fixes from Neil
- raid5 reshape fixes from Neil
- raid1 freeze deadlock fix from Nate
- memleak fixes from Zdenek
- bitmap related fixes from Me and Tao
- other fixes and cleanup

Thanks,
Shaohua

The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:

  Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to 0868b99c214a3d55486c700de7c3f770b7243e7c:

  md: free unused memory after bitmap resize (2017-11-10 11:45:13 -0800)


Artur Paszkiewicz (3):
  raid5-ppl: don't resync after rebuild
  raid5-ppl: check recovery_offset when performing ppl recovery
  md: don't check MD_SB_CHANGE_CLEAN in md_allow_write

Colin Ian King (3):
  md-cluster: make function cluster_check_sync_size static
  md: raid10: remove a couple of redundant variables and initializations
  md: remove redundant variable q

Guoqing Jiang (7):
  md: always set THREAD_WAKEUP and wake up wqueue if thread existed
  md-cluster: fix wrong condition check in raid1_write_request
  md-cluster/raid10: set "do_balance = 0" if area is resyncing
  md-cluster: Suspend writes in RAID10 if within range
  md-cluster: Use a small window for raid10 resync
  raid1: remove obsolete code in raid1_write_request
  md-cluster: update document for raid10

Hou Tao (1):
  md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb

Liu Bo (1):
  badblocks: fix wrong return value in badblocks_set if badblocks are 
disabled

Matthias Kaehlcke (1):
  md: raid10: remove VLAIS

Mike Snitzer (1):
  md: rename some drivers/md/ files to have an "md-" prefix

Mikulas Patocka (1):
  md: use TASK_IDLE instead of blocking signals

Nate Dailey (1):
  raid1: prevent freeze_array/wait_all_barriers deadlock

NeilBrown (10):
  md: fix deadlock error in recent patch.
  raid5: Set R5_Expanded on parity devices as well as data.
  md: forbid a RAID5 from having both a bitmap and a journal.
  md: always hold reconfig_mutex when calling mddev_suspend()
  md: don't call bitmap_create() while array is quiesced.
  md: move suspend_hi/lo handling into core md code
  md: use mddev_suspend/resume instead of ->quiesce()
  md: allow metadata update while suspending.
  md: remove special meaning of ->quiesce(.., 2)
  md: be cautious about using ->curr_resync_completed for ->recovery_offset

Shaohua Li (2):
  md/bitmap: revert a patch
  md: use lockdep_assert_held

Zdenek Kabelac (2):
  md: release allocated bitset sync_set
  md: free unused memory after bitmap resize

 Documentation/md/md-cluster.txt|   3 +-
 MAINTAINERS|   7 +-
 block/badblocks.c  |   2 +-
 drivers/md/Kconfig |   5 +-
 drivers/md/Makefile|   5 +-
 drivers/md/dm-raid.c   |  12 +-
 drivers/md/{bitmap.c => md-bitmap.c}   |  27 -
 drivers/md/{bitmap.h => md-bitmap.h}   |   0
 drivers/md/md-cluster.c|  12 +-
 drivers/md/{faulty.c => md-faulty.c}   |   0
 drivers/md/{linear.c => md-linear.c}   |   2 +-
 drivers/md/{linear.h => md-linear.h}   |   0
 drivers/md/{multipath.c => md-multipath.c} |   4 +-
 drivers/md/{multipath.h => md-multipath.h} |   0
 drivers/md/md.c| 147 -
 drivers/md/md.h|  20 ++--
 drivers/md/raid0.c |   2 +-
 drivers/md/raid1.c |  78 -
 drivers/md/raid10.c| 169 +
 drivers/md/raid10.h|   6 +
 drivers/md/raid5-cache.c   |  44 +---
 drivers/md/raid5-log.h |   2 +-
 drivers/md/raid5-ppl.c |   6 +-
 drivers/md/raid5.c |  79 +++---
 24 files changed, 409 insertions(+), 223 deletions(-)
 rename drivers/md/{bitmap.c => md-bitmap.c} (99%)
 rename drivers/md/{bitmap.h => md-bitmap.h} (100%)
 rename drivers/md/{faulty.c => md-faulty.c} (100%)
 rename drivers/md/{linear.c => md-linear.c} (99%)
 rename drivers/md/{linear.h => md-linear.h} (100%)
 rename drivers/md/{multipath.c => md-multipath.c} (99%)
 rename drivers/md/{multipath.h => md-multipath.h} (100%)


Re: [PATCH 1/2] blk-throtl: make latency= absolute

2017-11-13 Thread Shaohua Li
On Mon, Nov 13, 2017 at 06:18:49AM -0800, Tejun Heo wrote:
> Hello, Shaohua.  Just a bit of addition.
> 
> On Mon, Nov 13, 2017 at 03:27:10AM -0800, Tejun Heo wrote:
> > What I'm trying to say is that the latency is defined as "from bio
> > issue to completion", not "in-flight time on device".  Whether the
> > on-device latency is 50us or 500us, the host side queueing latency can
> > be in orders of magnitude higher.
> > 
> > For things like starvation protection for managerial workloads which
> > work fine on rotating disks, the only thing we need to protect against
> > is excessive host side queue overflowing leading to starvation of such
> > workloads.  IOW, we're talking about latency target in tens or lower
> > hundreds of millisecs.  Whether the on-device time is 50 or 500us
> > doesn't matter that much.
> 
> So, the absolute latency target can express the requirements of the
> workload in question - it's saying "if the IO latency stays within
> this boundary, regardless of the underlying device, this workload is
> gonna be happy enough".  There are workloads which are this way -
> e.g. it has some IOs to do and some deadline requirements (like
> heartbeat period).  For those workloads, it doesn't matter what the
> underlying device is.  It can be a rotating disk, or a slow or
> lightening-fast SSD.  As long as the absolute target latency is met,
> lightning-fast SSD.  As long as the absolute target latency is met,

I think this is where we disagree. The user doesn't really care about IO
latency as such; what the user cares about is 'read' syscall latency or
'fsync' syscall latency. A single syscall could issue several 4k IOs, 1M IOs,
or a mix. To meet the syscall latency target, we must control the latency of
each IO. An absolute latency target can only control some of those IOs, so
it's very likely the syscall latency requirement isn't met. We do need to know
what the underlying device is.
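
To put rough numbers on this (the figures are the illustrative ones from this
thread, not measurements), a standalone sketch of how unevenly one absolute
per-IO target constrains different IO sizes:

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: healthy per-IO latencies on an SSD. */
    const double base_4k = 60.0;   /* us */
    const double base_1m = 500.0;  /* us */
    const double target = 600.0;   /* one absolute per-IO target for both */

    /* How far each IO size may degrade before the target trips. */
    printf("4k IO may degrade %.1fx before throttling reacts\n",
           target / base_4k);      /* 10.0x */
    printf("1M IO may degrade %.1fx before throttling reacts\n",
           target / base_1m);      /* 1.2x */
    return 0;
}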

That said, absolute latency is useful for HD. But then the HD baseline is
always 4ms for any IO size, so an absolute latency is just 4ms + slack, unless
you want to specify an absolute latency smaller than 4ms.

Thanks,
Shaohua


Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times

2017-11-13 Thread Shaohua Li
On Mon, Nov 13, 2017 at 07:57:45AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 13, 2017 at 03:13:48AM -0800, Tejun Heo wrote:
> > You're right.  If we wanna take this approach, we need to keep the
> > throttled flag while cloning.  The clearing part is still correct tho.
> > Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
> > where that 1/4 is coming from tho.  Will investigate more.
> 
> Okay, this is because when we spiit, the split bio is the first part
> Okay, this is because when we split, the split bio is the first part
> which gets issued and then the original bio is wound forward and
> requeued.  So, for the splits, the original bio is the one which gets
> trimmed in the front and requeued, so not clearing BIO_THROTTLED is
> enough.  I think we should still copy BIO_THROTTLED on clones so that
> we don't get surprises w/ other bio drivers.
Oh, I think there is a better solution that avoids adding a new bdev field. We
always set the BIO_THROTTLED flag after block-throttle and copy the flag on
clone; in bio_set_dev() we clear the flag. This should work, I think.
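
A toy model of that flag discipline (invented types and names, a sketch of
the idea rather than kernel code): the flag survives clones and is reset only
when the bio is pointed at a device.

#include <stdbool.h>
#include <stdio.h>

struct bio { int dev; bool throttled; };

static void bio_set_dev(struct bio *bio, int dev)
{
    bio->dev = dev;
    bio->throttled = false;  /* retargeted: eligible for throttling again */
}

static struct bio bio_clone(const struct bio *src)
{
    return *src;             /* a clone inherits BIO_THROTTLED */
}

static bool throtl_charge(struct bio *bio)
{
    if (bio->throttled)
        return false;        /* already charged for this device */
    bio->throttled = true;
    return true;
}

int main(void)
{
    struct bio b = { .dev = 1, .throttled = false };

    printf("first submit charged: %d\n", throtl_charge(&b));        /* 1 */
    struct bio split = bio_clone(&b);
    printf("requeued split charged: %d\n", throtl_charge(&split));  /* 0 */
    bio_set_dev(&split, 2);  /* e.g. dm/md remaps the bio */
    printf("after remap charged: %d\n", throtl_charge(&split));     /* 1 */
    return 0;
}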

Thanks,
Shaohua


Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times

2017-11-13 Thread Shaohua Li
On Mon, Nov 13, 2017 at 07:57:45AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 13, 2017 at 03:13:48AM -0800, Tejun Heo wrote:
> > You're right.  If we wanna take this approach, we need to keep the
> > throttled flag while cloning.  The clearing part is still correct tho.
> > Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
> > where that 1/4 is coming from tho.  Will investigate more.
> 
> Okay, this is because when we split, the split bio is the first part
> which gets issued and then the original bio is wound forward and
> requeued.  So, for the splits, the original bio is the one which gets
> trimmed in the front and requeued, so not clearing BIO_THROTTLED is
> enough.  I think we should still copy BIO_THROTTLED on clones so that
> we don't get surprises w/ other bio drivers.

I'm not sure how you are going to make this correct; the mechanism is very
fragile. For example, 'q->make_request_fn(q, bio)' could just queue the bio
somewhere and handle it in another context (both dm and md do this). The bio
will then be resubmitted with generic_make_request, and in that case the
second pass shouldn't throttle it. The bio could also be resubmitted with
generic_make_request but with the bdev changed, and in that case the second
pass should throttle it (against the new bdev). There are a lot of different
usages of bio, and I'd rather not depend on generic_make_request dispatching
a new bio immediately. That's why I added a bdev field in my patch.
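
A toy model of the bdev-remembering idea (invented types and names; the real
patch is more involved): the throttling decision keys off whether the bio was
already charged against the current device, independent of when or where it
is resubmitted.

#include <stdbool.h>
#include <stdio.h>

struct bio { int dev; int throttled_dev; };  /* 0 means not throttled yet */

static bool should_throttle(const struct bio *bio)
{
    return bio->throttled_dev != bio->dev;   /* charge once per device */
}

static void charge(struct bio *bio)
{
    bio->throttled_dev = bio->dev;
}

int main(void)
{
    struct bio b = { .dev = 1, .throttled_dev = 0 };

    if (should_throttle(&b))
        charge(&b);                          /* first submission: charged */
    printf("resubmit on same dev: throttle=%d\n", should_throttle(&b)); /* 0 */
    b.dev = 2;                               /* dm/md remapped the bio */
    printf("resubmit on new dev: throttle=%d\n", should_throttle(&b));  /* 1 */
    return 0;
}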

Thanks,
Shaohua


Re: [PATCH 1/2] blk-throtl: make latency= absolute

2017-11-12 Thread Shaohua Li
On Fri, Nov 10, 2017 at 07:43:14AM -0800, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Thu, Nov 09, 2017 at 08:27:13PM -0800, Shaohua Li wrote:
> > I think the absolute latency would only work for HD. For a SSD, a 4k latency
> > probably is 60us and 1M latency is 500us. The disk must be very contended to
> > make 4k latency reach 500us. Not sensitive doesn't mean no protection. If 
> > the
> > use case sets rough latency, say 1ms, there will be no protection for 4k IO 
> > at
> > all. The baseline latency is pretty reliable for SSD actually. So I'd rather
> 
> I don't understand how that would mean no protection.  The latency
> naturally includes the queueing time on the host side and, even for a
> fast SSD device, it isn't too difficult to saturate the device to the
> point where the host-side waiting time becomes pretty long.  All
> that's necessary is IOs being issued faster than completed and we can
> almost always do that.

I didn't get this. What did you mean by 'queueing time on the host side'? Do
you mean the application think-time delay?

My point is that an absolute latency doesn't protect as we expect. Let me give
an example. Say 4k latency is 60us when BW is 100MB/s, and when 4k BW drops to
50MB/s the latency is 200us; 1M latency is 500us. If you set the absolute
latency to 600us, you can't protect the 4k BW to stay above 50MB/s. To get
that protection, you really want to set the absolute latency below 500us,
which doesn't work for the 1M IO.
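
Spelling that arithmetic out in a standalone sketch (same illustrative
numbers as above):

#include <stdio.h>

int main(void)
{
    const double lat_4k_healthy = 60.0;   /* us, 4k at 100MB/s */
    const double lat_4k_halved  = 200.0;  /* us, once 4k BW fell to 50MB/s */
    const double lat_1m_healthy = 500.0;  /* us, 1M on an uncontended disk */
    const double target         = 600.0;  /* chosen absolute target, us */

    /* 4k latency can more than triple (BW already halved) and still sit
     * below the target, so the 4k workload gets no protection ... */
    printf("4k: healthy %.0fus, degraded %.0fus, target %.0fus -> no throttling\n",
           lat_4k_healthy, lat_4k_halved, target);
    /* ... while any target under 500us would flag perfectly healthy 1M IO. */
    printf("healthy 1M: %.0fus -> target cannot go below this\n",
           lat_1m_healthy);
    return 0;
}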
 
> > keeping the baseline latency for SSD but using absolute latency for HD, 
> > which
> > can be done easily by setting DFL_HD_BASELINE_LATENCY to 0.
> 
> I don't think that'd be a good interface choice.  It's too misleading.
> If we actually need to specify baseline + margin, it'd probably be
> better to add another notation - say, "+N" - than overloading the
> meaning of "N".

We don't overload the meaning of "N". Until your next patch, the "N" actually
means "+N".

Pondering this a little more, I think a 4ms base latency for HD actually is
reasonable. We have LATENCY_FILTERED_HD to filter out small-latency bios,
which come from sequential IO, so the remaining IO is random IO. A 4ms base
latency for HD random IO should be OK. Probably something else is wrong; I
think we need to understand what's wrong with HD throttling first, before we
make any change.

Thanks,
Shaohua


Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times

2017-11-12 Thread Shaohua Li
On Sun, Nov 12, 2017 at 02:26:13PM -0800, Tejun Heo wrote:
> BIO_THROTTLED is used to mark already throttled bios so that a bio
> doesn't get throttled multiple times.  The flag gets set when the bio
> starts getting dispatched from blk-throtl and cleared when it leaves
> blk-throtl.
> 
> Unfortunately, this doesn't work when the request_queue decides to
> split or requeue the bio and ends up throttling the same IO multiple
> times.  This condition gets easily triggered and often leads to
> multiple times lower bandwidth limit being enforced than configured.
> 
> Fix it by always setting BIO_THROTTLED for bios recursing to the same
> request_queue and clearing only when a bio leaves the current level.
> 
> Signed-off-by: Tejun Heo 
> ---
>  block/blk-core.c   | 10 +++---
>  block/blk-throttle.c   |  8 
>  include/linux/blk-cgroup.h | 20 
>  3 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ad23b96..f0e3157 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2216,11 +2216,15 @@ blk_qc_t generic_make_request(struct bio *bio)
>*/
>   bio_list_init(&lower);
>   bio_list_init(&same);
> - while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
> - if (q == bio->bi_disk->queue)
> + while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) {
> + if (q == bio->bi_disk->queue) {
> + blkcg_bio_repeat_q_level(bio);
>   bio_list_add(&same, bio);
> - else
> + } else {
> + blkcg_bio_leave_q_level(bio);
>   bio_list_add(&lower, bio);
> + }
> + }

Hi Tejun,

Thanks for looking into this while I was away. I don't understand how this
works. Assume a bio will be split into 2 smaller bios. In
generic_make_request, we charge the whole bio. 'q->make_request_fn' will
dispatch the first small bio, and call generic_make_request for the second
small bio. generic_make_request then charges the second small bio and adds it
to current->bio_list[0] (please check the order). In the code the patch
changes above, we pop the second small bio and set BIO_THROTTLED for it. But
this is already too late, because generic_make_request has already charged
the second small bio.
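
The ordering can be seen in a grossly simplified model of the on-stack bio
list (invented names; the real generic_make_request differs in many ways):
the split remainder is charged on entry to the recursive call, before the
outer loop ever pops it and could mark it BIO_THROTTLED.

#include <stdbool.h>
#include <stdio.h>

struct bio { int id; bool throttled; };

static struct bio *pending[8];
static int npending;
static bool active;                 /* models current->bio_list being set */

static struct bio second = { .id = 2, .throttled = false };

static void charge(struct bio *bio)
{
    if (!bio->throttled)
        printf("bio %d charged\n", bio->id);
}

static void generic_make_request(struct bio *bio);

/* A make_request_fn that splits: it resubmits the remainder (bio 2). */
static void make_request(struct bio *bio)
{
    if (bio->id == 1)
        generic_make_request(&second);
}

static void generic_make_request(struct bio *bio)
{
    charge(bio);                    /* charging happens on entry ... */
    if (active) {                   /* recursive call: queue and return */
        pending[npending++] = bio;
        return;
    }
    active = true;
    make_request(bio);
    while (npending) {              /* ... the flag is only set on pop */
        struct bio *next = pending[--npending];
        next->throttled = true;     /* too late: charge() already ran */
        make_request(next);
    }
    active = false;
}

int main(void)
{
    struct bio first = { .id = 1, .throttled = false };

    generic_make_request(&first);   /* prints: bio 1 charged, bio 2 charged */
    return 0;
}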

Did you look at my original patch
(https://marc.info/?l=linux-block&m=150791825327628&w=2)? Is anything wrong
with it?

Thanks,
Shaohua

>   /* now assemble so we handle the lowest level first */
>   bio_list_merge(&bio_list_on_stack[0], &lower);
>   bio_list_merge(&bio_list_on_stack[0], &same);
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 1e6916b..76579b2 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2223,14 +2223,6 @@ bool blk_throtl_bio(struct request_queue *q, struct 
> blkcg_gq *blkg,
>  out_unlock:
>   spin_unlock_irq(q->queue_lock);
>  out:
> - /*
> -  * As multiple blk-throtls may stack in the same issue path, we
> -  * don't want bios to leave with the flag set.  Clear the flag if
> -  * being issued.
> -  */
> - if (!throttled)
> - bio_clear_flag(bio, BIO_THROTTLED);
> -
>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>   if (throttled || !td->track_bio_latency)
>   bio->bi_issue_stat.stat |= SKIP_LATENCY;
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index f2f9691..bed0416 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -675,9 +675,29 @@ static inline void blkg_rwstat_add_aux(struct 
> blkg_rwstat *to,
>  #ifdef CONFIG_BLK_DEV_THROTTLING
>  extern bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>  struct bio *bio);
> +
> +static inline void blkcg_bio_repeat_q_level(struct bio *bio)
> +{
> + /*
> +  * @bio is queued while processing a previous bio which was already
> +  * throttled.  Don't throttle it again.
> +  */
> + bio_set_flag(bio, BIO_THROTTLED);
> +}
> +
> +static inline void blkcg_bio_leave_q_level(struct bio *bio)
> +{
> + /*
> +  * @bio may get throttled at multiple q levels, clear THROTTLED
> +  * when leaving the current one.
> +  */
> + bio_clear_flag(bio, BIO_THROTTLED);
> +}
>  #else
>  static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
> struct bio *bio) { return false; }
> +static inline void blkcg_bio_repeat_q_level(struct bio *bio) { }
> +static inline void blkcg_bio_leave_q_level(struct bio *bio) { }
>  #endif
>  
>  static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -- 
> 2.9.5
> 


Re: [PATCH 1/2] blk-throtl: make latency= absolute

2017-11-09 Thread Shaohua Li
On Thu, Nov 09, 2017 at 03:42:58PM -0800, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Thu, Nov 09, 2017 at 03:12:12PM -0800, Shaohua Li wrote:
> > The percentage latency makes sense, but the absolute latency doesn't to me.
> > A 4k IO latency could be much smaller than a 1M IO latency. If we don't add
> > baseline latency, we can't specify a latency target which works for both 4k
> > and 1M IO.
> 
> It isn't adaptive for sure.  I think it's still useful for the
> following reasons.
> 
> 1. The absolute latency target is by nature both workload and device
>dependent.  For a lot of use cases, coming up with a decent number
>should be possible.
> 
> > 2. There are many use cases which aren't sensitive to the level where
> >they care much about the difference between small and large
> >requests.  e.g. protecting a managerial job so that it doesn't
>completely stall doesn't require tuning things to that level.  A
>value which is comfortably higher than usually expected latencies
>would often be enough (say 100ms).
> 
> 3. It's also useful for verification / testing.

I think the absolute latency would only work for HD. For an SSD, a 4k latency
is probably 60us and a 1M latency 500us. The disk must be very contended to
make 4k latency reach 500us. Not sensitive doesn't mean no protection. If the
use case sets a rough latency, say 1ms, there will be no protection for 4k IO at
all. The baseline latency is pretty reliable for SSD actually. So I'd rather
keep the baseline latency for SSD but use absolute latency for HD, which
can be done easily by setting DFL_HD_BASELINE_LATENCY to 0.
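
To put numbers on it (a toy calculation using the 60us/500us SSD baselines
mentioned above; not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int base_4k = 60, base_1m = 500;	/* usec, hypothetical */
	unsigned int absolute = 1000;			/* 1ms absolute target */
	unsigned int offset = 100;			/* baseline + offset */

	printf("4k IO: absolute %uus vs baseline+offset %uus\n",
	       absolute, base_4k + offset);
	printf("1M IO: absolute %uus vs baseline+offset %uus\n",
	       absolute, base_1m + offset);
	/*
	 * With the 1ms absolute target a 4k IO must degrade more than 16x
	 * from its 60us baseline before it counts as a bad bio, so small
	 * IO effectively gets no protection; the per-bucket baseline keeps
	 * the threshold proportionate to the request size.
	 */
	return 0;
}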

Thanks,
Shaohua


Re: [PATCH 1/2] blk-throtl: make latency= absolute

2017-11-09 Thread Shaohua Li
On Thu, Nov 09, 2017 at 02:19:24PM -0800, Tejun Heo wrote:
> Currently, the latency= parameter specifies the extra latency on top
> of the estimated baseline latency.  This doesn't seem ideal for the
> following two reasons.
> 
> 1. Sometimes we want to use absolute target latencies, especially as
>the baseline latency estimation isn't always reliable.
> 
> 2. If we want to set a latency target relative to the estimated
>baseline latency, it makes a lot more sense to express the target
>as a percentage of the baseline (say, 120% of the expected latency)
>rather than the baseline latency + an absolute offset.
> 
> This patch makes the existing latency= parameter to be interpreted as
> an absolute latency target instead of an offset to the baseline.  The
> next patch will add support for relative latency target expressed in
> terms of percentage.
> 
> While this is a userland visible change, io.low support is still
> evolving and highly experimental.  This isn't expected to break any
> actual usages.

The percentage latency makes sense, but the absolute latency doesn't to me. A
4k IO latency could be much smaller than 1M IO latency. If we don't add
baseline latency, we can't specify a latency target which works for both 4k and
1M IO.

Thanks,
Shaohua
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Cc: Shaohua Li <s...@kernel.org>
> ---
>  block/blk-throttle.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2299,8 +2299,7 @@ void blk_throtl_bio_endio(struct bio *bi
>  
>   bucket = request_bucket_index(
>   blk_stat_size(&bio->bi_issue_stat));
> - threshold = tg->td->avg_buckets[bucket].latency +
> - tg->latency_target;
> + threshold = tg->latency_target;
>   if (lat > threshold)
>   tg->bad_bio_cnt++;
>   /*


[PATCH V2] kthread: zero the kthread data structure

2017-11-07 Thread Shaohua Li
kthread() could bail out early before we initialize blkcg_css (if the
kthread is killed very early; please see the xchg() statement in kthread()),
which confuses free_kthread_struct. Instead of moving the blkcg_css
initialization earlier, we simply zero the whole 'self' data structure,
which doesn't add much overhead.

Reported-by: syzbot <syzkal...@googlegroups.com>
Fixes: 05e3db95ebfc ("kthread: add a mechanism to store cgroup info")
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Tejun Heo <t...@kernel.org>
Cc: Dmitry Vyukov <dvyu...@google.com>
Signed-off-by: Shaohua Li <s...@fb.com>
---
 kernel/kthread.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index f87cd8b4..8dbe245 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -204,7 +204,7 @@ static int kthread(void *_create)
struct kthread *self;
int ret;
 
-   self = kmalloc(sizeof(*self), GFP_KERNEL);
+   self = kzalloc(sizeof(*self), GFP_KERNEL);
set_kthread_struct(self);
 
/* If user was SIGKILLed, I release the structure. */
@@ -220,13 +220,9 @@ static int kthread(void *_create)
do_exit(-ENOMEM);
}
 
-   self->flags = 0;
self->data = data;
init_completion(&self->exited);
init_completion(&self->parked);
-#ifdef CONFIG_BLK_CGROUP
-   self->blkcg_css = NULL;
-#endif
current->vfork_done = &self->exited;
 
/* OK, tell user we're spawned, wait for stop or wakeup */
-- 
2.9.5
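
The failure mode is the classic uninitialized-field-meets-early-exit
pattern; a userspace analogy of what kzalloc() buys here (not kernel code):

#include <stdio.h>
#include <stdlib.h>

struct kthread_like {
	void *blkcg_css;	/* initialized late in the real code */
};

/* Analogue of free_kthread_struct() inspecting blkcg_css. */
static void free_kthread_like(struct kthread_like *self)
{
	if (self && self->blkcg_css)	/* reads garbage if never set */
		printf("putting css %p (bogus)\n", self->blkcg_css);
	free(self);
}

int main(void)
{
	/* kmalloc() analogue: contents are indeterminate. */
	struct kthread_like *self = malloc(sizeof(*self));

	/* "killed before initialization": bail out before blkcg_css is
	 * assigned, then the free path may trip over a stale value. */
	free_kthread_like(self);

	/* kzalloc() analogue: the early-exit path sees a NULL css. */
	self = calloc(1, sizeof(*self));
	free_kthread_like(self);
	return 0;
}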



[PATCH] kthread: move the cgroup info initialization early

2017-11-07 Thread Shaohua Li
kthread() could bail out early before we initialize blkcg_css (if the
kthread is killed very soon), which confuses free_kthread_struct. Move
the blkcg_css initialization earlier.

Reported-by: syzbot <syzkal...@googlegroups.com>
Fixes: 05e3db95ebfc ("kthread: add a mechanism to store cgroup info")
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Tejun Heo <t...@kernel.org>
Signed-off-by: Shaohua Li <s...@fb.com>
---
 kernel/kthread.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index f87cd8b4..cf5c113 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -205,6 +205,10 @@ static int kthread(void *_create)
int ret;
 
self = kmalloc(sizeof(*self), GFP_KERNEL);
+#ifdef CONFIG_BLK_CGROUP
+   if (self)
+   self->blkcg_css = NULL;
+#endif
set_kthread_struct(self);
 
/* If user was SIGKILLed, I release the structure. */
@@ -224,9 +228,6 @@ static int kthread(void *_create)
self->data = data;
init_completion(&self->exited);
init_completion(&self->parked);
-#ifdef CONFIG_BLK_CGROUP
-   self->blkcg_css = NULL;
-#endif
current->vfork_done = &self->exited;
 
/* OK, tell user we're spawned, wait for stop or wakeup */
-- 
2.9.5



Re: [PATCH] md: Convert timers to use timer_setup()

2017-10-16 Thread Shaohua Li
On Mon, Oct 16, 2017 at 05:01:48PM -0700, Kees Cook wrote:
> In preparation for unconditionally passing the struct timer_list pointer to
> all timer callbacks, switch to using the new timer_setup() and from_timer()
> to pass the timer pointer explicitly.

If you send the md.c part along, I'll apply. Or if you want this merged in
other tree, you can add my 'reviewed-by: Shaohua Li <s...@fb.com>' for md.c
part.
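
For readers unfamiliar with the conversion, the pattern is mechanical; a
schematic example with a made-up my_dev structure (not code from this patch):

#include <linux/jiffies.h>
#include <linux/timer.h>

struct my_dev {
	struct timer_list timer;
	int pending;
};

/*
 * Old style (pre-4.15):
 *
 *	static void my_timeout(unsigned long data)
 *	{
 *		struct my_dev *dev = (struct my_dev *)data;
 *		...
 *	}
 *	setup_timer(&dev->timer, my_timeout, (unsigned long)dev);
 */

/* New style: the callback receives the timer_list pointer and recovers
 * the containing structure with from_timer(container, timer, member). */
static void my_timeout(struct timer_list *t)
{
	struct my_dev *dev = from_timer(dev, t, timer);

	dev->pending = 0;
}

static void my_dev_init(struct my_dev *dev)
{
	dev->pending = 1;
	timer_setup(&dev->timer, my_timeout, 0);
	mod_timer(&dev->timer, jiffies + HZ);
}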

> Cc: Kent Overstreet <kent.overstr...@gmail.com>
> Cc: Shaohua Li <s...@kernel.org>
> Cc: Alasdair Kergon <a...@redhat.com>
> Cc: Mike Snitzer <snit...@redhat.com>
> Cc: dm-de...@redhat.com
> Cc: linux-bca...@vger.kernel.org
> Cc: linux-r...@vger.kernel.org
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
>  drivers/md/bcache/stats.c | 8 +++-
>  drivers/md/dm-delay.c | 6 +++---
>  drivers/md/dm-integrity.c | 6 +++---
>  drivers/md/dm-raid1.c | 8 +++-
>  drivers/md/md.c   | 9 -
>  5 files changed, 16 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/md/bcache/stats.c b/drivers/md/bcache/stats.c
> index 0ca072c20d0d..93a89c528760 100644
> --- a/drivers/md/bcache/stats.c
> +++ b/drivers/md/bcache/stats.c
> @@ -146,9 +146,9 @@ static void scale_stats(struct cache_stats *stats, unsigned long rescale_at)
>   }
>  }
>  
> -static void scale_accounting(unsigned long data)
> +static void scale_accounting(struct timer_list *t)
>  {
> - struct cache_accounting *acc = (struct cache_accounting *) data;
> + struct cache_accounting *acc = from_timer(acc, t, timer);
>  
>  #define move_stat(name) do { \
>   unsigned t = atomic_xchg(&acc->collector.name, 0);  \
> @@ -233,9 +233,7 @@ void bch_cache_accounting_init(struct cache_accounting *acc,
>   kobject_init(&acc->day.kobj, &bch_stats_ktype);
>  
>   closure_init(&acc->cl, parent);
> - init_timer(&acc->timer);
> + timer_setup(&acc->timer, scale_accounting, 0);
>   acc->timer.expires  = jiffies + accounting_delay;
> - acc->timer.data = (unsigned long) acc;
> - acc->timer.function = scale_accounting;
>   add_timer(&acc->timer);
>  }
> diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
> index 2209a9700acd..288386bfbfb5 100644
> --- a/drivers/md/dm-delay.c
> +++ b/drivers/md/dm-delay.c
> @@ -44,9 +44,9 @@ struct dm_delay_info {
>  
>  static DEFINE_MUTEX(delayed_bios_lock);
>  
> -static void handle_delayed_timer(unsigned long data)
> +static void handle_delayed_timer(struct timer_list *t)
>  {
> - struct delay_c *dc = (struct delay_c *)data;
> + struct delay_c *dc = from_timer(dc, t, delay_timer);
>  
>   queue_work(dc->kdelayd_wq, &dc->flush_expired_bios);
>  }
> @@ -195,7 +195,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>   goto bad_queue;
>   }
>  
> - setup_timer(&dc->delay_timer, handle_delayed_timer, (unsigned long)dc);
> + timer_setup(&dc->delay_timer, handle_delayed_timer, 0);
>  
>   INIT_WORK(&dc->flush_expired_bios, flush_expired_bios);
>   INIT_LIST_HEAD(&dc->delayed_bios);
> diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
> index 096fe9b66c50..98f0b645b839 100644
> --- a/drivers/md/dm-integrity.c
> +++ b/drivers/md/dm-integrity.c
> @@ -1093,9 +1093,9 @@ static void sleep_on_endio_wait(struct dm_integrity_c *ic)
>   __remove_wait_queue(&ic->endio_wait, &wait);
>  }
>  
> -static void autocommit_fn(unsigned long data)
> +static void autocommit_fn(struct timer_list *t)
>  {
> - struct dm_integrity_c *ic = (struct dm_integrity_c *)data;
> + struct dm_integrity_c *ic = from_timer(ic, t, autocommit_timer);
>  
>   if (likely(!dm_integrity_failed(ic)))
>   queue_work(ic->commit_wq, &ic->commit_work);
> @@ -2941,7 +2941,7 @@ static int dm_integrity_ctr(struct dm_target *ti, unsigned argc, char **argv)
>  
>   ic->autocommit_jiffies = msecs_to_jiffies(sync_msec);
>   ic->autocommit_msec = sync_msec;
> - setup_timer(&ic->autocommit_timer, autocommit_fn, (unsigned long)ic);
> + timer_setup(&ic->autocommit_timer, autocommit_fn, 0);
>  
>   ic->io = dm_io_client_create();
>   if (IS_ERR(ic->io)) {
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index c0b82136b2d1..580c49cc8079 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -94,9 +94,9 @@ static void wakeup_mirrord(void *context)
>   queue_work(ms->kmirrord_wq, &ms->kmirrord_work);
>  }
>  
> -static void delayed_wake_fn(unsigned long data)
> +static void delayed_wake_fn(struct timer_list *t)
>  {
> - struct mirror_set *ms = (struct mirror_set *) data;
> + struct mirror_set *ms = from_timer(ms, t, timer);

Re: [PATCH] md: raid10: remove VLAIS

2017-10-05 Thread Shaohua Li
On Fri, Oct 06, 2017 at 01:22:12PM +1100, Neil Brown wrote:
> On Thu, Oct 05 2017, Matthias Kaehlcke wrote:
> 
> > Hi Neil,
> >
> > On Fri, Oct 06, 2017 at 10:58:59AM +1100, NeilBrown said:
> >
> >> On Thu, Oct 05 2017, Matthias Kaehlcke wrote:
> >> 
> >> > The raid10 driver can't be built with clang since it uses a variable
> >> > length array in a structure (VLAIS):
> >> >
> >> > drivers/md/raid10.c:4583:17: error: fields must have a constant size:
> >> >   'variable length array in structure' extension will never be supported
> >> >
> >> > Allocate the r10bio struct with kmalloc instead of using the VLAIS
> >> > construct.
> >> >
> >> > Signed-off-by: Matthias Kaehlcke 
> >> > ---
> >> >  drivers/md/raid10.c | 13 -
> >> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> > index 374df5796649..9616163eaf8c 100644
> >> > --- a/drivers/md/raid10.c
> >> > +++ b/drivers/md/raid10.c
> >> > @@ -4578,15 +4578,16 @@ static int handle_reshape_read_error(struct mddev *mddev,
> >> >  /* Use sync reads to get the blocks from somewhere else */
> >> >  int sectors = r10_bio->sectors;
> >> >  struct r10conf *conf = mddev->private;
> >> > -struct {
> >> > -struct r10bio r10_bio;
> >> > -struct r10dev devs[conf->copies];
> >> > -} on_stack;
> >> > -struct r10bio *r10b = &on_stack.r10_bio;
> >> > +struct r10bio *r10b;
> >> >  int slot = 0;
> >> >  int idx = 0;
> >> >  struct page **pages;
> >> >  
> >> > +r10b = kmalloc(sizeof(*r10b) +
> >> > +   sizeof(struct r10dev) * conf->copies, GFP_KERNEL);
> >> 
> >> GFP_KERNEL isn't a good idea here.
> >> This could wait for writeback, and if writeback tries to write to the
> >> region of the array which is being reshaped, it might deadlock.
> >> 
> >> GFP_NOIO is safer.
> >
> > Good point, thanks!
> >
> >> given that conf->copies is almost always 2 it might be nicer to
> >> have
> >> 
> >>struct {
> >>struct r10bio r10_bio;
> >>struct r10dev devs[2];
> >>} on_stack;
> >> 
> >> struct r10bio *r10b;
> >> 
> >>if (conf->copies <= ARRAY_SIZE(on_stack.devs))
> >>r10b = &on_stack.r10_bio;
> >> else
> >>r10b = kmalloc(sizeof(*r10b) +
> >>   sizeof(struct r10dev) * conf->copies, GFP_NOIO);
> >
> > It would also add an extra condition to determine if r10b needs to
> > be freed or not.
> 
> True.
> 
> >
> > Given that array reshaping is a rare operation and an error during
> > this operation is an exceptional condition I think the simpler code
> > with always dynamic allocation is preferable. That said I'm fine with
> > reworking the patch according to your suggestion if you or Shaohua
> > prefer it.
> 
> I don't feel strongly about it.  As long as the GFP_KERNEL->GFP_NOIO
> change happens I'm OK with this patch.

Let's use GFP_NOIO then, it should not be a big deal. I updated the patch.
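
For completeness, the extra free condition Matthias mentions would look
roughly like this in every exit path (a sketch of Neil's hybrid, not the
patch as applied):

	struct {
		struct r10bio r10_bio;
		struct r10dev devs[2];
	} on_stack;
	struct r10bio *r10b;

	if (conf->copies <= ARRAY_SIZE(on_stack.devs)) {
		r10b = &on_stack.r10_bio;
	} else {
		r10b = kmalloc(sizeof(*r10b) +
			       sizeof(struct r10dev) * conf->copies, GFP_NOIO);
		if (!r10b)
			return -ENOMEM;
	}

	/* ... use r10b exactly as before ... */

	if (r10b != &on_stack.r10_bio)	/* only free the heap case */
		kfree(r10b);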

Thanks,
Shaohua


Re: [PATCH] md: raid10: remove VLAIS

2017-10-05 Thread Shaohua Li
On Thu, Oct 05, 2017 at 11:28:47AM -0700, Matthias Kaehlcke wrote:
> The raid10 driver can't be built with clang since it uses a variable
> length array in a structure (VLAIS):
> 
> drivers/md/raid10.c:4583:17: error: fields must have a constant size:
>   'variable length array in structure' extension will never be supported
> 
> Allocate the r10bio struct with kmalloc instead of using the VLAIS
> construct.
> 
> Signed-off-by: Matthias Kaehlcke 
> ---
>  drivers/md/raid10.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 374df5796649..9616163eaf8c 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4578,15 +4578,16 @@ static int handle_reshape_read_error(struct mddev *mddev,
>   /* Use sync reads to get the blocks from somewhere else */
>   int sectors = r10_bio->sectors;
>   struct r10conf *conf = mddev->private;
> - struct {
> - struct r10bio r10_bio;
> - struct r10dev devs[conf->copies];
> - } on_stack;
> - struct r10bio *r10b = &on_stack.r10_bio;
> + struct r10bio *r10b;
>   int slot = 0;
>   int idx = 0;
>   struct page **pages;
>  
> + r10b = kmalloc(sizeof(*r10b) +
> +sizeof(struct r10dev) * conf->copies, GFP_KERNEL);
> + if (!r10b)
> + return -ENOMEM;

I'll apply this patch with 'set_bit(MD_RECOVERY_INTR, &mddev->recovery);' added
here, thanks!

> +
>   /* reshape IOs share pages from .devs[0].bio */
>   pages = get_resync_pages(r10_bio->devs[0].bio)->pages;
>  
> @@ -4635,11 +4636,13 @@ static int handle_reshape_read_error(struct mddev *mddev,
>   /* couldn't read this block, must give up */
>   set_bit(MD_RECOVERY_INTR,
>   &mddev->recovery);
> + kfree(r10b);
>   return -EIO;
>   }
>   sectors -= s;
>   idx++;
>   }
> + kfree(r10b);
>   return 0;
>  }
>  
> -- 
> 2.14.2.920.gcf0c67979c-goog
> 


[GIT PULL] MD update for 4.14-rc3

2017-09-29 Thread Shaohua Li
Hi,
A few fixes for MD. Mainly fix a problem introduced in 4.13, where in some
situations we retry a bio for some code paths but not all.

Thanks,
Shaohua

The following changes since commit 4a704d6db0ee4a349d5d523813a718e429b4914d:

  Merge tag 'kbuild-fixes-v4.14' of 
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild 
(2017-09-21 06:01:03 -1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.14-rc3

for you to fetch changes up to 7d5d7b5058fbd638914e42504677141a69f43011:

  md/raid5: cap worker count (2017-09-27 20:08:44 -0700)


Shaohua Li (4):
  md: separate request handling
  md: fix a race condition for flush request handling
  dm-raid: fix a race condition in request handling
  md/raid5: cap worker count

 drivers/md/dm-raid.c |  2 +-
 drivers/md/md.c  | 72 +++-
 drivers/md/md.h  |  1 +
 drivers/md/raid5.c   |  7 +++--
 4 files changed, 50 insertions(+), 32 deletions(-)


Re: [PATCH V3 0/4] block: make loop block device cgroup aware

2017-09-25 Thread Shaohua Li
On Thu, Sep 14, 2017 at 02:02:03PM -0700, Shaohua Li wrote:
> From: Shaohua Li <s...@fb.com>
> 
> Hi,
> 
> The IO dispatched to the underlying disk by the loop block device isn't cloned
> from the original bio, so it loses the cgroup information of the original bio.
> Such IO escapes cgroup control. The patches try to address this issue. The
> idea is quite generic, but we currently only make it work for blkcg.

Ping! How do we proceed with this patch set?

Thanks,
Shaohua

> 
> Thanks,
> Shaohua
> 
> V2->V3:
> - Make the API more robust pointed out by Tejun
> 
> V1->V2:
> - Address a couple of issues pointed out by Tejun
> 
> 
> Shaohua Li (4):
>   kthread: add a mechanism to store cgroup info
>   blkcg: delete unused APIs
>   block: make blkcg aware of kthread stored original cgroup info
>   block/loop: make loop cgroup aware
> 
>  block/bio.c| 31 --
>  drivers/block/loop.c   | 13 +
>  drivers/block/loop.h   |  1 +
>  include/linux/bio.h|  2 --
>  include/linux/blk-cgroup.h | 25 +-
>  include/linux/kthread.h| 11 
>  kernel/kthread.c   | 66 --
>  7 files changed, 96 insertions(+), 53 deletions(-)
> 
> -- 
> 2.9.5
> 


Re: MADV_FREE is broken

2017-09-20 Thread Shaohua Li
On Wed, Sep 20, 2017 at 11:01:47AM +0200, Artem Savkov wrote:
> Hi All,
> 
> We recently started noticing madvise09[1] test from ltp failing strangely. The
> test does the following: maps 32 pages, sets MADV_FREE for the range it got,
> dirties 2 of the pages, creates memory pressure and check that nondirty pages
> are free. The test hanged while accessing the last 4 pages(29-32) of madvised
> range at line 121 [2]. Any other process (gdb/cat) accessing those pages
> would also hang as would rebooting the machine. It doesn't trigger any debug
> warnings or kasan.
> 
> The issue bisected to "802a3a92ad7a mm: reclaim MADV_FREE pages" (so 4.12 and
> up are affected).
> 
> I did some poking around and found out that the "bad" pages had SwapBacked
> flag set in shrink_page_list() which confused it a lot. It looks like
> mark_page_lazyfree() only calls lru_lazyfree_fn() when the pagevec is full
> (that is in batches of 14) and never drains the rest (so the last four in
> the madvise09 case).
> 
> The patch below greatly reduces the failure rate, but doesn't fix it
> completely, it still shows up with the same symptoms (hanging trying to access
> last 4 pages) after a bunch of retries.
> 
> [1] 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> [2] 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c#L121
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 21261ff0466f..a0b868e8b7d2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -453,6 +453,7 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
>  
>   tlb_start_vma(tlb, vma);
>   walk_page_range(addr, end, &free_walk);
> + lru_add_drain();
>   tlb_end_vma(tlb, vma);
>  }

Looks like there is a race between clearing the pte dirty bit and clearing the
SwapBacked bit. Draining the pagevec helps a little, but isn't sufficient. If
SwapBacked is set, we could add the page to the swap cache, but since the page
isn't dirty, we don't write the page out. This could cause data corruption.
There is another place where we wrongly clear the SwapBacked bit. Below is a
test patch which seems to fix the issue, please give it a try.
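
For context, the access pattern madvise09 exercises boils down to the
following userspace sequence (a minimal sketch of the test, not the ltp code):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = 32 * psize;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0xaa, len);			/* populate all 32 pages */
	if (madvise(p, len, MADV_FREE))		/* mark them lazily freeable */
		return 1;
	p[0] = 1;				/* re-dirty 2 of the pages */
	p[psize] = 1;
	/*
	 * Under memory pressure the kernel may reclaim the 30 clean pages;
	 * reading them back must return zero-filled pages and the 2
	 * re-dirtied pages must keep their data.  The reported hang
	 * happened while touching the tail pages at this point.
	 */
	printf("tail byte after MADV_FREE: %d\n", p[len - 1]);
	return 0;
}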


diff --git a/mm/swap.c b/mm/swap.c
index 62d96b8..5c58257 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -575,7 +575,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
 {
if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
-   !PageUnevictable(page)) {
+   !PageSwapCache(page) && !PageUnevictable(page)) {
bool active = PageActive(page);
 
del_page_from_lru_list(page, lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711d..be1c98e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -980,6 +980,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+   bool new_added_swapcache = false;
 
cond_resched();
 
@@ -1165,6 +1166,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
/* Adding to swap updated mapping */
mapping = page_mapping(page);
+   new_added_swapcache = true;
}
} else if (unlikely(PageTransHuge(page))) {
/* Split file THP */
@@ -1185,6 +1187,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
nr_unmap_fail++;
goto activate_locked;
}
+   /* race with MADV_FREE */
+   if (PageAnon(page) && !PageDirty(page) &&
+   PageSwapBacked(page) && new_added_swapcache)
+   set_page_dirty(page);
}
 
if (PageDirty(page)) {


[GIT PULL] MD update for 4.14-rc2

2017-09-19 Thread Shaohua Li
Hi,
2 small patches fix long-lived raid5 stripe batch bugs, one from Dennis and the
other from me. Please pull!

Thanks,
Shaohua

The following changes since commit e8a27f836f165c26f867ece7f31eb5c811692319:

  md/bitmap: disable bitmap_resize for file-backed bitmaps. (2017-08-31 
22:57:03 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-next

for you to fetch changes up to 184a09eb9a2fe425e49c9538f1604b05ed33cfef:

  md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list 
(2017-09-05 22:51:13 -0700)


Dennis Yang (1):
  md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list

Shaohua Li (1):
  md/raid5: fix a race condition in stripe batch

 drivers/md/raid5.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)


[PATCH V3 1/4] kthread: add a mechanism to store cgroup info

2017-09-14 Thread Shaohua Li
From: Shaohua Li <s...@fb.com>

kthread usually runs jobs on behalf of other threads. The jobs should be
charged to cgroup of original threads. But the jobs run in a kthread,
where we lose the cgroup context of original threads. The patch adds a
mechanism to record cgroup info of original threads in kthread context.
Later we can retrieve the cgroup info and attach the cgroup info to jobs.

Since this mechanism is only required by kthread, we store the cgroup
info in kthread data instead of generic task_struct.

Signed-off-by: Shaohua Li <s...@fb.com>
---
 include/linux/kthread.h | 11 +
 kernel/kthread.c| 66 +++--
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 82e197e..bd4369c 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -3,6 +3,7 @@
 /* Simple interface for creating and stopping kernel threads without mess. */
 #include <linux/err.h>
 #include <linux/sched.h>
+#include <linux/cgroup.h>
 
 __printf(4, 5)
 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
@@ -198,4 +199,14 @@ bool kthread_cancel_delayed_work_sync(struct kthread_delayed_work *work);
 
 void kthread_destroy_worker(struct kthread_worker *worker);
 
+#ifdef CONFIG_CGROUPS
+void kthread_associate_blkcg(struct cgroup_subsys_state *css);
+struct cgroup_subsys_state *kthread_blkcg(void);
+#else
+static inline void kthread_associate_blkcg(struct cgroup_subsys_state *css) { }
+static inline struct cgroup_subsys_state *kthread_blkcg(void)
+{
+   return NULL;
+}
+#endif
 #endif /* _LINUX_KTHREAD_H */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..a8b4e83 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,7 +20,6 @@
 #include <linux/freezer.h>
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
-#include <linux/cgroup.h>
 #include <trace/events/sched.h>
 
 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -47,6 +46,9 @@ struct kthread {
void *data;
struct completion parked;
struct completion exited;
+#ifdef CONFIG_CGROUPS
+   struct cgroup_subsys_state *blkcg_css;
+#endif
 };
 
 enum KTHREAD_BITS {
@@ -74,11 +76,17 @@ static inline struct kthread *to_kthread(struct task_struct *k)
 
 void free_kthread_struct(struct task_struct *k)
 {
+   struct kthread *kthread;
+
/*
 * Can be NULL if this kthread was created by kernel_thread()
 * or if kmalloc() in kthread() failed.
 */
-   kfree(to_kthread(k));
+   kthread = to_kthread(k);
+#ifdef CONFIG_CGROUPS
+   WARN_ON_ONCE(kthread && kthread->blkcg_css);
+#endif
+   kfree(kthread);
 }
 
 /**
@@ -216,6 +224,9 @@ static int kthread(void *_create)
self->data = data;
init_completion(&self->exited);
init_completion(&self->parked);
+#ifdef CONFIG_CGROUPS
+   self->blkcg_css = NULL;
+#endif
current->vfork_done = &self->exited;
 
/* OK, tell user we're spawned, wait for stop or wakeup */
@@ -1153,3 +1164,54 @@ void kthread_destroy_worker(struct kthread_worker *worker)
kfree(worker);
 }
 EXPORT_SYMBOL(kthread_destroy_worker);
+
+#ifdef CONFIG_CGROUPS
+/**
+ * kthread_associate_blkcg - associate blkcg to current kthread
+ * @css: the cgroup info
+ *
+ * Current thread must be a kthread. The thread is running jobs on behalf of
+ * other threads. In some cases, we expect the jobs to attach cgroup info of
+ * original threads instead of that of current thread. This function stores
+ * original thread's cgroup info in current kthread context for later
+ * retrieval.
+ */
+void kthread_associate_blkcg(struct cgroup_subsys_state *css)
+{
+   struct kthread *kthread;
+
+   if (!(current->flags & PF_KTHREAD))
+   return;
+   kthread = to_kthread(current);
+   if (!kthread)
+   return;
+
+   if (kthread->blkcg_css) {
+   css_put(kthread->blkcg_css);
+   kthread->blkcg_css = NULL;
+   }
+   if (css) {
+   css_get(css);
+   kthread->blkcg_css = css;
+   }
+}
+EXPORT_SYMBOL(kthread_associate_blkcg);
+
+/**
+ * kthread_blkcg - get associated blkcg css of current kthread
+ *
+ * Current thread must be a kthread.
+ */
+struct cgroup_subsys_state *kthread_blkcg(void)
+{
+   struct kthread *kthread;
+
+   if (current->flags & PF_KTHREAD) {
+   kthread = to_kthread(current);
+   if (kthread)
+   return kthread->blkcg_css;
+   }
+   return NULL;
+}
+EXPORT_SYMBOL(kthread_blkcg);
+#endif
-- 
2.9.5
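
As a usage sketch, a kthread that issues IO on behalf of another task could
bracket the work like this (illustrative only; work_item, css and
do_backing_file_io() are stand-ins for what the loop patch in this series
actually does):

#include <linux/cgroup.h>
#include <linux/kthread.h>

struct work_item {
	struct cgroup_subsys_state *css;	/* saved from the submitter */
	/* ... description of the IO to issue ... */
};

static void do_backing_file_io(struct work_item *cmd);

/* Runs in a kthread; issues IO on behalf of the task that queued cmd. */
static void loop_worker(struct work_item *cmd)
{
	kthread_associate_blkcg(cmd->css);	/* takes its own css ref */
	do_backing_file_io(cmd);		/* now accounted to cmd->css */
	kthread_associate_blkcg(NULL);		/* drop the association */
}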



[PATCH V3 0/4] block: make loop block device cgroup aware

2017-09-14 Thread Shaohua Li
From: Shaohua Li <s...@fb.com>

Hi,

The IO dispatched to the underlying disk by the loop block device isn't cloned
from the original bio, so it loses the cgroup information of the original bio.
Such IO escapes cgroup control. The patches try to address this issue. The idea
is quite generic, but we currently only make it work for blkcg.

Thanks,
Shaohua

V2->V3:
- Make the API more robust pointed out by Tejun

V1->V2:
- Address a couple of issues pointed out by Tejun


Shaohua Li (4):
  kthread: add a mechanism to store cgroup info
  blkcg: delete unused APIs
  block: make blkcg aware of kthread stored original cgroup info
  block/loop: make loop cgroup aware

 block/bio.c| 31 --
 drivers/block/loop.c   | 13 +
 drivers/block/loop.h   |  1 +
 include/linux/bio.h|  2 --
 include/linux/blk-cgroup.h | 25 +-
 include/linux/kthread.h| 11 
 kernel/kthread.c   | 66 --
 7 files changed, 96 insertions(+), 53 deletions(-)

-- 
2.9.5



[PATCH V3 2/4] blkcg: delete unused APIs

2017-09-14 Thread Shaohua Li
From: Shaohua Li <s...@fb.com>

Nobody uses the APIs right now.

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Shaohua Li <s...@fb.com>
---
 block/bio.c| 31 ---
 include/linux/bio.h|  2 --
 include/linux/blk-cgroup.h | 12 
 3 files changed, 45 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6745759..9271fa3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2033,37 +2033,6 @@ int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
 EXPORT_SYMBOL_GPL(bio_associate_blkcg);
 
 /**
- * bio_associate_current - associate a bio with %current
- * @bio: target bio
- *
- * Associate @bio with %current if it hasn't been associated yet.  Block
- * layer will treat @bio as if it were issued by %current no matter which
- * task actually issues it.
- *
- * This function takes an extra reference of @task's io_context and blkcg
- * which will be put when @bio is released.  The caller must own @bio,
- * ensure %current->io_context exists, and is responsible for synchronizing
- * calls to this function.
- */
-int bio_associate_current(struct bio *bio)
-{
-   struct io_context *ioc;
-
-   if (bio->bi_css)
-   return -EBUSY;
-
-   ioc = current->io_context;
-   if (!ioc)
-   return -ENOENT;
-
-   get_io_context_active(ioc);
-   bio->bi_ioc = ioc;
-   bio->bi_css = task_get_css(current, io_cgrp_id);
-   return 0;
-}
-EXPORT_SYMBOL_GPL(bio_associate_current);
-
-/**
  * bio_disassociate_task - undo bio_associate_current()
  * @bio: target bio
  */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a8fe793..d795cdd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -514,13 +514,11 @@ do {  \
 
 #ifdef CONFIG_BLK_CGROUP
 int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
-int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 void bio_clone_blkcg_association(struct bio *dst, struct bio *src);
 #else  /* CONFIG_BLK_CGROUP */
 static inline int bio_associate_blkcg(struct bio *bio,
struct cgroup_subsys_state *blkcg_css) { return 0; }
-static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
 static inline void bio_disassociate_task(struct bio *bio) { }
 static inline void bio_clone_blkcg_association(struct bio *dst,
struct bio *src) { }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 9d92153..0cfa8d2 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -235,12 +235,6 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return task_blkcg(current);
 }
 
-static inline struct cgroup_subsys_state *
-task_get_blkcg_css(struct task_struct *task)
-{
-   return task_get_css(task, io_cgrp_id);
-}
-
 /**
  * blkcg_parent - get the parent of a blkcg
  * @blkcg: blkcg of interest
@@ -735,12 +729,6 @@ struct blkcg_policy {
 
 #define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
 
-static inline struct cgroup_subsys_state *
-task_get_blkcg_css(struct task_struct *task)
-{
-   return NULL;
-}
-
 #ifdef CONFIG_BLOCK
 
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
-- 
2.9.5



[PATCH V3 4/4] block/loop: make loop cgroup aware

2017-09-14 Thread Shaohua Li
From: Shaohua Li <s...@fb.com>

The loop block device handles IO in a separate thread. The IO it actually
dispatches isn't cloned from the IO the loop device received, so the
dispatched IO loses the original cgroup context.

I'm ignoring the buffered IO case for now, which is quite complicated.
Making the loop thread aware of the cgroup context doesn't really help
there: the loop device only writes to a single file, and in the current
writeback cgroup implementation a file can only belong to one cgroup.

For the direct IO case, we could work around the issue in theory. For
example, say we assign the loop device 5MB/s of bandwidth in cgroup1 and
10MB/s in cgroup2. We could then create a special cgroup for the loop
thread and give it at least 15MB/s on the underlying disk, which would
throttle the two cgroups correctly. But this is tricky to set up.

This patch addresses the issue for direct IO. We record the bio's css in
the loop command. When the loop thread handles the command, it uses the
API introduced in patch 1 to set the css for the current task, and the
block layer then applies that css to the new IO (from patch 3).
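
Condensed, the flow across the hunks below is roughly (a paraphrased
sketch of this patch, with error handling and the non-AIO path omitted):

/* Submission (loop_queue_rq): remember the first bio's css. */
cmd->css = NULL;
if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
	cmd->css = cmd->rq->bio->bi_css;
	css_get(cmd->css);			/* held until completion */
}

/* Worker (lo_rw_aio): dispatch the IO under the recorded css. */
if (cmd->css)
	kthread_associate_blkcg(cmd->css);	/* new IO inherits the css */
/* ... call_read_iter()/call_write_iter() ... */
kthread_associate_blkcg(NULL);

/* Completion (lo_rw_aio_complete): drop the reference. */
if (cmd->css)
	css_put(cmd->css);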

Acked-by: Tejun Heo <t...@kernel.org>
Signed-off-by: Shaohua Li <s...@fb.com>
---
 drivers/block/loop.c | 13 +
 drivers/block/loop.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8934e25..37c4da7 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -479,6 +479,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
+   if (cmd->css)
+   css_put(cmd->css);
cmd->ret = ret > 0 ? 0 : ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -538,6 +540,8 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_filp = file;
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
+   if (cmd->css)
+   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -545,6 +549,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
 
lo_rw_aio_do_completion(cmd);
+   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -1689,6 +1694,14 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
break;
}
 
+   /* always use the first bio's css */
+#ifdef CONFIG_CGROUPS
+   if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
+   cmd->css = cmd->rq->bio->bi_css;
+   css_get(cmd->css);
+   } else
+#endif
+   cmd->css = NULL;
kthread_queue_work(&lo->worker, &cmd->work);
 
return BLK_STS_OK;
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index f68c1d5..d93b669 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -74,6 +74,7 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
+   struct cgroup_subsys_state *css;
 };
 
 /* Support for loadable transfer modules */
-- 
2.9.5


