Disk size update on first open

2019-10-18 Thread Jan Kara
Hello,

I have been debugging weird failures when using encrypted DVDs (bko#194965
for the interested) but in the end it all boils down to the fact that
__blkdev_get() updates i_size of the bdev inode only on the first open. This
seems like a sensible thing to do but it leads to some weird behavior for
devices with removable media:

1) If someone has the device (such as /dev/sr0) open while the media is
being inserted, the bdev size will not get updated. The media is then
accessible but the device size is wrong, resulting in weird and
hard-to-debug failures.

2) This is especially annoying when pktcdvd is in the game, as the pktcdvd
device holds the corresponding sr device permanently open.

Upon some inspection this seems to be an issue with how check_disk_change()
(called from sr_block_open()) interacts with __blkdev_get(). If partition
scanning is enabled, check_disk_change() will call flush_disk(), which sets
bdev->bd_invalidated. And __blkdev_get(), seeing bd_invalidated, will call
rescan_partitions(), which ends up updating the bdev size through
check_disk_size_change(). But without partitioning none of this happens and
the disk size remains stale.
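
Condensed, the asymmetry looks like this (a simplified sketch of the two
functions involved, based on the patches below; not the complete code):

static void flush_disk(struct block_device *bdev, bool kill_dirty)
{
	...
	/* Unpartitioned devices never get bd_invalidated set... */
	if (disk_part_scan_enabled(bdev->bd_disk))
		bdev->bd_invalidated = 1;
}

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	...
	/* ...so this rescan - which also updates the bdev size via
	 * check_disk_size_change() - is skipped for them: */
	if (bdev->bd_invalidated) {
		if (!ret)
			rescan_partitions(disk, bdev);
		else if (ret == -ENOMEDIUM)
			invalidate_partitions(disk, bdev);
	}
	...
}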

Now it seems strange that partitioned and unpartitioned devices behave
differently. So I'd be inclined to just unify the behavior and use
bd_invalidated for unpartitioned devices as well. Does anyone see a problem
with that?

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 0/2] bdev: Refresh bdev size for disks without partitioning

2019-10-21 Thread Jan Kara
Hello,

For quite a long time I've been debugging strange issues with encrypted DVDs
not being readable on Linux (bko#194965). In the end I've tracked the
problem down to the fact that the block device size is not updated when the
media is inserted if the DVD device is already open. This is IMO a bug in
the block device code, as the size does get properly updated when the device
has partition scanning enabled.  The following series fixes the problem by
refreshing the disk size on each open, even for devices with partition
scanning disabled.

Honza


[PATCH 1/2] bdev: Factor out bdev revalidation into a common helper

2019-10-21 Thread Jan Kara
Factor out code handling revalidation of bdev on disk change into a
common helper.

Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9c073dbdc1b0..88c6d35ec71d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1512,6 +1512,14 @@ EXPORT_SYMBOL(bd_set_size);
 
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
 
+static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
+{
+   if (invalidate)
+   invalidate_partitions(bdev->bd_disk, bdev);
+   else
+   rescan_partitions(bdev->bd_disk, bdev);
+}
+
 /*
  * bd_mutex locking:
  *
@@ -1594,12 +1602,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 * The latter is necessary to prevent ghost
 * partitions on a removed medium.
 */
-   if (bdev->bd_invalidated) {
-   if (!ret)
-   rescan_partitions(disk, bdev);
-   else if (ret == -ENOMEDIUM)
-   invalidate_partitions(disk, bdev);
-   }
+   if (bdev->bd_invalidated &&
+   (!ret || ret == -ENOMEDIUM))
+   bdev_disk_changed(bdev, ret == -ENOMEDIUM);
 
if (ret)
goto out_clear;
@@ -1632,12 +1637,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (bdev->bd_disk->fops->open)
ret = bdev->bd_disk->fops->open(bdev, mode);
/* the same as first opener case, read comment there */
-   if (bdev->bd_invalidated) {
-   if (!ret)
-   rescan_partitions(bdev->bd_disk, bdev);
-   else if (ret == -ENOMEDIUM)
-   invalidate_partitions(bdev->bd_disk, bdev);
-   }
+   if (bdev->bd_invalidated &&
+   (!ret || ret == -ENOMEDIUM))
+   bdev_disk_changed(bdev, ret == -ENOMEDIUM);
if (ret)
goto out_unlock_bdev;
}
-- 
2.16.4



[PATCH 2/2] bdev: Refresh bdev size for disks without partitioning

2019-10-21 Thread Jan Kara
Currently, the block device size is not updated on second and subsequent
opens for block devices where partition scanning is disabled. This is
particularly annoying for example for DVD drives, as it means the block
device size does not get updated once the media is inserted into a drive
if the device is already open when inserting the media. This is actually
always the case for example when pktcdvd is in use.

Fix the problem by revalidating block device size on every open even for
devices with partition scan disabled.

Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 88c6d35ec71d..d612468ee66b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1403,11 +1403,7 @@ static void flush_disk(struct block_device *bdev, bool kill_dirty)
   "resized disk %s\n",
   bdev->bd_disk ? bdev->bd_disk->disk_name : "");
}
-
-   if (!bdev->bd_disk)
-   return;
-   if (disk_part_scan_enabled(bdev->bd_disk))
-   bdev->bd_invalidated = 1;
+   bdev->bd_invalidated = 1;
 }
 
 /**
@@ -1514,10 +1510,15 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
 
 static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
 {
-   if (invalidate)
-   invalidate_partitions(bdev->bd_disk, bdev);
-   else
-   rescan_partitions(bdev->bd_disk, bdev);
+   if (disk_part_scan_enabled(bdev->bd_disk)) {
+   if (invalidate)
+   invalidate_partitions(bdev->bd_disk, bdev);
+   else
+   rescan_partitions(bdev->bd_disk, bdev);
+   } else {
+   check_disk_size_change(bdev->bd_disk, bdev, !invalidate);
+   bdev->bd_invalidated = 0;
+   }
 }
 
 /*
-- 
2.16.4



Re: [PATCH 2/2] bdev: Refresh bdev size for disks without partitioning

2019-10-21 Thread Jan Kara
On Mon 21-10-19 10:49:54, Guoqing Jiang wrote:
> 
> 
> On 10/21/19 10:38 AM, Jan Kara wrote:
> > Currently, block device size is not updated on second and further open
> > for block devices where partition scan is disabled. This is particularly
> > annoying for example for DVD drives as that means block device size does
> > not get updated once the media is inserted into a drive if the device is
> > already open when inserting the media. This is actually always the case
> > for example when pktcdvd is in use.
> > 
> > Fix the problem by revalidating block device size on every open even for
> > devices with partition scan disabled.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >   fs/block_dev.c | 19 ++-
> >   1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 88c6d35ec71d..d612468ee66b 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1403,11 +1403,7 @@ static void flush_disk(struct block_device *bdev, bool kill_dirty)
> >"resized disk %s\n",
> >bdev->bd_disk ? bdev->bd_disk->disk_name : "");
> > }
> > -
> > -   if (!bdev->bd_disk)
> > -   return;
> > -   if (disk_part_scan_enabled(bdev->bd_disk))
> > -   bdev->bd_invalidated = 1;
> > +   bdev->bd_invalidated = 1;
> >   }
> >   /**
> > @@ -1514,10 +1510,15 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
> >   static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
> >   {
> > -   if (invalidate)
> > -   invalidate_partitions(bdev->bd_disk, bdev);
> > -   else
> > -   rescan_partitions(bdev->bd_disk, bdev);
> > +   if (disk_part_scan_enabled(bdev->bd_disk)) {
> > +   if (invalidate)
> > +   invalidate_partitions(bdev->bd_disk, bdev);
> > +   else
> > +   rescan_partitions(bdev->bd_disk, bdev);
> 
> Maybe use the new common helper to replace above.

What do you mean exactly? This is the only place that has this code
pattern...

Honza

> > +   } else {
> > +   check_disk_size_change(bdev->bd_disk, bdev, !invalidate);
> > +   bdev->bd_invalidated = 0;
> > +   }
> >   }
> >   /*
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] bdev: Refresh bdev size for disks without partitioning

2019-10-21 Thread Jan Kara
On Mon 21-10-19 11:27:36, Guoqing Jiang wrote:
> 
> 
> On 10/21/19 11:12 AM, Jan Kara wrote:
> > On Mon 21-10-19 10:49:54, Guoqing Jiang wrote:
> > > 
> > > On 10/21/19 10:38 AM, Jan Kara wrote:
> > > > Currently, block device size is not updated on second and further open
> > > > for block devices where partition scan is disabled. This is particularly
> > > > annoying for example for DVD drives as that means block device size does
> > > > not get updated once the media is inserted into a drive if the device is
> > > > already open when inserting the media. This is actually always the case
> > > > for example when pktcdvd is in use.
> > > > 
> > > > Fix the problem by revalidating block device size on every open even for
> > > > devices with partition scan disabled.
> > > > 
> > > > Signed-off-by: Jan Kara 
> > > > ---
> > > >fs/block_dev.c | 19 ++-
> > > >1 file changed, 10 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 88c6d35ec71d..d612468ee66b 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -1403,11 +1403,7 @@ static void flush_disk(struct block_device *bdev, bool kill_dirty)
> > > >"resized disk %s\n",
> > > >bdev->bd_disk ? bdev->bd_disk->disk_name : "");
> > > > }
> > > > -
> > > > -   if (!bdev->bd_disk)
> > > > -   return;
> > > > -   if (disk_part_scan_enabled(bdev->bd_disk))
> > > > -   bdev->bd_invalidated = 1;
> > > > +   bdev->bd_invalidated = 1;
> > > >}
> > > >/**
> > > > @@ -1514,10 +1510,15 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
> > > >static void bdev_disk_changed(struct block_device *bdev, bool 
> > > > invalidate)
> > > >{
> > > > -   if (invalidate)
> > > > -   invalidate_partitions(bdev->bd_disk, bdev);
> > > > -   else
> > > > -   rescan_partitions(bdev->bd_disk, bdev);
> > > > +   if (disk_part_scan_enabled(bdev->bd_disk)) {
> > > > +   if (invalidate)
> > > > +   invalidate_partitions(bdev->bd_disk, bdev);
> > > > +   else
> > > > +   rescan_partitions(bdev->bd_disk, bdev);
> > > Maybe use the new common helper to replace above.
> > What do you mean exactly? Because there's only this place that has the code
> > pattern here...
> 
> The above looks the same as the new function.
> 
> +static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
> +{
> + if (invalidate)
> + invalidate_partitions(bdev->bd_disk, bdev);
> + else
> + rescan_partitions(bdev->bd_disk, bdev);
> +}
> 
> So maybe just call it in the 'if' branch, or am I misunderstanding something?

I think you misparsed the diff ;) The code that you mention as duplicated
just gets reindented by the patch.

> static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
> {
>   if (disk_part_scan_enabled(bdev->bd_disk))  
>   bdev_disk_changed(bdev, invalidate)
>   else {
>   check_disk_size_change(bdev->bd_disk, bdev, !invalidate);
>   bdev->bd_invalidated = 0;
>   }
> }

This has infinite recursion in it - you call bdev_disk_changed() from
bdev_disk_changed().

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] bdev: Factor out bdev revalidation into a common helper

2019-10-22 Thread Jan Kara
On Tue 22-10-19 07:58:08, Damien Le Moal wrote:
> On 2019/10/21 17:38, Jan Kara wrote:
> > Factor out code handling revalidation of bdev on disk change into a
> > common helper.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  fs/block_dev.c | 26 ++
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 9c073dbdc1b0..88c6d35ec71d 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1512,6 +1512,14 @@ EXPORT_SYMBOL(bd_set_size);
> >  
> >  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
> >  
> > +static void bdev_disk_changed(struct block_device *bdev, bool invalidate)
> > +{
> > +   if (invalidate)
> > +   invalidate_partitions(bdev->bd_disk, bdev);
> > +   else
> > +   rescan_partitions(bdev->bd_disk, bdev);
> > +}
> > +
> >  /*
> >   * bd_mutex locking:
> >   *
> > @@ -1594,12 +1602,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> >  * The latter is necessary to prevent ghost
> >  * partitions on a removed medium.
> >  */
> > -   if (bdev->bd_invalidated) {
> > -   if (!ret)
> > -   rescan_partitions(disk, bdev);
> > -   else if (ret == -ENOMEDIUM)
> > -   invalidate_partitions(disk, bdev);
> > -   }
> > +   if (bdev->bd_invalidated &&
> > +   (!ret || ret == -ENOMEDIUM))
> > +   bdev_disk_changed(bdev, ret == -ENOMEDIUM);
> 
> This is a little confusing since from its name, bdev_disk_changed() seems
> to imply that a new disk is present, but this is called only if bdev is
> invalidated... What about calling this simply bdev_revalidate_disk(),
> since rescan_partitions() will call the disk's revalidate method.

Honestly, the whole disk revalidation code is confusing to me :) I had to
draw a graph of which function calls which to understand what's going on in
that code and I think it could really use some refactoring. But that's
beside the current point :)

Your "only if bdev is invalidated" is true but not actually a full story.
->bd_invalidated effectively gets set only through check_disk_change(). All
other places that end up calling flush_disk() clear bd_invalidated shortly
afterwards. So the function I've created is a direct counterpart to
check_disk_change() that just needs to happen after the device is
successfully open. I wanted to express that in the name - hence
bdev_disk_changed(). So yes, bdev_disk_changed() should be called exactly
when the new disk is present. It is bd_invalidated that is actually
misnamed.

Now I'm not really tied to bdev_disk_changed(). bdev_revalidate_disk()
seems a bit confusing to me though because the disk has actually been
already revalidated in check_disk_change() and the function won't
revalidate the disk for devices with partition scan disabled.

> Also, it seems to me that this function could be used without the
> complex "if ()" condition by slightly modifying it:
> 
> static void bdev_revalidate_disk(struct block_device *bdev,
>bool invalidate)
> {
>   if (bdev->bd_invalidated && invalidate)
>   invalidate_partitions(bdev->bd_disk, bdev);
>   else
>   rescan_partitions(bdev->bd_disk, bdev);
> }
> 
> Otherwise, this all looks fine to me.

Well, but you don't want to call rescan_partitions() if bd_invalidated is
not set. But yes, we could move the bd_invalidated check into
bdev_disk_changed(), but then it would seem less clear why the function is
getting called. This ties somewhat into the discussion above. Hum, I guess
if we call the function just bdev_revalidate(), the name won't be confusing
and it would then make sense to move the bd_invalidated condition in there.
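
For concreteness, a sketch of that variant (hypothetical - just the shape
being discussed here, not a tested patch):

static void bdev_revalidate(struct block_device *bdev, int open_ret)
{
	if (!bdev->bd_invalidated)
		return;
	if (open_ret == -ENOMEDIUM)
		invalidate_partitions(bdev->bd_disk, bdev);
	else
		rescan_partitions(bdev->bd_disk, bdev);
}

Both call sites in __blkdev_get() would then reduce to:

	if (!ret || ret == -ENOMEDIUM)
		bdev_revalidate(bdev, ret);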

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv1, RFC 00/33] ext4: support of huge pages

2016-07-27 Thread Jan Kara
On Tue 26-07-16 22:12:12, Kirill A. Shutemov wrote:
> On Tue, Jul 26, 2016 at 01:29:38PM -0400, Theodore Ts'o wrote:
> > On Tue, Jul 26, 2016 at 03:35:02AM +0300, Kirill A. Shutemov wrote:
> > > Here's the first version of my patchset which is intended to bring huge pages
> > > to ext4. It's not yet ready for applying or serious use, but good enough
> > > to show the approach.
> > 
> > Thanks.  The major issues I noticed when doing a quick scan of the
> > patches you've already mentioned here.  I'll try to take a closer look
> > in the next week or so when I have time.
> 
> Thanks.
> 
> > One random question --- in the huge=always approach, how much
> > additional work would be needed to support file systems with a 64k
> > block size on a system with 4k pages?
> 
> I think it's totally different story.
> 
> Here I have block size smaller than page size and it's not new to the
> filesystem -- similar to 1k block size with 4k page size. So I was able to
> re-use most of infrastructure to handle the situation.
> 
> Block size bigger than page size is backward task. I don't think I know
> enough to understand how hard it would be. I guess not easy. :)

I think Ted wanted to ask: when you always have huge pages in the page
cache, a block size of 64k is smaller than the page size of the page cache,
so there is a chance it could work. Or is there anything which still exposes
the fact that the actual pages are 4k even in the huge=always case?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 14/41] filemap: allocate huge page in page_cache_read(), if allowed

2016-10-11 Thread Jan Kara
On Thu 15-09-16 14:54:56, Kirill A. Shutemov wrote:
> This patch adds basic functionality to put huge page into page cache.
> 
> At the moment we only put huge pages into radix-tree if the range covered
> by the huge page is empty.
> 
> We ignore shadow entries for now, just remove them from the tree before
> inserting huge page.
> 
> Later we can add logic to accumulate information from shadow entries to
> return to caller (average eviction time?).
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  include/linux/fs.h  |   5 ++
>  include/linux/pagemap.h |  21 ++-
>  mm/filemap.c| 148 +++-
>  3 files changed, 157 insertions(+), 17 deletions(-)
> 
...
> @@ -663,16 +663,55 @@ static int __add_to_page_cache_locked(struct page *page,
>   page->index = offset;
>  
>   spin_lock_irq(&mapping->tree_lock);
> - error = page_cache_tree_insert(mapping, page, shadowp);
> + if (PageTransHuge(page)) {
> + struct radix_tree_iter iter;
> + void **slot;
> + void *p;
> +
> + error = 0;
> +
> + /* Wipe shadow entries */
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, offset) {
> + if (iter.index >= offset + HPAGE_PMD_NR)
> + break;
> +
> + p = radix_tree_deref_slot_protected(slot,
> + &mapping->tree_lock);
> + if (!p)
> + continue;
> +
> + if (!radix_tree_exception(p)) {
> + error = -EEXIST;
> + break;
> + }
> +
> + mapping->nrexceptional--;
> + rcu_assign_pointer(*slot, NULL);

I think you also need something like workingset_node_shadows_dec(node)
here. It would be even better if you used something like
clear_exceptional_entry() to have the logic in one place (you obviously
need to factor out only part of clear_exceptional_entry() first).

> + }
> +
> + if (!error)
> + error = __radix_tree_insert(&mapping->page_tree, offset,
> + compound_order(page), page);
> +
> + if (!error) {
> + count_vm_event(THP_FILE_ALLOC);
> + mapping->nrpages += HPAGE_PMD_NR;
> + *shadowp = NULL;
> + __inc_node_page_state(page, NR_FILE_THPS);
> + }
> + } else {
> + error = page_cache_tree_insert(mapping, page, shadowp);
> + }

And I'd prefer to have this logic moved to page_cache_tree_insert() because
logically it IMHO belongs there - it is simply another case of handling the
radix tree used for the page cache.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 11/41] thp: try to free page's buffers before attempt split

2016-10-11 Thread Jan Kara
On Thu 15-09-16 14:54:53, Kirill A. Shutemov wrote:
> We want the page to be isolated from the rest of the system before
> splitting it. We rely on the page count being 2 for file pages to make
> sure nobody uses the page: one pin for the caller, one for the radix-tree.
> 
> Filesystems with backing storage can have the page count increased if the
> page has buffers.
> 
> Let's try to free them before attempting the split. And remove one
> guarding VM_BUG_ON_PAGE().
> 
> Signed-off-by: Kirill A. Shutemov 
...
> @@ -2041,6 +2041,23 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>   goto out;
>   }
>  
> + /* Try to free buffers before attempt split */
> + if (!PageSwapBacked(head) && PagePrivate(page)) {
> + /*
> +  * We cannot trigger writeback from here due to possible
> +  * recursion if triggered from vmscan, only wait.
> +  *
> +  * Caller can trigger writeback on its own, if safe.
> +  */
> + wait_on_page_writeback(head);
> +
> + if (page_has_buffers(head) &&
> + !try_to_free_buffers(head)) {
> + ret = -EBUSY;
> + goto out;
> + }

Shouldn't you rather use try_to_release_page() here? Filesystems have
their ->releasepage() callbacks for freeing data associated with a page,
and it is not guaranteed that page private data are buffers, although it is
true for ext4...
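
Something like this, I mean (only a sketch - the gfp mask would need more
thought since we may be called from reclaim):

	if (!PageSwapBacked(head) && page_has_private(head)) {
		wait_on_page_writeback(head);
		/* Let the fs ->releasepage() free whatever private data
		 * is attached - not necessarily buffer heads. */
		if (!try_to_release_page(head, GFP_NOFS)) {
			ret = -EBUSY;
			goto out;
		}
	}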

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 12/41] thp: handle write-protection faults for file THP

2016-10-11 Thread Jan Kara
On Thu 15-09-16 14:54:54, Kirill A. Shutemov wrote:
> For filesystems that want to be write-notified (have mkwrite), we will
> encounter write-protection faults for huge PMDs in shared mappings.
> 
> The easiest way to handle them is to clear the PMD and let it refault as
> writable.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  mm/memory.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 83be99d9d8a1..aad8d5c6311f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3451,8 +3451,17 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
>   return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
>   fe->flags);
>  
> + if (fe->vma->vm_flags & VM_SHARED) {
> + /* Clear PMD */
> + zap_page_range_single(fe->vma, fe->address,
> + HPAGE_PMD_SIZE, NULL);
> + VM_BUG_ON(!pmd_none(*fe->pmd));
> +
> + /* Refault to establish writable PMD */
> + return 0;
> + }
> +

Since we want to write-protect the page table entry on each page writeback
and write-enable them on the next write, this is relatively expensive.
Would it be that complicated to handle this fully in the ->pmd_fault handler
like we do for DAX?

Maybe it doesn't have to be done now, but longer term I guess it might make
sense.

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara 

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

2016-10-11 Thread Jan Kara
On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> invalidate_inode_page() has an expectation about the page_count() of the
> page -- if it's not 2 (one for the caller, one for the radix-tree), the
> page will not be dropped. That condition is almost never met for THPs --
> tail pages are pinned to the pagevec.
> 
> Let's drop them before calling invalidate_inode_page().
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  mm/truncate.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index a01cce450a26..ce904e4b1708 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>   /* 'end' is in the middle of THP */
>   if (index ==  round_down(end, HPAGE_PMD_NR))
>   continue;
> + /*
> +  * invalidate_inode_page() expects
> +  * page_count(page) == 2 to drop page from page
> +  * cache -- drop tail pages references.
> +  */
> + get_page(page);
> + pagevec_release(&pvec);

I'm not quite sure why this is needed. When you have a multiorder entry in
the radix tree for your huge page, then you should not get more entries in
the pagevec for your huge page. What am I missing?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

2016-10-11 Thread Jan Kara
On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > invalidate_inode_page() has expectation about page_count() of the page
> > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > dropped. That condition almost never met for THPs -- tail pages are
> > > pinned to the pagevec.
> > > 
> > > Let's drop them, before calling invalidate_inode_page().
> > > 
> > > Signed-off-by: Kirill A. Shutemov 
> > > ---
> > >  mm/truncate.c | 11 +++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index a01cce450a26..ce904e4b1708 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > >   /* 'end' is in the middle of THP */
> > >   if (index ==  round_down(end, HPAGE_PMD_NR))
> > >   continue;
> > > + /*
> > > +  * invalidate_inode_page() expects
> > > +  * page_count(page) == 2 to drop page from page
> > > +  * cache -- drop tail pages references.
> > > +  */
> > > + get_page(page);
> > > + pagevec_release(&pvec);
> > 
> > I'm not quite sure why this is needed. When you have multiorder entry in
> > the radix tree for your huge page, then you should not get more entries in
> > the pagevec for your huge page. What do I miss?
> 
> For compatibility reasons find_get_entries() (which is called by
> pagevec_lookup_entries()) collects all subpages of a huge page in the
> range (head/tails). See patch [07/41].
> 
> So a huge page which is fully in the range will be pinned up to
> PAGEVEC_SIZE times.

Yeah, I see. But then won't it be cleaner to provide an iteration method
that adds each radix tree entry (regardless of its order) to the pagevec
only once, and then use it in the places where we care, instead of strange
dances like you do here?

Ultimately we could convert all the places to use these new iteration
methods but I don't see that as immediately necessary, and maybe there are
places where getting all the subpages in the pagevec actually makes life
simpler for us (please point me to such a place if you know of one).
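
Something like the following inside the lookup loop is what I have in mind
(hypothetical sketch, assuming the lookup hands back head pages):

	page = compound_head(page);
	/* ... process the whole compound page once ... */
	/* then skip all of its tail indices: */
	index = page->index + (1UL << compound_order(page));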

On a somewhat unrelated note: I've noticed that you don't invalidate a huge
page when only part of it should be invalidated. That actually breaks some
assumptions filesystems make. In particular the direct IO code assumes that
if you do

filemap_write_and_wait_range(inode->i_mapping, start, end);
invalidate_inode_pages2_range(inode->i_mapping, start, end);

all the page cache covering start-end *will* be invalidated. Your skipping
of partial pages breaks this assumption and thus can cause consistency
issues (e.g. a write done using direct IO won't be seen by a following
buffered read).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-13 Thread Jan Kara
On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> We write back the whole huge page at a time.

This is one of the things I don't understand. Firstly, I didn't see where
changes to writeback like this would happen (maybe they come later).
Secondly, I'm not sure why e.g. writeback should behave atomically wrt huge
pages. Is this because the radix-tree multiorder entry tracks dirtiness for
us at that granularity? BTW, can you also explain why we need multiorder
entries? What do they solve for us?

I'm sorry for these basic questions but I'd just like to understand how this
is supposed to work...

Honza


> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  mm/filemap.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 05b42d3e5ed8..53da93156e60 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -372,9 +372,14 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
>   if (page->index > end)
>   continue;
>  
> + page = compound_head(page);
>   wait_on_page_writeback(page);
>   if (TestClearPageError(page))
>   ret = -EIO;
> + if (PageTransHuge(page)) {
> + index = page->index + HPAGE_PMD_NR;
> + i += index - pvec.pages[i]->index - 1;
> + }
>       }
>   pagevec_release(&pvec);
>   cond_resched();
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-10-13 Thread Jan Kara
On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> Most of the work happens on the head page. Only when we need to copy
> data to userspace do we find the relevant subpage.
> 
> We are still limited to PAGE_SIZE per iteration. Lifting this limitation
> would require some more work.

Hum, I'm kind of lost. Can you point me to some design document / email
that would explain some high level ideas of how huge pages in the page
cache are supposed to work? When are we supposed to operate on the head
page and when on a subpage? What is protected by the page lock of the head
page? Do page locks of subpages play any role? If I understand right, e.g.
pagecache_get_page() will return subpages, but is it generally safe to
operate on subpages individually or do we have to be aware that they are
part of a huge page?

If I understand the motivation right, it is mostly about being able to mmap
PMD-sized chunks to userspace. So my naive idea would be that we could just
implement it by allocating PMD-sized chunks of pages when adding pages to
the page cache; we don't even have to read them all unless we come from the
PMD fault path (see the sketch below). Reclaim may need to be aware not to
split pages unnecessarily but that's about it. So I'd like to understand
what's wrong with this naive idea and why filesystems need to be aware that
someone wants to map in PMD sized chunks...
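
To sketch what I mean (purely illustrative - error handling and gfp details
omitted, index assumed PMD-aligned, placement of the helper made up):

	/* Populate an aligned PMD-sized range of the page cache with
	 * HPAGE_PMD_NR independent 4k pages backed by one physically
	 * contiguous allocation. */
	struct page *chunk = alloc_pages(gfp, HPAGE_PMD_ORDER);
	int i;

	if (chunk) {
		split_page(chunk, HPAGE_PMD_ORDER);	/* 512 separate pages */
		for (i = 0; i < HPAGE_PMD_NR; i++)
			add_to_page_cache_lru(chunk + i, mapping,
					      index + i, gfp);
	}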

Honza
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  mm/filemap.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 50afe17230e7..b77bcf6843ee 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1860,6 +1860,7 @@ find_page:
>   if (unlikely(page == NULL))
>   goto no_cached_page;
>   }
> + page = compound_head(page);
>   if (PageReadahead(page)) {
>   page_cache_async_readahead(mapping,
>   ra, filp, page,
> @@ -1936,7 +1937,8 @@ page_ok:
>* now we can copy it to user space...
>*/
>  
> - ret = copy_page_to_iter(page, offset, nr, iter);
> + ret = copy_page_to_iter(page + index - page->index, offset,
> + nr, iter);
>   offset += ret;
>   index += offset >> PAGE_SHIFT;
>   offset &= ~PAGE_MASK;
> @@ -2356,6 +2358,7 @@ page_not_uptodate:
>* because there really aren't any performance issues here
>* and we need to check for errors.
>*/
> + page = compound_head(page);
>   ClearPageError(page);
>   error = mapping->a_ops->readpage(file, page);
>   if (!error) {
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-13 Thread Jan Kara
On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > We writeback whole huge page a time.
> > 
> > This is one of the things I don't understand. Firstly I didn't see where
> > changes of writeback like this would happen (maybe they come later).
> > Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> > pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> > at that granularity?
> 
> We track dirty/writeback on per-compound pages: meaning we have one
> dirty/writeback flag for whole compound page, not on every individual
> 4k subpage. The same story for radix-tree tags.
> 
> > BTW, can you also explain why do we need multiorder entries? What do
> > they solve for us?
> 
> It helps us have a coherent view of the tags in the radix-tree: no matter
> which index we use from the range the huge page covers, we will get the
> same answer on which tags are set.

OK, I understand that. But why do we need a coherent view? For which
purposes exactly do we care that it is not just a bunch of 4k pages that
happen to be physically contiguous and thus can be mapped by one PMD?

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 2/2] brd: Switch rd_size to unsigned long

2016-10-25 Thread Jan Kara
Currently rd_size is an int, which leads to overflow and a bogus device
size once the requested ramdisk size is 1 TB or more. Although these days
a 1 TB ramdisk is mostly a mistake, the days when it will be useful are
not far off.
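
To illustrate the overflow (assuming the usual set_capacity(disk,
rd_size * 2) computation in brd, which is not visible in the diff below):

	int rd_size = 1 << 30;			/* a 1 TB request, in KiB */
	sector_t capacity = rd_size * 2;	/* rd_size * 2 == 2^31: the
						 * multiplication overflows the
						 * 32-bit int before it is
						 * widened to sector_t */

With rd_size declared as unsigned long, rd_size * 2 is computed in the
wider type and yields the expected 2^31 sectors.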

Reported-by: Bart Van Assche 
Signed-off-by: Jan Kara 
---
 drivers/block/brd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8b22e4a04918..6af087068ab3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -443,8 +443,8 @@ static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
 
-static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
-module_param(rd_size, int, S_IRUGO);
+static unsigned long rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+module_param(rd_size, ulong, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 
 static int max_part = 1;
-- 
2.6.6



[PATCH 1/2] brd: Make rd_size argument static

2016-10-25 Thread Jan Kara
rd_size does not appear to be used outside of brd. Make it static.

Signed-off-by: Jan Kara 
---
 drivers/block/brd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..8b22e4a04918 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -443,7 +443,7 @@ static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT;
 module_param(rd_nr, int, S_IRUGO);
 MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
 
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 module_param(rd_size, int, S_IRUGO);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 
-- 
2.6.6



Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-26 Thread Jan Kara
On Wed 26-10-16 03:19:03, Christoph Hellwig wrote:
> Just as last time:
> 
> big NAK for introducing giant new infrastructure like a new I/O scheduler
> for the legacy request structure.
> 
> Please direct your engergy towards blk-mq instead.

Christoph, we will probably talk about this next week, but IMO rotating
disks and SATA-based SSDs are going to stay with us for another 15 years,
likely more. For them blk-mq is no win, while relatively complex IO
scheduling of the kind CFQ or BFQ does is a big win for them in some cases.
So I think IO scheduling (and thus a place for something like BFQ) is going
to stay with us for quite a long time. So are we going to add hooks in
blk-mq to support full-blown IO scheduling at least for single-queue
devices? Or how else do we want to support that HW?

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jan Kara
On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >
> >>On 26 Oct 2016, at 17:32, Jens Axboe wrote:
> >>
> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>The question to ask first is whether to actually have pluggable
> >>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>do the right thing in every case (and possibly can be bypassed
> >>>>completely).
> >>>
> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>option (default to off unless opted in by the driver or user), and
> >>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>device even for slow devices, except for the option of having I/O
> >>>scheduling.
> >>
> >>It's the only right way forward. blk-mq might not offer any substantial
> >>advantages to rotating storage, but with scheduling, it won't offer a
> >>downside either. And it'll take us towards the real goal, which is to
> >>have just one IO path.
> >
> >ok
> >
> >>Adding a new scheduler for the legacy IO path
> >>makes no sense.
> >
> >I would fully agree if effective and stable I/O scheduling would be
> >available in blk-mq in one or two months.  But I guess that it will
> >take at least one year optimistically, given the current status of the
> >needed infrastructure, and given the great difficulties of doing
> >effective scheduling at the high parallelism and extreme target speeds
> >of blk-mq.  Of course, this holds true unless little clever scheduling
> >is performed.
> >
> >So, what's the point in forcing a lot of users wait another year or
> >more, for a solution that has yet to be even defined, while they could
> >enjoy a much better system, and then switch an even better system when
> >scheduling is ready in blk-mq too?
> 
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
> 
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
> 
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler the global view that it needs for complex scheduling decisions,
so it should indeed be relatively easy to port BFQ to work like that.
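
Roughly this shape is what I have in mind (all names below are hypothetical,
just to make the idea concrete):

static blk_qc_t blk_sq_sched_make_request(struct request_queue *q,
					  struct bio *bio)
{
	struct iosched *sched = q->sched_data;		/* hypothetical */

	/* The scheduler sees every bio, so it keeps the global view it
	 * needs for CFQ/BFQ-style decisions... */
	iosched_add_bio(sched, bio);			/* hypothetical */

	/* ...and feeds the single HW queue only when it decides to. */
	while ((bio = iosched_dispatch(sched)) != NULL)	/* hypothetical */
		blk_mq_dispatch_bio(q, bio);		/* hypothetical */

	return BLK_QC_T_NONE;
}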

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-28 Thread Jan Kara
On Thu 27-10-16 10:26:18, Jens Axboe wrote:
> On 10/27/2016 03:26 AM, Jan Kara wrote:
> >On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> >>On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >>>
> >>>>On 26 Oct 2016, at 17:32, Jens Axboe wrote:
> >>>>
> >>>>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>>>The question to ask first is whether to actually have pluggable
> >>>>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>>>do the right thing in every case (and possibly can be bypassed
> >>>>>>completely).
> >>>>>
> >>>>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>>>option (default to off unless opted in by the driver or user), and
> >>>>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>>>device even for slow devices, except for the option of having I/O
> >>>>>scheduling.
> >>>>
> >>>>It's the only right way forward. blk-mq might not offer any substantial
> >>>>advantages to rotating storage, but with scheduling, it won't offer a
> >>>>downside either. And it'll take us towards the real goal, which is to
> >>>>have just one IO path.
> >>>
> >>>ok
> >>>
> >>>>Adding a new scheduler for the legacy IO path
> >>>>makes no sense.
> >>>
> >>>I would fully agree if effective and stable I/O scheduling would be
> >>>available in blk-mq in one or two months.  But I guess that it will
> >>>take at least one year optimistically, given the current status of the
> >>>needed infrastructure, and given the great difficulties of doing
> >>>effective scheduling at the high parallelism and extreme target speeds
> >>>of blk-mq.  Of course, this holds true unless little clever scheduling
> >>>is performed.
> >>>
> >>>So, what's the point in forcing a lot of users wait another year or
> >>>more, for a solution that has yet to be even defined, while they could
> >>>enjoy a much better system, and then switch an even better system when
> >>>scheduling is ready in blk-mq too?
> >>
> >>That same argument could have been made 2 years ago. Saying no to a new
> >>scheduler for the legacy framework goes back roughly that long. We could
> >>have had BFQ for mq NOW, if we didn't keep coming back to this very
> >>point.
> >>
> >>I'm hesitant to add a new scheduler because it's very easy to add, very
> >>difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> >>it'll take us years and years to get rid of it again. We should be
> >>moving towards LESS moving parts in the legacy path, not more.
> >>
> >>We can keep having this discussion every few years, but I think we'd
> >>both prefer to make some actual progress here. It's perfectly fine to
> >>add an interface for a single queue interface for an IO scheduler for
> >>blk-mq, since we don't care too much about scalability there. And that
> >>won't take years, that should be a few weeks. Retrofitting BFQ on top of
> >>that should not be hard either. That can co-exist with a real multiqueue
> >>scheduler as well, something that's geared towards some fairness for
> >>faster devices.
> >
> >OK, so some solution like having a variant of blk_sq_make_request() that
> >will consume requests, do IO scheduling decisions on them, and feed them
> >into the HW queue as it sees fit would be acceptable? That will provide the
> >IO scheduler a global view that it needs for complex scheduling decisions
> >so it should indeed be relatively easy to port BFQ to work like that.
> 
> I'd probably start off Omar's base [1] that switches the software queues
> to store bios instead of requests, since that lifts the of the 1:1
> mapping between what we can queue up and what we can dispatch. Without
> that, the IO scheduler won't have too much to work with. And with that
> in place, it'll be a "bio in, request out" type of setup, which is
> similar to what we have in the legac

Re: [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-31 Thread Jan Kara
On Mon 24-10-16 14:36:25, Kirill A. Shutemov wrote:
> On Thu, Oct 13, 2016 at 03:18:02PM +0200, Jan Kara wrote:
> > On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> > > On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > > > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > > > We writeback whole huge page a time.
> > > > 
> > > > This is one of the things I don't understand. Firstly I didn't see where
> > > > changes of writeback like this would happen (maybe they come later).
> > > > Secondly I'm not sure why e.g. writeback should behave atomically wrt 
> > > > huge
> > > > pages. Is this because radix-tree multiorder entry tracks dirtiness for 
> > > > us
> > > > at that granularity?
> > > 
> > > We track dirty/writeback on per-compound pages: meaning we have one
> > > dirty/writeback flag for whole compound page, not on every individual
> > > 4k subpage. The same story for radix-tree tags.
> > > 
> > > > BTW, can you also explain why do we need multiorder entries? What do
> > > > they solve for us?
> > > 
> > > It helps us having coherent view on tags in radix-tree: no matter which
> > > index we refer from the range huge page covers we will get the same
> > > answer on which tags set.
> > 
> > OK, understand that. But why do we need a coherent view? For which purposes
> > exactly do we care that it is not just a bunch of 4k pages that happen to
> > be physically contiguous and thus can be mapped in one PMD?
> 
> My understanding is that things like PageDirty() should be handled on the
> same granularity as PAGECACHE_TAG_DIRTY, otherwise things can go horribly
> wrong...

Yeah, I agree with that. My question was rather aiming in this direction:
why don't we keep PageDirty and PAGECACHE_TAG_DIRTY at page granularity?
Why do we push all this to happen only in the head page?

In your cover letter for the latest version (BTW thanks for expanding the
explanations there) you write:
  - head page (the first subpage) on LRU represents whole huge page;
  - head page's flags represent state of whole huge page (with few
exceptions);
  - mm can't migrate subpages of the compound page individually;

So the fact that flags of a head page represent flags of each individual
page is the decision that I'm questioning, at least for PageDirty and
PageWriteback flags. I'm asking because frankly, I don't like the series
much. IMHO too many places need to know about huge pages and things will
get broken frequently. And from filesystem POV I don't really see why a
filesystem should care about huge pages *at all*. Sure functions allocating
pages into page cache need to care, sure functions mapping pages into page
tables need to care. But nobody else should need to be aware we are playing
some huge page games... At least that is my idea how things ought to work
;)

Your solution seems to go more in the direction where we have two different
sizes of pages in the system and everyone has to cope with it. But I'd also
note that you go only halfway there - e.g. page lookup functions still work
with subpages, some places still use PAGE_SIZE & page->index, ... - so the
result is a strange mix.

So what are the reasons for binding the pages forming a huge page so
tightly?


Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-01 Thread Jan Kara
On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> [ My mail system got broken and the original reply didn't get through. Resent. ]

OK, this answers some of my questions from previous email so disregard that
one.

> On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > > Most of work happans on head page. Only when we need to do copy data to
> > > userspace we find relevant subpage.
> > > 
> > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > > would require some more work.
> >
> > Hum, I'm kind of lost.
> 
> The limitation here comes from how copy_page_to_iter() and
> copy_page_from_iter() work wrt. highmem: it can only handle one small
> page a time.
> 
> On the write side, we also have a problem with assuming small pages: the
> write length and offset within the page are calculated before we know if a
> small or huge page is allocated. It's not easy to fix. Looks like it would
> require a change in the ->write_begin() interface to accept len > PAGE_SIZE.
>
> > Can you point me to some design document / email that would explain some
> > high level ideas how are huge pages in page cache supposed to work?
> 
> I'll elaborate more in cover letter to next revision.
> 
> > When are we supposed to operate on the head page and when on subpage?
> 
> It's case-by-case. See above explanation why we're limited to PAGE_SIZE
> here.
> 
> > What is protected by the page lock of the head page?
> 
> Whole huge page. As with anon pages.
> 
> > Do page locks of subpages play any role?
> 
> lock_page() on any subpage would lock whole huge page.
> 
> > If understand right, e.g.  pagecache_get_page() will return subpages but
> > is it generally safe to operate on subpages individually or do we have
> > to be aware that they are part of a huge page?
> 
> I tried to make it as transparent as possible: page flag operations will
> be redirected to head page, if necessary. Things like page_mapping() and
> page_to_pgoff() know about huge pages.
> 
> Direct access to struct page fields must be avoided for tail pages as most
> of them doesn't have meaning you would expect for small pages.

OK, good to know.

> > If I understand the motivation right, it is mostly about being able to mmap
> > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > implement it by allocating PMD sized chunks of pages when adding pages to
> > page cache, we don't even have to read them all unless we come from PMD
> > fault path.
> 
> Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> per-hugepage, one common list of buffer heads...
> 
> PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where
> handling it otherwise doesn't make sense) and handling it differently for
> file-THP is a nightmare from a maintenance POV.

But the complexity of two different page sizes for the page cache and
*each* filesystem that wants to support it does not make maintenance easy
either. So I'm not convinced that using the same rules for anon-THP and
file-THP is a clear win. And if we have these two options, neither of which
has negligible maintenance cost, I'd also like to see more justification
for why it is a good idea to have file-THP for normal filesystems. Do you
have any performance numbers that show it is a win under some realistic
workload?

I'd also note that having PMD-sized pages has some obvious disadvantages as
well:

1) I'm not sure buffer head handling code will quite scale to 512 or even
2048 buffer_heads on a linked list referenced from a page. It may work but
I suspect the performance will suck. 

2) PMD-sized pages result in increased space & memory usage.

3) In ext4 we have to estimate how much metadata we may need to modify in
the worst case when allocating blocks underlying a page (you don't seem to
update this estimate in your patch set). With 2048 blocks underlying a
page, each possibly in a different block group, it is a lot of metadata,
forcing us to reserve a large transaction (not sure you'd even be able to
reserve such a large transaction with the default journal size), which
again makes things slower - see the back-of-the-envelope estimate after
this list.

4) As you have noted some places like write_begin() still depend on 4k
pages which creates a strange mix of places that use subpages and that use
head pages.

All this would be a non-issue (well, except 2 I guess) if we just didn't
expose filesystems to the fact that something like file-THP exists.
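
The back-of-the-envelope estimate promised in 3) (rough worst-case
arithmetic, assuming the usual jbd2 rule that one handle may not reserve
more than a quarter of the journal): with a 1k block size, a 2MB huge page
covers 2048 blocks. If each block lands in a different block group, the
worst case touches a block bitmap and a group descriptor per group, i.e.
roughly 2 * 2048 plus a few extra blocks (inode, superblock, extent tree),
so on the order of 4100 buffers journalled for a single page. A small
default journal may not even be able to grant such a reservation, and even
when it can, a reservation that large serializes other writers.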

> > Reclaim may need to be aware not to split pages unnecessarily
> > but that's about it. So I'd like to understand what's wrong with this
> > naive idea and why do filesystems need to be aware that someone wants to
> > map in PMD sized chunks...

Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-03 Thread Jan Kara
On Wed 02-11-16 07:36:12, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> > 
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck. 
> 
> buffer_head handling always sucks.  For the iomap based buffered write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 at the latest, maybe even in 4.10 if I get enough
> other things off my plate in time.  I think that's the right way to go for
> THP, especially if we require the fs to allocate the whole huge page
> as a single extent, similar to the DAX PMD mapping case.

Yeah, if we require the whole THP to be backed by a single extent, things
get simpler. But still there's the issue that ext4 cannot easily use the
iomap code for buffered writes because of the data exposure issue we
already talked about - well, ext4 could actually work (it supports
unwritten extents) but old compatibility modes won't work and I'd strongly
prefer not to have two independent write paths in ext4... But I'll put more
thought into this; I have some idea how we could hack around the problem
even for on-disk formats that don't support unwritten extents. The trick we
could use is that we'd just mark the range of the file as unwritten in the
in-memory extent cache we have; that should protect us against exposing
uninitialized pages in racing faults.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> How so?

Well, memory usage is clear I guess - if the files are smaller than the THP
size, or if you don't use all the 4k pages that form the THP, you are
wasting memory. Sure, it can be somewhat controlled by the heuristics
deciding when to use THP in the pagecache and when to fall back to 4k pages.

Regarding space usage - it is mostly a problem for sparse mmaped IO, where
you always have to allocate (and write out) all the blocks underlying a THP
that gets written to, even though you may only need 4K from that area...

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.
> 
> > 4) As you have noted some places like write_begin() still depend on 4k
> > pages which creates a strange mix of places that use subpages and that use
> > head pages.
> 
> Just use the iomap bufferd I/O code and all these issues will go away.

Yep, the above two things would make things somewhat less ugly I agree.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-03 Thread Jan Kara
On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > > If I understand the motivation right, it is mostly about being able to 
> > > > mmap
> > > > PMD-sized chunks to userspace. So my naive idea would be that we could 
> > > > just
> > > > implement it by allocating PMD sized chunks of pages when adding pages 
> > > > to
> > > > page cache, we don't even have to read them all unless we come from PMD
> > > > fault path.
> > > 
> > > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > > per-hugepage, one common list of buffer heads...
> > > 
> > > PG_dirty and PG_uptodate behaviour inhered from anon-THP (where handling
> > > it otherwise doesn't make sense) and handling it differently for file-THP
> > > is nightmare from maintenance POV.
> > 
> > But the complexity of two different page sizes for page cache and *each*
> > filesystem that wants to support it does not make the maintenance easy
> > either.
> 
> I think with time we can make small pages just a subcase of huge pages.
> And some generalization can be made once more than one filesystem with
> backing storage will adopt huge pages.

My objection is that IMHO the code is currently too ugly to go in. Too many
places need to know about THP, and I'm not even sure you have patched all of
them, whether some corner cases remain unfixed, or how I should find that
out.

> > So I'm not convinced that using the same rules for anon-THP and
> > file-THP is a clear win.
> 
> We already have file-THP with the same rules: tmpfs. Backing storage is
> what changes the picture.

Right, the ugliness comes from access to backing storage having to deal
with huge pages.

> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> >
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck.
> 
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have a list there in the first place?
> Why not just array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

As Christoph said, the iomap code should help you with that and make things
simpler. If things go as we imagine, we should be able to pretty much avoid
buffer heads. But it will take some time to get there.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> Space? Do you mean disk space? Not really: we still don't write beyond
> i_size or into holes.
> 
> Behaviour wrt holes may change with mmap()-IO as we have less
> granularity, but the same can be seen just between different
> architectures: 4k vs. 64k base page size.

Yes, I meant the different granularity of mmap-based IO. And I agree it isn't
a new problem, but the scale of the problem is much larger with 2MB pages
than with, say, 64K pages. Actually, the overhead of the higher IO
granularity of 64K pages was one of the reasons we switched SLES PPC kernels
from 64K pages to 4K pages (we've got complaints from customers).

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> I didn't see this in profiles. And xfstests looks fine. I probably need to
> run them with 1k blocks once again.

You wouldn't see this in profiles - it is a correctness thing. And it won't
be triggered unless the file is heavily fragmented, which likely does not
happen with any test in xfstests. If it happens, you'll notice though - the
filesystem will just report an error and shut itself down.
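
For a rough feel of the scale, here is a back-of-the-envelope sketch - the
per-block metadata count, the journal size, and the jbd2 per-handle cap are
all illustrative assumptions, not exact ext4/jbd2 accounting:

#include <stdio.h>

int main(void)
{
        /* All figures below are assumptions for illustration only. */
        const long page_bytes  = 2L * 1024 * 1024;        /* one 2MB THP */
        const long block_bytes = 1024;                    /* 1k blocks */
        const long blocks      = page_bytes / block_bytes;   /* 2048 */

        /*
         * Assume each block may land in a different block group, so each
         * needs its own block bitmap + group descriptor, plus an assumed
         * three blocks of extent tree path.
         */
        const long per_block = 1 + 1 + 3;
        const long credits   = blocks * per_block;        /* 10240 */

        /*
         * jbd2 caps a single handle at roughly a quarter of the journal;
         * assume a default-sized journal of 32768 fs blocks.
         */
        const long max_handle = 32768 / 4;                /* 8192 */

        printf("worst-case credits needed: %ld\n", credits);
        printf("per-handle limit:          %ld\n", max_handle);
        printf("reservation possible: %s\n",
               credits <= max_handle ? "yes" : "no");     /* "no" */
        return 0;
}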

> The numbers below generated with fio. The working set is relatively small,
> so it fits into page cache and writing set doesn't hit dirty_ratio.
> 
> I think the mmap performance should be enough to justify initial inclusion.

Re: [PATCH 3/8] writeback: mark background writeback as such

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:46, Jens Axboe wrote:
> If we're doing background type writes, then use the appropriate
> background write flags for that.
> 
> Signed-off-by: Jens Axboe 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/writeback.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c96ee8108f..c78f9f0920b5 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct 
> writeback_control *wbc)
>  {
>   if (wbc->sync_mode == WB_SYNC_ALL)
>   return REQ_SYNC;
> + else if (wbc->for_kupdate || wbc->for_background)
> + return REQ_BACKGROUND;
>  
>   return 0;
>  }
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/8] block: add WRITE_BACKGROUND

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:44, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.
> 
> Signed-off-by: Jens Axboe 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/blk_types.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index bb921028e7c5..562ac46cb790 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -177,6 +177,7 @@ enum req_flag_bits {
>   __REQ_FUA,  /* forced unit access */
>   __REQ_PREFLUSH, /* request for cache flush */
>   __REQ_RAHEAD,   /* read ahead, can fail anytime */
> + __REQ_BACKGROUND,   /* background IO */
>   __REQ_NR_BITS,  /* stops here */
>  };
>  
> @@ -192,6 +193,7 @@ enum req_flag_bits {
>  #define REQ_FUA  (1ULL << __REQ_FUA)
>  #define REQ_PREFLUSH (1ULL << __REQ_PREFLUSH)
>  #define REQ_RAHEAD   (1ULL << __REQ_RAHEAD)
> +#define REQ_BACKGROUND   (1ULL << __REQ_BACKGROUND)
>  
>  #define REQ_FAILFAST_MASK \
>   (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 5/8] block: add code to track actual device queue depth

2016-11-05 Thread Jan Kara
On Tue 01-11-16 15:08:48, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
> 
> Fill a hole in struct request with a 'queue_depth' member, that
> drivers can call to more closely inform the block layer of the
> real queue depth.
> 
> Signed-off-by: Jens Axboe 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  block/blk-settings.c   | 12 
>  drivers/scsi/scsi.c|  3 +++
>  include/linux/blkdev.h | 11 +++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 55369a65dea2..9cf053759363 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, 
> bool queueable)
>  EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
>  
>  /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:   the request queue for the device
> + * @depth:   queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> + q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);
> +
> +/**
>   * blk_queue_write_cache - configure queue's write cache
>   * @q:   the request queue for the device
>   * @wc:  write back cache on or off
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 1deb6adc411f..75455d4dab68 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int 
> depth)
>   wmb();
>   }
>  
> + if (sdev->request_queue)
> + blk_set_queue_depth(sdev->request_queue, depth);
> +
>   return sdev->queue_depth;
>  }
>  EXPORT_SYMBOL(scsi_change_queue_depth);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8396da2bb698..0c677fb35ce4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -405,6 +405,8 @@ struct request_queue {
>   struct blk_mq_ctx __percpu  *queue_ctx;
>   unsigned intnr_queues;
>  
> + unsigned intqueue_depth;
> +
>   /* hw dispatch queues */
>   struct blk_mq_hw_ctx**queue_hw_ctx;
>   unsigned intnr_hw_queues;
> @@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio 
> *a, struct bio *b)
>   return false;
>  }
>  
> +static inline unsigned int blk_queue_depth(struct request_queue *q)
> +{
> + if (q->queue_depth)
> + return q->queue_depth;
> +
> + return q->nr_requests;
> +}
> +
>  /*
>   * q->prep_rq_fn return values
>   */
> @@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits 
> *limits, unsigned int min);
>  extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
>  extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
>  extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
> +extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
>  extern void blk_set_default_limits(struct queue_limits *lim);
>  extern void blk_set_stacking_limits(struct queue_limits *lim);
>  extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()

2016-11-08 Thread Jan Kara
On Tue 01-11-16 15:08:47, Jens Axboe wrote:
> Note in the bdi_writeback structure whenever a task ends up sleeping
> waiting for progress. We can use that information in the lower layers
> to increase the priority of writes.
> 
> Signed-off-by: Jens Axboe 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/backing-dev-defs.h | 2 ++
>  mm/backing-dev.c | 1 +
>  mm/page-writeback.c  | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/linux/backing-dev-defs.h 
> b/include/linux/backing-dev-defs.h
> index c357f27d5483..dc5f76d7f648 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -116,6 +116,8 @@ struct bdi_writeback {
>   struct list_head work_list;
>   struct delayed_work dwork;  /* work item used for writeback */
>  
> + unsigned long dirty_sleep;  /* last wait */
> +
>   struct list_head bdi_node;  /* anchored at bdi->wb_list */
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 8fde443f36d7..3bfed5ab2475 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct 
> backing_dev_info *bdi,
>   spin_lock_init(&wb->work_lock);
>   INIT_LIST_HEAD(&wb->work_list);
>   INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
> + wb->dirty_sleep = jiffies;
>  
>   wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
>   if (!wb->congested)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 439cc63ad903..52e2f8e3b472 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1778,6 +1778,7 @@ static void balance_dirty_pages(struct address_space 
> *mapping,
> pause,
> start_time);
>   __set_current_state(TASK_KILLABLE);
> + wb->dirty_sleep = now;
>   io_schedule_timeout(pause);
>  
>   current->dirty_paused_when = now + pause;
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-08 Thread Jan Kara
> +static ssize_t queue_stats_show(struct request_queue *q, char *page)
> +{
> + struct blk_rq_stat stat[2];
> + ssize_t ret;
> +
> + blk_queue_stat_get(q, stat);
> +
> + ret = print_stat(page, &stat[0], "read :");
> + ret += print_stat(page + ret, &stat[1], "write:");
> + return ret;
> +}
> +
>  static struct queue_sysfs_entry queue_requests_entry = {
>   .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
>   .show = queue_requests_show,
> @@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
>   .show = queue_dax_show,
>  };
>  
> +static struct queue_sysfs_entry queue_stats_entry = {
> + .attr = {.name = "stats", .mode = S_IRUGO },
> + .show = queue_stats_show,
> +};
> +
>  static struct attribute *default_attrs[] = {
>   &queue_requests_entry.attr,
>   &queue_ra_entry.attr,
> @@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
>   &queue_poll_entry.attr,
>   &queue_wc_entry.attr,
>   &queue_dax_entry.attr,
> + &queue_stats_entry.attr,
>   NULL,
>  };
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 562ac46cb790..4d0044d09984 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -250,4 +250,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t 
> cookie)
>   return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
>  }
>  
> +struct blk_issue_stat {
> + u64 time;
> +};
> +
> +#define BLK_RQ_STAT_BATCH 64
> +
> +struct blk_rq_stat {
> + s64 mean;
> + u64 min;
> + u64 max;
> + s32 nr_samples;
> + s32 nr_batch;
> + u64 batch;
> + s64 time;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 0c677fb35ce4..6bd5eb56894e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -197,6 +197,7 @@ struct request {
>   struct gendisk *rq_disk;
>   struct hd_struct *part;
>   unsigned long start_time;
> + struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
>   struct request_list *rl;/* rl this rq is alloced from */
>   unsigned long long start_time_ns;
> @@ -492,6 +493,9 @@ struct request_queue {
>  
>   unsigned intnr_sorted;
>   unsigned intin_flight[2];
> +
> + struct blk_rq_stat  rq_stats[2];
> +
>   /*
>* Number of active block driver functions for which blk_drain_queue()
>* must wait. Must be incremented around functions that unlock the
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-08 Thread Jan Kara
On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, 
> max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

Just one serious question and one nit below:

> +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> +{
> + struct rq_wait *rqw;
> + int inflight, limit;
> +
> + if (!(wb_acct & WBT_TRACKED))
> + return;
> +
> + rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> + inflight = atomic_dec_return(&rqw->inflight);
> +
> + /*
> +  * wbt got disabled with IO in flight. Wake up any potential
> +  * waiters, we don't have to do more than that.
> +  */
> + if (unlikely(!rwb_enabled(rwb))) {
> + rwb_wake_all(rwb);
> + return;
> + }
> +
> + /*
> +  * If the device does write back caching, drop further down
> +  * before we wake people up.
> +  */
> + if (rwb->wc && !wb_recent_wait(rwb))
> + limit = 0;
> + else
> + limit = rwb->wb_normal;

So for devices with write cache, you will completely drain the device
before waking anybody waiting to issue new requests. Isn't it too strict?
In particular may_queue() will allow new writers to issue new writes once
we drop below the limit so it can happen that some processes will be
effectively starved waiting in may_queue?

> +static void wb_timer_fn(unsigned long data)
> +{
> + struct rq_wb *rwb = (struct rq_wb *) data;
> + unsigned int inflight = wbt_inflight(rwb);
> + int status;
> +
> + status = latency_exceeded(rwb);
> +
> + trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
> +
> + /*
> +  * If we exceeded the latency target, step down. If we did not,
> +  * step one level up. If we don't know enough to say either exceeded
> +  * or ok, then don't do anything.
> +  */
> + switch (status) {
> + case LAT_EXCEEDED:
> + scale_down(rwb, true);
> + break;
> + case LAT_OK:
> + scale_up(rwb);
> + break;
> + case LAT_UNKNOWN_WRITES:
> + scale_up(rwb);
> + break;
> + case LAT_UNKNOWN:
> + if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
> + break;
> + /*
> +  * We get here for two reasons:
> +  *
> +  * 1) We previously scaled reduced depth, and we currently
> +  *don't have a valid read/write sample. For that case,
> +  *slowly return to center state (step == 0).
> +  * 2) We started at the center step, but don't have a valid
> +  *    read/write sample, but we do have writes going on.
> +  *Allow step to go negative, to increase write perf.
> +  */

I think part 2) of the comment now belongs to the LAT_UNKNOWN_WRITES label.
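
As an aside, the stepping policy in the switch quoted above can be
summarized by a small userspace model - purely illustrative, with the
bookkeeping simplified compared to the real blk-wbt code:

#include <stdio.h>

enum lat_status { LAT_OK, LAT_UNKNOWN, LAT_UNKNOWN_WRITES, LAT_EXCEEDED };

#define RWB_UNKNOWN_BUMP 5

/* Positive scale_step = throttled harder, negative = write-friendly. */
struct toy_rwb {
        int scale_step;
        int unknown_cnt;
};

static void timer_tick(struct toy_rwb *rwb, enum lat_status status)
{
        switch (status) {
        case LAT_EXCEEDED:
                rwb->scale_step++;      /* scale down: shrink write depth */
                break;
        case LAT_OK:
        case LAT_UNKNOWN_WRITES:
                rwb->scale_step--;      /* scale up: allow more writes */
                break;
        case LAT_UNKNOWN:
                if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
                        break;
                rwb->unknown_cnt = 0;
                /*
                 * Either drift back toward the center state (step == 0)
                 * or, from the center, go negative for write-only
                 * workloads - the two cases in the quoted comment.
                 */
                rwb->scale_step--;
                break;
        }
        printf("status=%d -> scale_step=%d\n", status, rwb->scale_step);
}

int main(void)
{
        struct toy_rwb rwb = { 0, 0 };

        timer_tick(&rwb, LAT_EXCEEDED);         /* reads hurt: throttle */
        timer_tick(&rwb, LAT_OK);               /* healthy: relax again */
        for (int i = 0; i < 10; i++)            /* write-only stretch */
                timer_tick(&rwb, LAT_UNKNOWN);
        return 0;
}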

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 8/8] block: hook up writeback throttling

2016-11-08 Thread Jan Kara
On Tue 01-11-16 15:08:51, Jens Axboe wrote:
> Enable throttling of buffered writeback to make it a lot
> more smooth, and has way less impact on other system activity.
> Background writeback should be, by definition, background
> activity. The fact that we flush huge bundles of it at a time
> means that it potentially has heavy impacts on foreground workloads,
> which isn't ideal. We can't easily limit the sizes of writes that
> we do, since that would impact file system layout in the presence
> of delayed allocation. So just throttle back buffered writeback,
> unless someone is waiting for it.
> 
> The algorithm for when to throttle takes its inspiration in the
> CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
> the minimum latencies of requests over a window of time. In that
> window of time, if the minimum latency of any request exceeds a
> given target, then a scale count is incremented and the queue depth
> is shrunk. The next monitoring window is shrunk accordingly. Unlike
> CoDel, if we hit a window that exhibits good behavior, then we
> simply increment the scale count and re-calculate the limits for that
> scale value. This prevents us from oscillating between a
> close-to-ideal value and max all the time, instead remaining in the
> windows where we get good behavior.
> 
> Unlike CoDel, blk-wb allows the scale count to go negative. This
> happens if we primarily have writes going on. Unlike positive
> scale counts, this doesn't change the size of the monitoring window.
> When the heavy writers finish, blk-wb quickly snaps back to its
> stable state of a zero scale count.
> 
> The patch registers two sysfs entries. The first one, 'wb_window_usec',
> defines the window of monitoring. The second one, 'wb_lat_usec',
> sets the latency target for the window. It defaults to 2 msec for
> non-rotational storage, and 75 msec for rotational storage. Setting
> this value to '0' disables blk-wb. Generally, a user would not have
> to touch these settings.
> 
> We don't enable WBT on devices that are managed with CFQ, and have
> a non-root block cgroup attached. If we have a proportional share setup
> on this particular disk, then the wbt throttling will interfere with
> that. We don't have a strong need for wbt for that case, since we will
> rely on CFQ doing that for us.

Just one nit: Don't you miss a wbt_exit() call for the legacy block layer?
I don't see where that happens.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-09 Thread Jan Kara
On Tue 08-11-16 08:41:09, Jens Axboe wrote:
> On Tue, Nov 08 2016, Jan Kara wrote:
> > On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> > > We can hook this up to the block layer, to help throttle buffered
> > > writes.
> > > 
> > > wbt registers a few trace points that can be used to track what is
> > > happening in the system:
> > > 
> > > wbt_lat: 259:0: latency 2446318
> > > wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
> > >wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> > > wbt_step: 259:0: step down: step=1, window=72727272, background=8, 
> > > normal=16, max=32
> > > 
> > > This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> > > dumps the current read/write stats for that window, and wbt_step shows a
> > > step down event where we now scale back writes. Each trace includes the
> > > device, 259:0 in this case.
> > 
> > Just one serious question and one nit below:
> > 
> > > +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> > > +{
> > > + struct rq_wait *rqw;
> > > + int inflight, limit;
> > > +
> > > + if (!(wb_acct & WBT_TRACKED))
> > > + return;
> > > +
> > > + rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> > > + inflight = atomic_dec_return(&rqw->inflight);
> > > +
> > > + /*
> > > +  * wbt got disabled with IO in flight. Wake up any potential
> > > +  * waiters, we don't have to do more than that.
> > > +  */
> > > + if (unlikely(!rwb_enabled(rwb))) {
> > > + rwb_wake_all(rwb);
> > > + return;
> > > + }
> > > +
> > > + /*
> > > +  * If the device does write back caching, drop further down
> > > +  * before we wake people up.
> > > +  */
> > > + if (rwb->wc && !wb_recent_wait(rwb))
> > > + limit = 0;
> > > + else
> > > + limit = rwb->wb_normal;
> > 
> > So for devices with write cache, you will completely drain the device
> > before waking anybody waiting to issue new requests. Isn't it too strict?
> > In particular may_queue() will allow new writers to issue new writes once
> > we drop below the limit so it can happen that some processes will be
> > effectively starved waiting in may_queue?
> 
> It is strict, and perhaps too strict. In testing, it's the only method
> that's proven to keep the writeback caching devices in check. It will
> round robin the writers, if we have more, which isn't necessarily a bad
> thing. Each will get to do a burst of depth writes, then wait for a new
> one.

Well, I'm more concerned about a situation where one writer does a bursty
write and blocks sleeping in may_queue(). Another writer produces a steady
flow of write requests that never causes the write queue to drain
completely, but that writer also never blocks in may_queue() when it starts
queueing after the write queue has somewhat drained, because it never
submits many requests in parallel. In such a case the first writer would get
starved, AFAIU.

Also, I'm not sure why such logic is needed for devices with a writeback
cache. Sure, the disk is fast to accept writes, but if that causes long read
latencies, we should scale down the writeback limits so that we eventually
end up submitting only one write request anyway - effectively the same
thing as limit=0 - won't we?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-09 Thread Jan Kara
On Tue 08-11-16 08:25:52, Jens Axboe wrote:
> On 11/08/2016 06:30 AM, Jan Kara wrote:
> >On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> >>For legacy block, we simply track them in the request queue. For
> >>blk-mq, we track them on a per-sw queue basis, which we can then
> >>sum up through the hardware queues and finally to a per device
> >>state.
> >>
> >>The stats are tracked in, roughly, 0.1s interval windows.
> >>
> >>Add sysfs files to display the stats.
> >>
> >>Signed-off-by: Jens Axboe 
> >
> >This patch looks mostly good to me but I have one concern: You track
> >statistics in a fixed 134ms window, stats get cleared at the beginning of
> >each window. Now this can interact with the writeback window and latency
> >settings which are dynamic and settable from userspace - so if the
> >writeback code observation window gets set larger than the stats window,
> >things become strange since you'll likely miss quite some observations
> >about read latencies. So I think you need to make sure stats window is
> >always larger than writeback window. Or actually, why do you have something
> >like stats window and don't leave clearing of statistics completely to the
> >writeback tracking code?
> 
> That's a good point, and there actually used to be a comment to that
> effect in the code. I think the best solution here would be to make the
> stats code mask available somewhere, and allow a consumer of the stats
> to request a larger window.
> 
> Similarly, we could make the stat window be driven by the consumer, as
> you suggest.
> 
> Currently there are two pending submissions that depend on the stats
> code. One is this writeback series, and the other one is the hybrid
> polling code. The latter does not really care about the window size as
> such, since it has no monitoring window of its own, and it wants the
> auto-clearing as well.
> 
> I don't mind working on additions for this, but I'd prefer if we could
> layer them on top of the existing series instead of respinning it.
> There's considerable test time on the existing patchset. Would that work
> for you? Especially collapsing the stats and wbt windows would require
> some re-architecting.

OK, that works for me. Actually, when thinking about this, I have one more
suggestion: Do we really want to expose the wbt window as a sysfs tunable?
I guess it is good for initial experiments, but longer term, having the wbt
window length be a function of the target read latency might be better.
Generally you want the window length to be considerably larger than the
target latency, but OTOH not so large that the algorithm cannot react
reasonably quickly - which suggests it could really be autotuned (and we
scale the window anyway to adapt it to the current situation).

> >Also as a side note - nobody currently uses the mean value of the
> >statistics. It may be faster to track just sum and count so that mean can
> >be computed on request which will be presumably much more rare than current
> >situation where we recompute the mean on each batch update. Actually, that
> >way you could get rid of the batching as well I assume.
> 
> That could be opt-in as well. The poll code uses it. And fwiw, it is
> exposed through sysfs as well.

Yeah, my point was that doing the division in response to a sysfs read or an
actual request to read the average is likely going to be less expensive than
having to do it on each batch completion (actually, you seem to have the
batching code only so that you don't have to do the division too often).
Whether my suggestion is right depends on how often the polling code
actually needs to read the average...
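
A sketch of what I mean - illustrative only, not the actual blk-stat code:
track just the running sum and the sample count in the completion hot path,
and divide only when somebody actually asks for the mean:

#include <stdio.h>
#include <stdint.h>

/* Per-window accumulator: no division on the completion side. */
struct lat_stat {
        uint64_t sum;           /* total latency in ns */
        uint64_t nr_samples;
};

static inline void stat_add(struct lat_stat *s, uint64_t lat_ns)
{
        s->sum += lat_ns;       /* one add and one increment, that's it */
        s->nr_samples++;
}

/* The division happens only here, on the (rare) read side. */
static inline uint64_t stat_mean(const struct lat_stat *s)
{
        return s->nr_samples ? s->sum / s->nr_samples : 0;
}

int main(void)
{
        struct lat_stat s = { 0, 0 };

        /* latencies borrowed from the wbt_stat example output */
        stat_add(&s, 2446318);
        stat_add(&s, 518866);
        stat_add(&s, 15522);
        printf("mean latency: %llu ns over %llu samples\n",
               (unsigned long long)stat_mean(&s),
               (unsigned long long)s.nr_samples);
        return 0;
}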

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-10 Thread Jan Kara
On Wed 09-11-16 12:52:59, Jens Axboe wrote:
> On 11/09/2016 09:07 AM, Jens Axboe wrote:
> >On 11/09/2016 01:40 AM, Jan Kara wrote:
> >>>>So for devices with write cache, you will completely drain the device
> >>>>before waking anybody waiting to issue new requests. Isn't it too
> >>>>strict?
> >>>>In particular may_queue() will allow new writers to issue new writes
> >>>>once
> >>>>we drop below the limit so it can happen that some processes will be
> >>>>effectively starved waiting in may_queue?
> >>>
> >>>It is strict, and perhaps too strict. In testing, it's the only method
> >>>that's proven to keep the writeback caching devices in check. It will
> >>>round robin the writers, if we have more, which isn't necessarily a bad
> >>>thing. Each will get to do a burst of depth writes, then wait for a new
> >>>one.
> >>
> >>Well, I'm more concerned about a situation where one writer does a
> >>bursty write and blocks sleeping in may_queue(). Another writer
> >>produces a steady flow of write requests so that never causes the
> >>write queue to completely drain but that writer also never blocks in
> >>may_queue() when it starts queueing after write queue has somewhat
> >>drained because it never submits many requests in parallel. In such
> >>case the first writer would get starved AFAIU.
> >
> >I see what you are saying. I can modify the logic to ensure that if we
> >do have a waiter, we queue up others behind it. That should get rid of
> >that concern.
> 
> I added that - if we currently have a waiter, we'll add ourselves to the
> back of the waitqueue and wait.

OK, sounds good to me. If the write queue draining turns out to be an issue,
it will at least be clearly visible with this logic.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-10 Thread Jan Kara
On Wed 09-11-16 12:52:25, Jens Axboe wrote:
> On 11/09/2016 09:09 AM, Jens Axboe wrote:
> >On 11/09/2016 02:01 AM, Jan Kara wrote:
> >>On Tue 08-11-16 08:25:52, Jens Axboe wrote:
> >>>On 11/08/2016 06:30 AM, Jan Kara wrote:
> >>>>On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> >>>>>For legacy block, we simply track them in the request queue. For
> >>>>>blk-mq, we track them on a per-sw queue basis, which we can then
> >>>>>sum up through the hardware queues and finally to a per device
> >>>>>state.
> >>>>>
> >>>>>The stats are tracked in, roughly, 0.1s interval windows.
> >>>>>
> >>>>>Add sysfs files to display the stats.
> >>>>>
> >>>>>Signed-off-by: Jens Axboe 
> >>>>
> >>>>This patch looks mostly good to me but I have one concern: You track
> >>>>statistics in a fixed 134ms window, stats get cleared at the
> >>>>beginning of
> >>>>each window. Now this can interact with the writeback window and
> >>>>latency
> >>>>settings which are dynamic and settable from userspace - so if the
> >>>>writeback code observation window gets set larger than the stats
> >>>>window,
> >>>>things become strange since you'll likely miss quite some observations
> >>>>about read latencies. So I think you need to make sure stats window is
> >>>>always larger than writeback window. Or actually, why do you have
> >>>>something
> >>>>like stats window and don't leave clearing of statistics completely
> >>>>to the
> >>>>writeback tracking code?
> >>>
> >>>That's a good point, and there actually used to be a comment to that
> >>>effect in the code. I think the best solution here would be to make the
> >>>stats code mask available somewhere, and allow a consumer of the stats
> >>>to request a larger window.
> >>>
> >>>Similarly, we could make the stat window be driven by the consumer, as
> >>>you suggest.
> >>>
> >>>Currently there are two pending submissions that depend on the stats
> >>>code. One is this writeback series, and the other one is the hybrid
> >>>polling code. The latter does not really care about the window size as
> >>>such, since it has no monitoring window of its own, and it wants the
> >>>auto-clearing as well.
> >>>
> >>>I don't mind working on additions for this, but I'd prefer if we could
> >>>layer them on top of the existing series instead of respinning it.
> >>>There's considerable test time on the existing patchset. Would that work
> >>>for you? Especially collapsing the stats and wbt windows would require
> >>>some re-architecting.
> >>
> >>OK, that works for me. Actually, when thinking about this, I have one
> >>more
> >>suggestion: Do we really want to expose the wbt window as a sysfs
> >>tunable?
> >>I guess it is good for initial experiments but longer term having the wbt
> >>window length be a function of target read latency might be better.
> >>Generally you want the window length to be considerably larger than the
> >>target latency but OTOH not too large so that the algorithm can react
> >>reasonably quickly so that suggests it could really be autotuned (and we
> >>scale the window anyway to adapt it to current situation).
> >
> >That's not a bad idea, I have thought about that as well before. We
> >don't need the window tunable, and you are right, it can be a function
> >of the desired latency.
> >
> >I'll hardwire the 100msec latency window for now and get rid of the
> >exposed tunable. It's harder to remove sysfs files once they have made
> >it into the kernel...
> 
> Killed the sysfs variable, so for now it'll be a 100msec window by
> default.

OK, I guess that's good enough to get this merged.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] blk-wbt: store queue instead of bdi

2016-11-14 Thread Jan Kara
On Fri 11-11-16 08:21:56, Jens Axboe wrote:
> The bdi was a leftover from when the code was block layer agnostic.
> Now that we just support a block layer user, store the queue directly.
> 
> Signed-off-by: Jens Axboe 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  block/blk-wbt.c | 20 
>  block/blk-wbt.h |  4 +---
>  2 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 889c17ff8503..4ab9cebc8003 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -96,7 +96,7 @@ static void wb_timestamp(struct rq_wb *rwb, unsigned long 
> *var)
>   */
>  static bool wb_recent_wait(struct rq_wb *rwb)
>  {
> - struct bdi_writeback *wb = &rwb->bdi->wb;
> + struct bdi_writeback *wb = &rwb->queue->backing_dev_info.wb;
>  
>   return time_before(jiffies, wb->dirty_sleep + HZ);
>  }
> @@ -279,6 +279,7 @@ enum {
>  
>  static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
>  {
> + struct backing_dev_info *bdi = &rwb->queue->backing_dev_info;
>   u64 thislat;
>  
>   /*
> @@ -293,7 +294,7 @@ static int __latency_exceeded(struct rq_wb *rwb, struct 
> blk_rq_stat *stat)
>   thislat = rwb_sync_issue_lat(rwb);
>   if (thislat > rwb->cur_win_nsec ||
>   (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> - trace_wbt_lat(rwb->bdi, thislat);
> + trace_wbt_lat(bdi, thislat);
>   return LAT_EXCEEDED;
>   }
>  
> @@ -317,13 +318,13 @@ static int __latency_exceeded(struct rq_wb *rwb, struct 
> blk_rq_stat *stat)
>* If the 'min' latency exceeds our target, step down.
>*/
>   if (stat[0].min > rwb->min_lat_nsec) {
> - trace_wbt_lat(rwb->bdi, stat[0].min);
> - trace_wbt_stat(rwb->bdi, stat);
> + trace_wbt_lat(bdi, stat[0].min);
> + trace_wbt_stat(bdi, stat);
>   return LAT_EXCEEDED;
>   }
>  
>   if (rwb->scale_step)
> - trace_wbt_stat(rwb->bdi, stat);
> + trace_wbt_stat(bdi, stat);
>  
>   return LAT_OK;
>  }
> @@ -338,7 +339,9 @@ static int latency_exceeded(struct rq_wb *rwb)
>  
>  static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
>  {
> - trace_wbt_step(rwb->bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
> + struct backing_dev_info *bdi = &rwb->queue->backing_dev_info;
> +
> + trace_wbt_step(bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
>   rwb->wb_background, rwb->wb_normal, rwb->wb_max);
>  }
>  
> @@ -420,7 +423,8 @@ static void wb_timer_fn(unsigned long data)
>  
>   status = latency_exceeded(rwb);
>  
> - trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
> + trace_wbt_timer(&rwb->queue->backing_dev_info, status, rwb->scale_step,
> + inflight);
>  
>   /*
>* If we exceeded the latency target, step down. If we did not,
> @@ -700,7 +704,7 @@ int wbt_init(struct request_queue *q, struct wb_stat_ops 
> *ops)
>   rwb->wc = 1;
>   rwb->queue_depth = RWB_DEF_DEPTH;
>   rwb->last_comp = rwb->last_issue = jiffies;
> - rwb->bdi = &q->backing_dev_info;
> + rwb->queue = q;
>   rwb->win_nsec = RWB_WINDOW_NSEC;
>   rwb->stat_ops = ops;
>   rwb->ops_data = q;
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index e57700337c26..09c61a3f8295 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -87,7 +87,7 @@ struct rq_wb {
>   unsigned long last_issue;   /* last non-throttled issue */
>   unsigned long last_comp;/* last non-throttled comp */
>   unsigned long min_lat_nsec;
> - struct backing_dev_info *bdi;
> + struct request_queue *queue;
>   struct rq_wait rq_wait[WBT_NUM_RWQ];
>  
>   struct wb_stat_ops *stat_ops;
> @@ -104,8 +104,6 @@ static inline unsigned int wbt_inflight(struct rq_wb *rwb)
>   return ret;
>  }
>  
> -struct backing_dev_info;
> -
>  #ifdef CONFIG_BLK_WBT
>  
>  void __wbt_done(struct rq_wb *, enum wbt_flags);
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] blk-wbt: remove stat ops

2016-11-14 Thread Jan Kara
On Fri 11-11-16 08:21:57, Jens Axboe wrote:
> Again a leftover from when the throttling code was generic. Now that we
> just have the block user, get rid of the stat ops and indirections.
> 
> Signed-off-by: Jens Axboe 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  block/blk-sysfs.c | 23 +--
>  block/blk-wbt.c   | 15 +--
>  block/blk-wbt.h   | 13 ++---
>  3 files changed, 8 insertions(+), 43 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 9262d2d60a09..415e764807d0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -770,27 +770,6 @@ struct kobj_type blk_queue_ktype = {
>   .release= blk_release_queue,
>  };
>  
> -static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
> -{
> - blk_queue_stat_get(data, stat);
> -}
> -
> -static void blk_wb_stat_clear(void *data)
> -{
> - blk_stat_clear(data);
> -}
> -
> -static bool blk_wb_stat_is_current(struct blk_rq_stat *stat)
> -{
> - return blk_stat_is_current(stat);
> -}
> -
> -static struct wb_stat_ops wb_stat_ops = {
> - .get= blk_wb_stat_get,
> - .is_current = blk_wb_stat_is_current,
> - .clear  = blk_wb_stat_clear,
> -};
> -
>  static void blk_wb_init(struct request_queue *q)
>  {
>  #ifndef CONFIG_BLK_WBT_MQ
> @@ -805,7 +784,7 @@ static void blk_wb_init(struct request_queue *q)
>   /*
>* If this fails, we don't get throttling
>*/
> - wbt_init(q, &wb_stat_ops);
> + wbt_init(q);
>  }
>  
>  int blk_register_queue(struct gendisk *disk)
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 4ab9cebc8003..f6ec7e587fa6 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -308,7 +308,7 @@ static int __latency_exceeded(struct rq_wb *rwb, struct 
> blk_rq_stat *stat)
>* waited or still has writes in flights, consider us doing
>* just writes as well.
>*/
> - if ((stat[1].nr_samples && rwb->stat_ops->is_current(stat)) ||
> + if ((stat[1].nr_samples && blk_stat_is_current(stat)) ||
>   wb_recent_wait(rwb) || wbt_inflight(rwb))
>   return LAT_UNKNOWN_WRITES;
>   return LAT_UNKNOWN;
> @@ -333,7 +333,7 @@ static int latency_exceeded(struct rq_wb *rwb)
>  {
>   struct blk_rq_stat stat[2];
>  
> - rwb->stat_ops->get(rwb->ops_data, stat);
> + blk_queue_stat_get(rwb->queue, stat);
>   return __latency_exceeded(rwb, stat);
>  }
>  
> @@ -355,7 +355,7 @@ static void scale_up(struct rq_wb *rwb)
>  
>   rwb->scale_step--;
>   rwb->unknown_cnt = 0;
> - rwb->stat_ops->clear(rwb->ops_data);
> + blk_stat_clear(rwb->queue);
>  
>   rwb->scaled_max = calc_wb_limits(rwb);
>  
> @@ -385,7 +385,7 @@ static void scale_down(struct rq_wb *rwb, bool 
> hard_throttle)
>  
>   rwb->scaled_max = false;
>   rwb->unknown_cnt = 0;
> - rwb->stat_ops->clear(rwb->ops_data);
> + blk_stat_clear(rwb->queue);
>   calc_wb_limits(rwb);
>   rwb_trace_step(rwb, "step down");
>  }
> @@ -675,7 +675,7 @@ void wbt_disable(struct rq_wb *rwb)
>  }
>  EXPORT_SYMBOL_GPL(wbt_disable);
>  
> -int wbt_init(struct request_queue *q, struct wb_stat_ops *ops)
> +int wbt_init(struct request_queue *q)
>  {
>   struct rq_wb *rwb;
>   int i;
> @@ -688,9 +688,6 @@ int wbt_init(struct request_queue *q, struct wb_stat_ops 
> *ops)
>   BUILD_BUG_ON(RWB_WINDOW_NSEC > BLK_STAT_NSEC);
>   BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS);
>  
> - if (!ops->get || !ops->is_current || !ops->clear)
> - return -EINVAL;
> -
>   rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
>   if (!rwb)
>   return -ENOMEM;
> @@ -706,8 +703,6 @@ int wbt_init(struct request_queue *q, struct wb_stat_ops 
> *ops)
>   rwb->last_comp = rwb->last_issue = jiffies;
>   rwb->queue = q;
>   rwb->win_nsec = RWB_WINDOW_NSEC;
> - rwb->stat_ops = ops;
> - rwb->ops_data = q;
>   wbt_update_limits(rwb);
>  
>   /*
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index 09c61a3f8295..44dc2173dc1f 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -46,12 +46,6 @@ static inline bool wbt_is_read(struct blk_issue_stat *stat)
>   return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
>  }
>  
> -struct wb_stat_ops {
> -

[PATCH] block: protect iterate_bdevs() against concurrent close

2016-12-01 Thread Jan Kara
From: Rabin Vincent 

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->bd_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0508
 IP: [] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: 880009f4d700 ti: 880009f5c000 task.ti: 880009f5c000
 RIP: 0010:[]  [] 
blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:880009f5fe68  EFLAGS: 00010246
 RAX:  RBX: 88000ec17a38 RCX: 81a4e940
 RDX: 7fff RSI:  RDI: 88000ec176c0
 RBP: 880009f5fe68 R08:  R09: 
 R10: 0001 R11:  R12: 88000ec17860
 R13: 811b25c0 R14: 88000ec178e0 R15: 88000ec17a38
 FS:  7faee505d700() GS:88000fb0() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0508 CR3: 09e8a000 CR4: 06e0
 Stack:
  880009f5feb8 8112e7f5  7fff
    7fff 0001
  88000ec178e0 88000ec17860 880009f5fec8 8112e81f
 Call Trace:
  [] __filemap_fdatawrite_range+0x85/0x90
  [] filemap_fdatawrite+0x1f/0x30
  [] fdatawrite_one_bdev+0x16/0x20
  [] iterate_bdevs+0xf2/0x130
  [] sys_sync+0x63/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [] blk_get_backing_dev_info+0x10/0x20
  RSP 
 CR2: 0508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: sta...@vger.kernel.org
Fixes: 5c0d6b60a0ba46d45020547eacf7199171920935
Reported-and-tested-by: Wei Fang 
Signed-off-by: Rabin Vincent 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 05b553368bb4..899fa8ccc347 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1950,6 +1950,7 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
spin_lock(&blockdev_superblock->s_inode_list_lock);
list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
+   struct block_device *bdev;
 
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1970,8 +1971,12 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
 */
iput(old_inode);
old_inode = inode;
+   bdev = I_BDEV(inode);
 
-   func(I_BDEV(inode), arg);
+   mutex_lock(&bdev->bd_mutex);
+   if (bdev->bd_openers)
+   func(bdev, arg);
+   mutex_unlock(&bdev->bd_mutex);
 
spin_lock(&blockdev_superblock->s_inode_list_lock);
}
-- 
2.6.6



Re: [Lsf-pc] LSF/MM 2017: Call for Proposals

2016-12-08 Thread Jan Kara
On Thu 08-12-16 13:26:19, Michal Hocko wrote:
> On Wed 07-12-16 06:57:06, James Bottomley wrote:
> [...]
> > Just on this point, since there seems to be a lot of confusion: lsf-pc
> > is the list for contacting the programme committee, so you cannot
> > subscribe to it.
> > 
> > There is no -discuss equivalent, like kernel summit has, because we
> > expect you to cc the relevant existing mailing list and have the
> > discussion there instead rather than expecting people to subscribe to a
> > new list.
> 
> There used to be l...@lists.linux-foundation.org. Is it not active
> anymore?

Not sure about the current state, but usually it gets populated by the
selected attendees each year.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-14 Thread Jan Kara
On Tue 13-12-16 16:24:33, Jerome Glisse wrote:
> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > I would like to discuss un-addressable device memory in the context of
> > > > > filesystem and block device. Specificaly how to handle write-back, 
> > > > > read,
> > > > > ... when a filesystem page is migrated to device memory that CPU can 
> > > > > not
> > > > > access.
> > > > 
> > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > becomes non-DAX capable?
> > > > 
> > > > If you are not talking about pmem and DAX, then exactly what does
> > > > "when a filesystem page is migrated to device memory that CPU can
> > > > not access" mean? What "filesystem page" are we talking about that
> > > > can get migrated from main RAM to something the CPU can't access?
> > > 
> > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > board memory that can not be expose transparently to the CPU. I am
> > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > https://lwn.net/Articles/706856/
> > 
> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> 
> Well not only target, it can be source too. But the device can read
> and write any system memory and dma to/from that memory to its on
> board memory.
> 
> > 
> > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > "regular" filesystem back by a "regular" block device. I want to be
> > > able to migrate mmaped area of such filesystem to device memory while
> > > the device is actively using that memory.
> > 
> > "migrate mmapped area of such filesystem" means what, exactly?
> 
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);
> 
> > 
> > Are you talking about file data contents that have been copied into
> > the page cache and mmapped into a user process address space?
> > IOWs, migrating ZONE_NORMAL page cache page content and state
> > to a new ZONE_DEVICE page, and then migrating back again somehow?
> 
> Take any existing application that mmap a file and allow to migrate
> chunk of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file. It is a regular file on your filesystem.

OK, so I share most of Dave's concerns about this. But let's talk about
what we can do and what you need and we may find something usable. First
let me understand what is doable / what are the costs on your side.

So we have a page cache page that you'd like to migrate to the device.
Fine. You are willing to sacrifice direct IO - even better. We can fall
back to buffered IO in that case (well, except for XFS which does not do it
but that's a minor detail). One thing I'm not sure about: when a page is
migrated to the device, are its contents still available and just possibly
stale, or will something bad happen if we try to access (or even modify) the
page data?

And by migration you really mean page migration? Be aware that migration of
pagecache pages may be a problem on its own for some pages of some
filesystems - e.g. page migration may fail because there is an outstanding
filesystem transaction modifying that page. For userspace these will be
sporadic errors that are really hard to understand, because they stem from
filesystem internals. So far page migration has been widely used only for
free space defragmentation, and for that purpose, if a page is not
migratable for a minute, who cares.

So won't it be easier to leave the pagecache page where it is and *copy* it
to the device? Can the device notify us *before* it is going to modify a
page, not just after it has modified it? Possibly if we just give it the
page read-only and it will have to ask CPU to get write permission? If yes,
then I believe this could work and even fs support should be doable.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-15 Thread Jan Kara
On Wed 14-12-16 12:15:14, Jerome Glisse wrote:


> > So won't it be easier to leave the pagecache page where it is and *copy* it
> > to the device? Can the device notify us *before* it is going to modify a
> > page, not just after it has modified it? Possibly if we just give it the
> > page read-only and it will have to ask CPU to get write permission? If yes,
> > then I believe this could work and even fs support should be doable.
> 
> Well yes and no. Device obey the same rule as CPU so if a file back page is
> map read only in the process it must first do a write fault which will call
> in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> there is no way to be notify by hardware on every write. First the hardware
> do not have the capability. Second we are talking thousand (10 000 is upper
> range in today device) of concurrent thread, each can possibly write to page
> under consideration.

Sure, I meant whether the device is able to do the equivalent of a
->page_mkwrite notification, which apparently it is. OK.

> We really want the device page to behave just like regular page. Most fs code
> path never map file content, it only happens during read/write and i believe
> this can be handled either by migrating back or by using bounce page. I want
> to provide the choice between the two solutions as one will be better for some
> workload and the other for different workload.

I agree with keeping a page used by the device behaving as similarly as
possible to any other page. I'm just exploring different possibilities for
how to make that happen. E.g. the scheme I was aiming at is:

When you want page A to be used by the device, you set up page A' in the
device but make sure any access to it will fault.

When the device wants to access A', it notifies the CPU, which writeprotects
all mappings of A, copies A to A', and maps A' read-only for the device.

When the device wants to write to A', it notifies the CPU, which will clear
all mappings of A and mark A as not-uptodate & dirty. When the CPU then
wants to access the data in A again - we need to catch ->readpage,
->readpages, ->writepage, ->writepages - it will writeprotect A' in
the device, copy data to A, mark A as uptodate & dirty, and off we go.

When we want to write to the page on the CPU - we either get a wp fault if
it was via mmap, or we have to catch it in places using kmap() - we just
remove the device's access to A'.

This scheme makes the device mapping functionality transparent to the
filesystem (you actually don't need to hook directly into ->readpage etc.
handlers, you can just have wrappers around them for this functionality)
and fairly straightforward... It is so transparent that even direct IO works
with this since the page cache invalidation pass we do before actually doing
the direct IO will make sure to pull all the pages from the device and write
them to disk if needed. What do you think?
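
To make the transitions easier to follow, here is a tiny userspace model of
the scheme above - state names and helpers are invented for illustration,
real code would track this as per-page state:

#include <stdio.h>

/* States of the (A, A') pair in the scheme sketched above. */
enum page_state {
        CPU_ONLY,       /* A uptodate, device has no access to A' */
        DEVICE_RO,      /* A writeprotected, A' is a read-only copy */
        DEVICE_RW,      /* A !uptodate & dirty, A' writable on device */
};

static const char *name[] = { "CPU_ONLY", "DEVICE_RO", "DEVICE_RW" };

static enum page_state device_reads(enum page_state s)
{
        /* Device fault: writeprotect all mappings of A, copy A -> A'. */
        return s == CPU_ONLY ? DEVICE_RO : s;
}

static enum page_state device_writes(enum page_state s)
{
        /* Write fault: clear mappings of A, mark A !uptodate & dirty. */
        s = device_reads(s);
        return s == DEVICE_RO ? DEVICE_RW : s;
}

static enum page_state cpu_reads(enum page_state s)
{
        /* ->readpage etc.: writeprotect A', copy A' -> A, A uptodate. */
        return s == DEVICE_RW ? DEVICE_RO : s;
}

static enum page_state cpu_writes(enum page_state s)
{
        /* wp fault / write_begin: remove the device's access to A'. */
        (void)s;
        return CPU_ONLY;
}

int main(void)
{
        enum page_state s = CPU_ONLY;

        printf("start:        %s\n", name[s]);
        s = device_reads(s);  printf("device read:  %s\n", name[s]);
        s = device_writes(s); printf("device write: %s\n", name[s]);
        s = cpu_reads(s);     printf("cpu read:     %s\n", name[s]);
        s = cpu_writes(s);    printf("cpu write:    %s\n", name[s]);
        return 0;
}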

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-16 Thread Jan Kara
On Thu 15-12-16 14:14:53, Jerome Glisse wrote:
> On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote:
> > On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> >  > page handling>
> > 
> > > > So won't it be easier to leave the pagecache page where it is and 
> > > > *copy* it
> > > > to the device? Can the device notify us *before* it is going to modify a
> > > > page, not just after it has modified it? Possibly if we just give it the
> > > > page read-only and it will have to ask CPU to get write permission? If 
> > > > yes,
> > > > then I believe this could work and even fs support should be doable.
> > > 
> > > Well yes and no. Device obey the same rule as CPU so if a file back page 
> > > is
> > > map read only in the process it must first do a write fault which will 
> > > call
> > > in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> > > there is no way to be notify by hardware on every write. First the 
> > > hardware
> > > do not have the capability. Second we are talking thousand (10 000 is 
> > > upper
> > > range in today device) of concurrent thread, each can possibly write to 
> > > page
> > > under consideration.
> > 
> > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> > notification which apparently it is. OK.
> > 
> > > We really want the device page to behave just like regular page. Most fs 
> > > code
> > > path never map file content, it only happens during read/write and i 
> > > believe
> > > this can be handled either by migrating back or by using bounce page. I 
> > > want
> > > to provide the choice between the two solutions as one will be better for 
> > > some
> > > workload and the other for different workload.
> > 
> > I agree with keeping page used by the device behaving as similar as
> > possible as any other page. I'm just exploring different possibilities how
> > to make that happen. E.g. the scheme I was aiming at is:
> > 
> > When you want page A to be used by the device, you set up page A' in the
> > device but make sure any access to it will fault.
> > 
> > When the device wants to access A', it notifies the CPU, that writeprotects
> > all mappings of A, copy A to A' and map A' read-only for the device.
> > 
> > When the device wants to write to A', it notifies the CPU, which will clear
> > all mappings of A and mark A as not-uptodate & dirty. When the CPU then
> > wants to access the data in A again - we need to catch ->readpage,
> > ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> > the device, copy data to A, mark A as uptodate & dirty, and off we go.
> > 
> > When we want to write to the page on the CPU - we get either a wp fault if it
> > was via mmap, or we have to catch that in places using kmap() - we just remove
> > access to A' from the device.
> > 
> > This scheme makes the device mapping functionality transparent to the
> > filesystem (you actually don't need to hook directly into ->readpage etc.
> > handlers, you can just have wrappers around them for this functionality)
> > and fairly straightforward... It is so transparent that even direct IO works
> > with this since the page cache invalidation pass we do before actually doing
> > the direct IO will make sure to pull all the pages from the device and write
> > them to disk if needed. What do you think?
> 
> This is doable but I think it will require the same amount of changes as
> what I had in mind (excluding the block bounce code), with one drawback: doing
> it that way we cannot free page A.

I guess I'd have to see code implementing your approach to be able to judge
what ends up being less code - the devil is in the details here, I believe.
Actually, when thinking about it with a fresh mind, I don't think we'd have
to catch kmap() at all with my approach - all writes could be caught either
in grab_cache_page_write_begin() or in page_mkwrite(). What I like about my
solution is that it is completely fs agnostic and the places that need
handling of device pages have very relaxed locking constraints - grabbing the
locks necessary to update mappings / communicate with the device should be a
no-brainer in those contexts.
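
E.g. the mkwrite side could look something like this - a minimal sketch
only, where device_page_lookup() / device_unmap() stand in for whatever
hooks the driver would actually provide (they are not an existing API):

static int device_aware_page_mkwrite(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct device_copy *dcopy = device_page_lookup(page);	/* A' for A */

	if (dcopy)
		device_unmap(dcopy);	/* revoke the device's access to A' */

	/* then let the filesystem's own ->page_mkwrite do its work */
	return orig_vm_ops->page_mkwrite(vma, vmf);
}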

> On some workloads this probably does not hurt much, but on a workload where you
> read a big dataset from disk and then use it only on the GPU for a long period
> of time (minutes/hours) you will waste GBs of system memory.

Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-19 Thread Jan Kara
On Fri 16-12-16 08:40:38, Aneesh Kumar K.V wrote:
> Jan Kara  writes:
> 
> > On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> > <snip - discussion of device page handling>
> >
> >> > So won't it be easier to leave the pagecache page where it is and *copy*
> >> > it to the device? Can the device notify us *before* it is going to modify
> >> > a page, not just after it has modified it? Possibly if we just give it the
> >> > page read-only and it will have to ask the CPU to get write permission? If
> >> > yes, then I believe this could work and even fs support should be doable.
> >> 
> >> Well yes and no. The device obeys the same rules as the CPU, so if a
> >> file-backed page is mapped read-only in the process, it must first do a
> >> write fault, which will call into the fs (page_mkwrite() of vm_ops). But
> >> once a page has write permission there is no way to be notified by the
> >> hardware on every write. First, the hardware does not have the capability.
> >> Second, we are talking about thousands (10 000 is the upper range in
> >> today's devices) of concurrent threads, each of which can possibly write
> >> to the page under consideration.
> >
> > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> > notification which apparently it is. OK.
> >
> >> We really want the device page to behave just like a regular page. Most
> >> fs code paths never map file content; it only happens during read/write,
> >> and I believe this can be handled either by migrating back or by using a
> >> bounce page. I want to provide the choice between the two solutions, as
> >> one will be better for some workloads and the other for different
> >> workloads.
> >
> > I agree with keeping a page used by the device behaving as similarly as
> > possible to any other page. I'm just exploring different possibilities for
> > how to make that happen. E.g. the scheme I was aiming at is:
> >
> > When you want page A to be used by the device, you set up page A' in the
> > device but make sure any access to it will fault.
> >
> > When the device wants to access A', it notifies the CPU, which writeprotects
> > all mappings of A, copies A to A', and maps A' read-only for the device.
> 
> 
> A and A' will have different pfns here and hence different struct page.

Yes. In fact I don't think there's a need to have a struct page for A' in my
scheme - at least for the purposes of page cache tracking... Maybe there's a
good reason to have it from a device driver POV.

> So what will be there in the address_space->page_tree? If we place
> A' in the page cache, then we are essentially bringing in a lot of the locking
> complexity Dave talked about in previous mails.

No, I meant page A will stay in the page_tree. There's no need for
migration in my scheme.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] clean_bdev_aliases: Prevent cleaning blocks that are not in block range

2017-01-02 Thread Jan Kara
On Sun 25-12-16 19:01:03, Chandan Rajendra wrote:
> The first block to be cleaned may start at a non-zero page offset. In
> such a scenario clean_bdev_aliases() will end up cleaning blocks that
> do not fall in the range of blocks to be cleaned. This commit fixes the
> issue by skipping blocks that do not fall in valid block range.
> 
> Signed-off-by: Chandan Rajendra 

Ah, very good catch! How did you spot this?

The patch looks good. You can add:

Reviewed-by: Jan Kara 

Jens, please merge this fix quickly as we may end up discarding changes to
innocent metadata blocks due to this... Thanks!
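
(To make the failure mode concrete - my numbers, assuming 1k blocks and 4k
pages, i.e. four buffers per page:)

/* clean_bdev_aliases(bdev, block = 10, len = 1) visits the page covering
 * blocks 8..11. Blocks 8 and 9 are mapped and satisfy
 * b_blocknr < block + len, so the old code cleaned them as well, although
 * they lie below 'block'. The new b_blocknr < block check skips them. */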

Honza
> ---
>  fs/buffer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1df2bd5..28484b3 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1660,7 +1660,7 @@ void clean_bdev_aliases(struct block_device *bdev, 
> sector_t block, sector_t len)
>   head = page_buffers(page);
>   bh = head;
>   do {
> - if (!buffer_mapped(bh))
> + if (!buffer_mapped(bh) || (bh->b_blocknr < 
> block))
>   goto next;
>   if (bh->b_blocknr >= block + len)
>   break;
> -- 
> 2.5.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] clean_bdev_aliases: Prevent cleaning blocks that are not in block range

2017-01-02 Thread Jan Kara
On Tue 03-01-17 00:21:45, Eryu Guan wrote:
> On Mon, Jan 02, 2017 at 04:58:03PM +0100, Jan Kara wrote:
> > On Sun 25-12-16 19:01:03, Chandan Rajendra wrote:
> > > The first block to be cleaned may start at a non-zero page offset. In
> > > such a scenario clean_bdev_aliases() will end up cleaning blocks that
> > > do not fall in the range of blocks to be cleaned. This commit fixes the
> > > issue by skipping blocks that do not fall in valid block range.
> > > 
> > > Signed-off-by: Chandan Rajendra 
> > 
> > Ah, very good catch! How did you spot this?
> 
> I failed to notice this patch; I came up with the same patch today
> myself, and I'm still testing it.
> 
> I found this via xfstests: many tests (tens of tests) failed fsck after
> the test when testing extN with blocksize < pagesize. E.g. generic/013 could
> reproduce the fs corruption quite reliably for me.
> 
> Reviewed-by: Eryu Guan 

Thanks! Yeah, I had run xfstests with those patches but not with blocksize
< pagesize :-| Shit happens... need to be more careful next time.

        Honza

> > The patch looks good. You can add:
> > 
> > Reviewed-by: Jan Kara 
> > 
> > Jens, please merge this fix quickly as we may end up discarding changes to
> > innocent metadata blocks due to this... Thanks!
> > 
> > Honza
> > > ---
> > >  fs/buffer.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index 1df2bd5..28484b3 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -1660,7 +1660,7 @@ void clean_bdev_aliases(struct block_device *bdev, 
> > > sector_t block, sector_t len)
> > >   head = page_buffers(page);
> > >   bh = head;
> > >   do {
> > > - if (!buffer_mapped(bh))
> > > + if (!buffer_mapped(bh) || (bh->b_blocknr < 
> > > block))
> > >   goto next;
> > >   if (bh->b_blocknr >= block + len)
> > >   break;
> > > -- 
> > > 2.5.5
> > > 
> > -- 
> > Jan Kara 
> > SUSE Labs, CR
-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-03 Thread Jan Kara
On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > > wrote:
> > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > workload again and about a minute in this fired:
> > > > > >
> > > > > > [628867.607417] [ cut here ]
> > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > > shadow_lru_isolate+0x171/0x220
> > > > > 
> > > > > Well, part of the changes during the merge window were the shadow
> > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > Johannes Weiner to the participants.
> > > > > 
> > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > affecting the working set code.
> > > > > 
> > > > > Well, I suspect that anything that creates memory pressure will end up
> > > > > triggering the working set code, so ..
> > > > > 
> > > > > That said, obviously memory corruption could be involved and result in
> > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > 
> > > > > It would probably be really useful to get more data points - is the
> > > > > problem reliably in this area, or is it going to be random and all
> > > > > over the place.
> > > > 
> > > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > > shadow_lru_isolate,
> > > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > investigate, just set it aside as something to worry about if it happens
> > > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > > case (I've no XFS and no iscsi), suspect unrelated to his other 
> > > > problems.
> > > 
> > > This seems consistent with what Dave observed: we encounter regular
> > > pages in radix tree nodes on the shadow LRU that should only contain
> > > nodes full of exceptional shadow entries. It could be an issue in the
> > > new slot replacement code and the node tracking callback.
> > 
> > Both encounters seem to indicate use-after-free. Dave's node didn't
> > warn about an unexpected node->count / node->exceptional state, but
> > had entries that were inconsistent with that. Hugh got the counter
> > warning but crashed on a list_head that's not NULLed in a live node.
> > 
> > workingset_update_node() should be called on page cache radix tree
> > leaf nodes that go empty. I must be missing an update_node callback
> > where a leaf node gets freed somewhere.
> 
> Sorry for dropping silent on this. I'm traveling over the holidays
> with sporadic access to my emails and no access to real equipment.
> 
> The times I managed to sneak away to look at the code didn't turn up
> anything useful yet.
> 
> Andrea encountered the warning as well and I gave him a debugging
> patch (attached below), but he hasn't been able to reproduce this
> condition. I've personally never seen the warning trigger, even though
> the patches have been running on my main development machine for quite
> a while now. Albeit against an older base; I've updated to Linus's
> master branch now in case it's an interaction with other new code.
> 
> If anybody manages to reproduce this, that would be helpful. Any extra
> eyes on this would be much appreciated too until I'm back at my desk.

I was looking into this but I didn't find a way we could possibly leave a
radix tree node on the LRU. So your debug patch looks like a good way forward.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM ATTEND AND AGENDA TOPIC] request to attend the summit

2017-01-03 Thread Jan Kara
On Tue 03-01-17 18:23:17, Paolo Valente wrote:
> 
> > Il giorno 03 gen 2017, alle ore 13:00, Bart Van Assche 
> >  ha scritto:
> > 
> > On Tue, 2017-01-03 at 10:39 +0100, Paolo Valente wrote:
> >> In this respect, I hope that the committee does not meet too soon.
> > 
> > Hello Paolo,
> > 
> 
> Hi
> 
> > I don't know when the program committee will meet. However, what I
> > remember from previous editions of the LSF/MM is that changes of the
> > agenda happened not only during the weeks before the LSF/MM but even
> > during the LSF/MM.
> > 
> 
> That would be perfectly fine.  The only piece of information that I
> would need in advance is whether my request to attend is accepted, so
> that I can organize my travel and really participate.

We certainly notify selected people (and also rejected ones) sufficiently in
advance.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue

2017-01-06 Thread Jan Kara
On Thu 05-01-17 17:17:55, Dan Williams wrote:
> The ->bd_queue member of struct block_device was added in commit
> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
> v3.3. However, blk_get_backing_dev_info() has been using
> bdev_get_queue() and grabbing the request_queue through the gendisk
> since before the git era.
> 
> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
> not. The queue remains valid until the final put of the parent disk.
> 
> The following crash signature results from blk_get_backing_dev_info()
> trying to lookup the queue through ->bd_disk after the final put of the
> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
> which is guaranteed to still be valid at invalidate_partition() time.
> 
>  BUG: unable to handle kernel NULL pointer dereference at 0568
>  IP: blk_get_backing_dev_info+0x10/0x20
>  [..]
>  Call Trace:
>   __inode_attach_wb+0x3a7/0x5d0
>   __filemap_fdatawrite_range+0xf8/0x100
>   filemap_write_and_wait+0x40/0x90
>   fsync_bdev+0x54/0x60
>   ? bdget_disk+0x30/0x40
>   invalidate_partition+0x24/0x50
>   del_gendisk+0xfa/0x230

So we have similar reports of the same problem. E.g.:

http://www.spinics.net/lists/linux-fsdevel/msg105153.html

However I kind of miss how your patch would fix all those cases. The
principal problem is that inode_to_bdi() called on a block device inode
wants to get the backing_dev_info, however on the last close of a block device
we do put_disk() and thus the request queue containing the backing_dev_info
does not have to be around at that time. In your case you are lucky enough
to have the containing disk still around, but that's not the case for all
inode_to_bdi() users (see e.g. the report I referenced), and your patch
would change a relatively straightforward NULL pointer dereference into a
rather subtle use-after-free issue so I disagree with going down this path.

So what I think needs to be done is that we make backing_dev_info an
independently allocated structure with different lifetime rules from the
gendisk or request_queue - definitely I want it to live as long as the block
device inode exists. However, it needs more thought what the exact lifetime
rules will be.
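
(Roughly the shape I have in mind - the names are made up and the exact
lifetime rules are precisely what still needs to be decided:)

/* backing_dev_info separately allocated and refcounted; the bdev inode,
 * the gendisk and the request_queue would each hold a reference: */
struct backing_dev_info *bdi_alloc(gfp_t gfp)
{
	struct backing_dev_info *bdi = kzalloc(sizeof(*bdi), gfp);

	if (!bdi)
		return NULL;
	if (bdi_init(bdi)) {
		kfree(bdi);
		return NULL;
	}
	kref_init(&bdi->refcnt);	/* dropped via a bdi_put() helper */
	return bdi;
}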

Honza
> 
> Cc: 
> Cc: Jan Kara 
> Cc: Jens Axboe 
> Cc: Jeff Moyer 
> Cc: Christoph Hellwig 
> Cc: Wei Fang 
> Cc: Rabin Vincent 
> Cc: Andi Kleen 
> Fixes: 87192a2a49c4 ("vfs: cache request_queue in struct block_device")
> Signed-off-by: Dan Williams 
> ---
>  block/blk-core.c   |4 ++--
>  include/linux/blkdev.h |2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 61ba08c58b64..04737548e1e1 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -109,8 +109,8 @@ void blk_queue_congestion_threshold(struct request_queue 
> *q)
>   * @bdev:device
>   *
>   * Locates the passed device's request queue and returns the address of its
> - * backing_dev_info.  This function can only be called if @bdev is opened
> - * and the return value is never NULL.
> + * backing_dev_info.  This function can only be called while there is an
> + * active reference against the parent gendisk.
>   */
>  struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
>  {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 83695641bd5e..27779709f821 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -962,7 +962,7 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t 
> cookie);
>  
>  static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>  {
> - return bdev->bd_disk->queue;/* this is never NULL */
> + return bdev->bd_queue;  /* this is never NULL */
>  }
>  
>  /*
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue

2017-01-08 Thread Jan Kara
On Fri 06-01-17 09:45:45, Dan Williams wrote:
> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara  wrote:
> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
> >> The ->bd_queue member of struct block_device was added in commit
> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
> >> v3.3. However, blk_get_backing_dev_info() has been using
> >> bdev_get_queue() and grabbing the request_queue through the gendisk
> >> since before the git era.
> >>
> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
> >> not. The queue remains valid until the final put of the parent disk.
> >>
> >> The following crash signature results from blk_get_backing_dev_info()
> >> trying to lookup the queue through ->bd_disk after the final put of the
> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
> >> which is guaranteed to still be valid at invalidate_partition() time.
> >>
> >>  BUG: unable to handle kernel NULL pointer dereference at 0568
> >>  IP: blk_get_backing_dev_info+0x10/0x20
> >>  [..]
> >>  Call Trace:
> >>   __inode_attach_wb+0x3a7/0x5d0
> >>   __filemap_fdatawrite_range+0xf8/0x100
> >>   filemap_write_and_wait+0x40/0x90
> >>   fsync_bdev+0x54/0x60
> >>   ? bdget_disk+0x30/0x40
> >>   invalidate_partition+0x24/0x50
> >>   del_gendisk+0xfa/0x230
> >
> > So we have similar reports of the same problem. E.g.:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
> >
> > However I kind of miss how your patch would fix all those cases. The
> > principal problem is that inode_to_bdi() called on a block device inode
> > wants to get the backing_dev_info, however on the last close of a block device
> > we do put_disk() and thus the request queue containing the backing_dev_info
> > does not have to be around at that time. In your case you are lucky enough
> > to have the containing disk still around, but that's not the case for all
> > inode_to_bdi() users (see e.g. the report I referenced), and your patch
> > would change a relatively straightforward NULL pointer dereference into a
> > rather subtle use-after-free issue
> 
> True. If there are other cases that don't hold their own queue
> reference this patch makes things worse.
> 
> > so I disagree with going down this path.
> 
> I still think this patch is the right thing to do, but it needs to
> come after the wider guarantee that having an active bdev reference
> guarantees that the queue and backing_dev_info are still allocated.
> 
> > So what I think needs to be done is that we make backing_dev_info an
> > independently allocated structure with different lifetime rules from the
> > gendisk or request_queue - definitely I want it to live as long as the block
> > device inode exists. However, it needs more thought what the exact lifetime
> > rules will be.
> 
> Hmm, why does it need to be separately allocated?
> 
> Something like this, passes the libnvdimm unit tests: (non-whitespace
> damaged version attached)

So the problem with this approach is that the request queue will be pinned while
the bdev inode exists. And how long that is, is impossible to predict or influence
from userspace, so e.g. you cannot remove a device driver from memory, and even
unplugging a USB device after it has been unmounted would suddenly go via the
"device removed while it is used" path, which can have unexpected consequences. I
guess Jens or Christoph will know more about the details...

I have prototyped patches which split backing_dev_info from request_queue
and it was not even that difficult in the end. I'll give those patches some
testing and post them for comments...

Honza

> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7c1a24fb09d8..6742c49b38e4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -378,7 +378,8 @@ EXPORT_SYMBOL(blk_run_queue);
> 
>  void blk_put_queue(struct request_queue *q)
>  {
> -   kobject_put(&q->kobj);
> +   if (q)
> +   kobject_put(&q->kobj);
>  }
>  EXPORT_SYMBOL(blk_put_queue);
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 05b553368bb4..7503110e37ff 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -589,12 +589,22 @@ static struct inode *bdev_alloc_inode(struct
> super_block *sb)
> return &ei->vfs_inode;
>  }
> 
> +static void bdev_release(struct work_struct *w)
> +{
> +   struct block_device *bdev = container_of(w, typeof(*bdev), 
> bd_rele

Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jan Kara
On Sun 08-01-17 20:17:10, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Ah, good catch. You can add:

Reviewed-by: Jan Kara 
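
(For the record, the arithmetic behind the commit message - the numbers are
from the setup described above, the shift math is mine:)

/* ext4 with a 64k block size on a virtio disk with 512-byte logical blocks:
 *   inode->i_blkbits = 16, sdio->blkbits = 9
 *   map_bh->b_size = 65536 for a single fs block, so
 *   b_size >> blkbits   = 128  -> overshoots the range to clean
 *   b_size >> i_blkbits = 1    -> matches the units of map_bh->b_blocknr
 */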

Honza

> Signed-off-by: Chandan Rajendra 
> ---
>  fs/direct-io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index aeae8c0..b20adf9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -905,6 +905,7 @@ static inline void dio_zero_block(struct dio *dio, struct 
> dio_submit *sdio,
>  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
>   struct buffer_head *map_bh)
>  {
> + const unsigned i_blkbits = dio->inode->i_blkbits;
>   const unsigned blkbits = sdio->blkbits;
>   int ret = 0;
>  
> @@ -949,7 +950,7 @@ static int do_direct_IO(struct dio *dio, struct 
> dio_submit *sdio,
>   clean_bdev_aliases(
>   map_bh->b_bdev,
>   map_bh->b_blocknr,
> - map_bh->b_size >> blkbits);
> + map_bh->b_size >> i_blkbits);
>   }
>  
>   if (!sdio->blkfactor)
> -- 
> 2.5.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Jan Kara
On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > >  wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to the deletion function, and the fact that we do it only in cases
where we think it is needed, appears error-prone. With the warning you've
added it should at least catch the cases where we got it wrong, but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing the radix tree root by one pointer, which would
be unpopular I guess.
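
(Roughly what I mean - illustrative only, on top of the current structure
layout:)

struct radix_tree_root {
	gfp_t			gfp_mask;
	struct radix_tree_node	__rcu *rnode;
	radix_tree_update_node_t node_free;	/* new: called whenever a
						   node is deleted */
};

The deletion paths would then invoke root->node_free unconditionally instead
of relying on every caller to pass the right callback in.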

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] [LSF/MM ATTEND] FS Management Interfaces

2017-01-10 Thread Jan Kara
Hi,

On Tue 10-01-17 09:44:59, Steven Whitehouse wrote:
> I had originally thought about calling the proposal "kernel/userland
> interface", however that seemed a bit vague and management interfaces seems
> like a better title since it is I hope a bit clearer of the kind of thing
> that I'm thinking about in this case.
> 
> There are a number of possible sub-topics and I hope that I might find a few
> more before LSF too. One is that of space management (we have statfs, but
> currently no notifications for thresholds crossed etc., so everything is
> polled. That is ok sometimes, but statfs can be expensive in the case of
> distributed filesystems, if 100% accurate. We could just have ENOSPC
> notifications for 100% full, or something more generic), another is state
> transitions (is the fs running normally, or has it gone read
> only/withdrawn/etc due to I/O errors?) and a further topic would be working
> towards a common interface for fs statistics (at the moment each fs defines
> their own interface). One potential implementation, at least for the first
> two sub-topics, would be to use something along the lines of the quota
> netlink interface, but since few ideas survive first contact with the
> community at large, I'm throwing this out for further discussion and
> feedback on whether this approach is considered the right way to go.
> 
> Assuming the topic is accepted, my intention would be to gather together
> some additional sub-topics relating to fs management to go along with those
> I mentioned above, and I'd be very interested to hear of any other issues
> that could be usefully added to the list for discussion.

So this topic came up last year and probably the year before as well (heh,
I can even find some patches from 2011 [1]). I think the latest attempt at
what you suggest was here [2]. So clearly there's some interest in these
interfaces but not enough to actually drive anything to completion. So for
this topic to be useful, I think you need to go at least through the
patches in [2] and the comments on them, and have a concrete proposal that
can be discussed, plus some commitment (not necessarily from yourself) that
someone is going to devote time to implement it. Because generally nobody
seems to be opposed to the abstract idea, but once it gets to the
implementation details, it is non-trivial to get some wider agreement
(statx anybody? ;)).

        Honza

[1] https://lkml.org/lkml/2011/8/18/170
[2] https://lkml.org/lkml/2015/6/16/456
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read

2017-09-04 Thread Jan Kara
On Tue 29-08-17 16:13:19, Christoph Hellwig wrote:
> From: Milosz Tanski 
> 
> Allow generic_file_buffered_read to bail out early instead of waiting for
> the page lock or reading a page if IOCB_NOWAIT is specified.
> 
> Signed-off-by: Milosz Tanski 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jeff Moyer 
> Acked-by: Sage Weil 

This looks good to me now. You can add:

Reviewed-by: Jan Kara 
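
(For reference, the userspace-visible behaviour - assuming the RWF_NOWAIT
flag that the rest of this series wires up:)

struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);

if (n < 0 && errno == EAGAIN) {
	/* page missing or not uptodate: the read would have blocked */
}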

Honza


> ---
>  mm/filemap.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4bcfa74ad802..eed394fd331c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
> *iocb,
>  
>   page = find_get_page(mapping, index);
>   if (!page) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + goto would_block;
>   page_cache_sync_readahead(mapping,
>   ra, filp,
>   index, last_index - index);
> @@ -1950,6 +1952,11 @@ static ssize_t generic_file_buffered_read(struct kiocb 
> *iocb,
>   index, last_index - index);
>   }
>   if (!PageUptodate(page)) {
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + put_page(page);
> + goto would_block;
> + }
> +
>   /*
>* See comment in do_read_cache_page on why
>* wait_on_page_locked is used to avoid unnecessarily
> @@ -2131,6 +2138,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
> *iocb,
>   goto readpage;
>   }
>  
> +would_block:
> + error = -EAGAIN;
>  out:
>   ra->prev_pos = prev_index;
>   ra->prev_pos <<= PAGE_SHIFT;
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] block/laptop_mode: Convert timers to use timer_setup()

2017-10-09 Thread Jan Kara
_info *bdi)
>  {
>   int ret;
> @@ -835,6 +867,8 @@ static int bdi_init(struct backing_dev_info *bdi)
>   INIT_LIST_HEAD(&bdi->bdi_list);
>   INIT_LIST_HEAD(&bdi->wb_list);
>   init_waitqueue_head(&bdi->wb_waitq);
> + setup_timer(&bdi->laptop_mode_wb_timer,
> + laptop_mode_timer_fn, (unsigned long)bdi);
>  
>   ret = cgwb_bdi_init(bdi);
>  
> @@ -916,6 +950,8 @@ EXPORT_SYMBOL(bdi_register_owner);
>   */
>  static void bdi_remove_from_list(struct backing_dev_info *bdi)
>  {
> + del_timer_sync(&bdi->laptop_mode_wb_timer);
> +
>   spin_lock_bh(&bdi_lock);
>   list_del_rcu(&bdi->bdi_list);
>   spin_unlock_bh(&bdi_lock);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 8d1fc593bce8..f8fe90dc529d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1976,42 +1976,6 @@ int dirty_writeback_centisecs_handler(struct ctl_table 
> *table, int write,
>   return 0;
>  }
>  
> -#ifdef CONFIG_BLOCK
> -void laptop_mode_timer_fn(unsigned long data)
> -{
> - struct request_queue *q = (struct request_queue *)data;
> -
> - wakeup_flusher_threads_bdi(q->backing_dev_info, WB_REASON_LAPTOP_TIMER);
> -}
> -
> -/*
> - * We've spun up the disk and we're in laptop mode: schedule writeback
> - * of all dirty data a few seconds from now.  If the flush is already 
> scheduled
> - * then push it back - the user is still using the disk.
> - */
> -void laptop_io_completion(struct backing_dev_info *info)
> -{
> - mod_timer(&info->laptop_mode_wb_timer, jiffies + laptop_mode);
> -}
> -
> -/*
> - * We're in laptop mode and we've just synced. The sync's writes will have
> - * caused another writeback to be scheduled by laptop_io_completion.
> - * Nothing needs to be written back anymore, so we unschedule the writeback.
> - */
> -void laptop_sync_completion(void)
> -{
> - struct backing_dev_info *bdi;
> -
> - rcu_read_lock();
> -
> - list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> - del_timer(&bdi->laptop_mode_wb_timer);
> -
> - rcu_read_unlock();
> -}
> -#endif
> -
>  /*
>   * If ratelimit_pages is too high then we can get into dirty-data overload
>   * if a large number of processes all perform writes at the same time.
> -- 
> 2.14.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/4] bdi: convert bdi_debug_register to int

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:35:57, weiping zhang wrote:
> Convert bdi_debug_register to return int and do error handling for it.
> 
> Signed-off-by: weiping zhang 

This patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/backing-dev.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5072be19d9b2..e9d6a1ede12b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -116,11 +116,23 @@ static const struct file_operations 
> bdi_debug_stats_fops = {
>   .release= single_release,
>  };
>  
> -static void bdi_debug_register(struct backing_dev_info *bdi, const char 
> *name)
> +static int bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>  {
> + if (!bdi_debug_root)
> + return -ENOMEM;
> +
>   bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
> + if (!bdi->debug_dir)
> + return -ENOMEM;
> +
>   bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
>  bdi, &bdi_debug_stats_fops);
> + if (!bdi->debug_stats) {
> + debugfs_remove(bdi->debug_dir);
> + return -ENOMEM;
> + }
> +
> + return 0;
>  }
>  
>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
> @@ -133,9 +145,10 @@ static inline int bdi_debug_init(void)
>  {
>   return 0;
>  }
> -static inline void bdi_debug_register(struct backing_dev_info *bdi,
> +static inline int bdi_debug_register(struct backing_dev_info *bdi,
>     const char *name)
>  {
> + return 0;
>  }
>  static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
>  {
> -- 
> 2.14.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/4] bdi: add check for bdi_debug_root

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:35:36, weiping zhang wrote:
> This patch adds a check for bdi_debug_root and does error handling for it.
> We should make sure it was created successfully; otherwise, when adding a new
> block device's bdi folder (e.g. 8:0), it will be created in the debugfs root
> directory.
> 
> Signed-off-by: weiping zhang 
> ---
>  mm/backing-dev.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)

These functions get called only on system boot - ENOMEM in those cases is
generally considered fatal and oopsing is an acceptable result. So I don't
think this patch is needed.

Honza

> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 74b52dfd5852..5072be19d9b2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -36,9 +36,12 @@ struct workqueue_struct *bdi_wq;
>  
>  static struct dentry *bdi_debug_root;
>  
> -static void bdi_debug_init(void)
> +static int bdi_debug_init(void)
>  {
>   bdi_debug_root = debugfs_create_dir("bdi", NULL);
> + if (!bdi_debug_root)
> + return -ENOMEM;
> + return 0;
>  }
>  
>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
> @@ -126,8 +129,9 @@ static void bdi_debug_unregister(struct backing_dev_info 
> *bdi)
>   debugfs_remove(bdi->debug_dir);
>  }
>  #else
> -static inline void bdi_debug_init(void)
> +static inline int bdi_debug_init(void)
>  {
> + return 0;
>  }
>  static inline void bdi_debug_register(struct backing_dev_info *bdi,
> const char *name)
> @@ -229,12 +233,19 @@ ATTRIBUTE_GROUPS(bdi_dev);
>  
>  static __init int bdi_class_init(void)
>  {
> + int ret;
> +
>   bdi_class = class_create(THIS_MODULE, "bdi");
>   if (IS_ERR(bdi_class))
>   return PTR_ERR(bdi_class);
>  
>   bdi_class->dev_groups = bdi_dev_groups;
> - bdi_debug_init();
> + ret = bdi_debug_init();
> + if (ret) {
> + class_destroy(bdi_class);
> + bdi_class = NULL;
> + return ret;
> + }
>  
>   return 0;
>  }
> -- 
> 2.14.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/4] bdi: add error handle for bdi_debug_register

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:36:14, weiping zhang wrote:
> In order to make error handling cleaner we call bdi_debug_register
> before setting the state to WB_registered, so that we can avoid calling
> bdi_unregister in release_bdi().
> 
> Signed-off-by: weiping zhang 
> ---
>  mm/backing-dev.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e9d6a1ede12b..54396d53f471 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -893,10 +893,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   if (IS_ERR(dev))
>   return PTR_ERR(dev);
>  
> + if (bdi_debug_register(bdi, dev_name(dev))) {
> + device_destroy(bdi_class, dev->devt);
> + return -ENOMEM;
> + }
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
>   set_bit(WB_registered, &bdi->wb.state);
>  
>   spin_lock_bh(&bdi_lock);
> @@ -916,6 +919,8 @@ int bdi_register(struct backing_dev_info *bdi, const char 
> *fmt, ...)
>   va_start(args, fmt);
>   ret = bdi_register_va(bdi, fmt, args);
>   va_end(args);
> + if (ret)
> + bdi_put(bdi);

Why do you drop the bdi reference here in the error case? We didn't do it
previously if bdi_register_va() failed for other reasons...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/4] block: add WARN_ON if bdi register fail

2017-10-30 Thread Jan Kara
On Fri 27-10-17 01:36:42, weiping zhang wrote:
> device_add_disk needs to do safer error handling, so for now this patch
> just adds a WARN_ON.
> 
> Signed-off-by: weiping zhang 
> ---
>  block/genhd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index dd305c65ffb0..cb55eea821eb 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -660,7 +660,9 @@ void device_add_disk(struct device *parent, struct 
> gendisk *disk)
>  
>   /* Register BDI before referencing it from bdev */
>   bdi = disk->queue->backing_dev_info;
> - bdi_register_owner(bdi, disk_to_dev(disk));
> + retval = bdi_register_owner(bdi, disk_to_dev(disk));
> + if (retval)
> + WARN_ON(1);

Just a nit: You can do

    WARN_ON(retval);

Otherwise you can add:

Reviewed-by: Jan Kara 

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register

2017-11-01 Thread Jan Kara
On Tue 31-10-17 18:38:24, weiping zhang wrote:
> In order to make error handling cleaner we call bdi_debug_register
> before setting the state to WB_registered, so that we can avoid calling
> bdi_unregister in release_bdi().
> 
> Signed-off-by: weiping zhang 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/backing-dev.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940ce0143..84b2dc76f140 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   if (IS_ERR(dev))
>   return PTR_ERR(dev);
>  
> + if (bdi_debug_register(bdi, dev_name(dev))) {
> + device_destroy(bdi_class, dev->devt);
> + return -ENOMEM;
> + }
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
>   set_bit(WB_registered, &bdi->wb.state);
>  
>   spin_lock_bh(&bdi_lock);
> -- 
> 2.14.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-11-20 Thread Jan Kara
Hi Tao!

On Fri 17-11-17 14:51:18, Hou Tao wrote:
> On 2017/3/13 23:14, Jan Kara wrote:
> > blkdev_open() may race with gendisk shutdown in two different ways.
> > Either del_gendisk() has already unhashed the block device inode (and thus
> > bd_acquire() will end up creating a new block device inode), however
> > get_gendisk() will still return the gendisk that is being destroyed.
> > Or the bdev returned by bd_acquire() will get unhashed and the gendisk
> > destroyed before we get to get_gendisk(), and get_gendisk() will return a
> > new gendisk that got allocated for a device that reused the device number.
> > 
> > In both cases this will result in possible inconsistencies between
> > bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> > destroyed and device number reused, in the second case immediately).
> 
> > Fix the problem by checking whether the gendisk is still alive and inode
> > hashed when associating bdev inode with it and its bdi. That way we are
> > sure that we will not associate bdev inode with disk that got past
> > blk_unregister_region() in del_gendisk() (and thus device number can get
> > reused). Similarly, we will not associate bdev that was once associated
> > with gendisk that is going away (and thus the corresponding bdev inode
> > will get unhashed in del_gendisk()) with a new gendisk that is just
> > reusing the device number.
> 
> I have run some extended tests based on Omar Sandoval's script [1] these days,
> and found two new problems related to the race between bdev open and gendisk
> shutdown.
> 
> The first problem lies in __blkdev_get(), and will cause a use-after-free, or
> worse, an oops. It happens because there may be two instances of gendisk
> related to one bdev. When the process which owns the newer gendisk completes
> the invocation of __blkdev_get() first, the other process, which owns the
> older gendisk, will put the last reference of the older gendisk and cause a
> use-after-free.
> 
> The following call sequences illustrate the problem:
> 
> unhash the bdev_inode of /dev/sda
> 
> process A (blkdev_open()):
>   bd_acquire() returns new bdev_inode of /dev/sda
>   get_gendisk() returns gendisk (v1 gendisk)
> 
> remove gendisk from bdev_map
> device number is reused

^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

> process B (blkdev_open()):
>   bd_acquire() returns new bdev_inode of /dev/sda
>   get_gendisk() returns a new gendisk (v2 gendisk)
>   increase bdev->bd_openers

So this needs to be a bit more complex - bd_acquire() must happen before
the bdev_unhash_inode() in del_gendisk() above. And get_gendisk() must
happen after the blk_unregister_region() from del_gendisk(). But yes,
this seems possible.

> process A (blkdev_open()):
>   find bdev->bd_openers != 0
>   put_disk(disk) // this is the last reference to v1 gendisk
>   disk_unblock_events(disk) // use-after-free occurs
> 
> The problem can be fixed by an extra check showed in the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
>   if (!bdev->bd_openers) {
>   } else {
>   if (bdev->bd_disk != disk) {
>   ret = -ENXIO;
>   goto out_unlock_bdev;
>   }
>   }
> }
>
> And we can also simplify the check in your patch by moving
> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
> or a new gendisk.

I don't think we can do that. If we call blk_unregister_region() before
bdev_unhash_inode(), the device number can get reused while the bdev inode is
still visible. So you can get bdev inode v1 associated with gendisk
v2. And an open following the unhashing of bdev inode v1 will get bdev inode v2
associated with gendisk v2. So you have two different bdev inodes
associated with the same gendisk, and that will cause data integrity issues.
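
To spell the ordering hazard out (the interleaving is illustrative):

/*
 *   del_gendisk() (gendisk v1)        blkdev_open()
 *   --------------------------        -------------
 *   blk_unregister_region()
 *     (dev_t reusable; gendisk v2
 *      registers under the same number)
 *                                     bd_acquire()
 *                                       -> bdev inode v1 (still hashed)
 *                                     get_gendisk()
 *                                       -> gendisk v2
 *   bdev_unhash_inode()
 *                                     (a later open: inode v2 + gendisk v2)
 *
 * => two different bdev inodes for the same gendisk.
 */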

But what you suggest above is probably a reasonable fix. Will you submit a
proper patch please?

> And the only check we need to do in __blkdev_get() is
> checking the hash status of the bdev_inode after we get a gendisk from
> get_gendisk(). The check can be done like the following snippet:
> 
> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> {
>   /* .. */
>   ret = -ENXIO;
>   disk = get_gendisk(bdev->bd_dev, &partno);
>   if (!disk)
>   goto out;
>   owner = disk->fops->owner;
> 
>   if (inode_unhashed(bdev->bd_inode))
>   goto out_unmatched;
> }


Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-24 Thread Jan Kara
On Fri 24-11-17 09:11:49, Michal Hocko wrote:
> On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> > On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > > for each struct page. So you are doubling the size. Who is going to
> > > > > enable this config option? You are moving this to page_ext in a later
> > > > > patch which is a good step but it doesn't go far enough because this
> > > > > still consumes those resources. Is there any problem to make this
> > > > > kernel command line controllable? Something we do for page_owner for
> > > > > example?
> > > > 
> > > > Sure. I will add it.
> > > > 
> > > > > Also it would be really great if you could give us some measures about
> > > > > the runtime overhead. I do not expect it to be very large but this is
> > > > 
> > > > The major overhead would come from the amount of additional memory
> > > > consumption for 'lockdep_map's.
> > > 
> > > yes
> > > 
> > > > Do you want me to measure the overhead by the additional memory
> > > > consumption?
> > > > 
> > > > Or do you expect another overhead?
> > > 
> > > I would also be interested in how much impact this has on performance. I do
> > > not expect it to be too large, but it would be good to have some numbers for
> > > a cache-cold parallel kbuild or other heavy page lock workloads.
> > 
> > Hello Michal,
> > 
> > I measured 'cache cold parallel kbuild' on my qemu machine. The result
> > varies a lot so I cannot confirm, but I think there's no meaningful
> > difference between before and after applying crossrelease to page locks.
> > 
> > Actually, I expect little overhead in lock_page() and unlock_page() even
> > after applying crossrelease to page locks; I only expect a bit of overhead
> > from the additional memory consumption for the 'lockdep_map's per page.
> > 
> > I run the following instructions within "QEMU x86_64 4GB memory 4 cpus":
> > 
> >make clean
> >echo 3 > drop_caches
> >time make -j4
> 
> Maybe FS people will help you find a more representative workload. E.g.
> a linear cache-cold file read should be good as well. Maybe there are some
> tests in fstests (or whatever they call xfstests these days).

So a relatively good test of page handling costs is to mmap a cache-hot file
and measure the time to fault in all the pages in the mapping. That way IO and
the filesystem stay out of the way and you measure only the page table lookup,
page handling (taking the page ref and locking the page), and instantiation of
the new PTE. Of these, page handling is actually the significant part.
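
Something along these lines is what I mean (a minimal sketch, error handling
omitted; the file must be small enough to stay cache hot):

/* fault-in microbenchmark: mmap a cache-hot file and time touching one
 * byte per page -> page table lookup + page ref/lock + PTE instantiation,
 * no IO involved. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	struct timespec t0, t1;
	volatile unsigned char sum = 0;	/* keep the loop from being elided */
	int fd = open(argv[1], O_RDONLY);

	fstat(fd, &st);
	unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
				fd, 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off_t off = 0; off < st.st_size; off += 4096)
		sum += p[off];		/* one minor fault per page */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("faulted %lld pages in %.3f ms\n",
	       (long long)(st.st_size / 4096),
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}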

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 07/11] xfs: remove not needed freezing calls

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:52, Luis R. Rodriguez wrote:
> This removes superfluous freezer calls, as they are no longer needed
> now that the VFS performs filesystem freezing/thawing if the filesystem
> supports it.
> 
> The following Coccinelle rule was used as follows:
> 
> spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/$FS/

I think your rule misses the WQ_FREEZABLE flag for workqueues? That would
also be good to get rid of...
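
I mean call sites like this (a made-up example, not an actual XFS one):

- wq = alloc_workqueue("xfs-foo", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
+ wq = alloc_workqueue("xfs-foo", WQ_MEM_RECLAIM, 0);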

Honza

> 
> @ has_freeze_fs @
> identifier super_ops;
> expression freeze_op;
> @@
> 
> struct super_operations super_ops = {
>   .freeze_fs = freeze_op,
> };
> 
> @ remove_set_freezable depends on has_freeze_fs @
> expression time;
> statement S, S2, S3;
> expression task;
> @@
> 
> (
> - set_freezable();
> |
> - if (try_to_freeze())
> - continue;
> |
> - try_to_freeze();
> |
> - freezable_schedule();
> + schedule();
> |
> - freezable_schedule_timeout(time);
> + schedule_timeout(time);
> |
> - if (freezing(task)) { S }
> |
> - if (freezing(task)) { S }
> - else
>   { S2 }
> |
> - freezing(current)
> )
> 
> Generated-by: Coccinelle SmPL
> Signed-off-by: Luis R. Rodriguez 
> ---
>  fs/xfs/xfs_trans_ail.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index cef89f7127d3..1f3dd10a9d00 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -513,7 +513,6 @@ xfsaild(
>   longtout = 0;   /* milliseconds */
>  
>   current->flags |= PF_MEMALLOC;
> - set_freezable();
>  
>   while (1) {
>   if (tout && tout <= 20)
> @@ -551,19 +550,17 @@ xfsaild(
>   if (!xfs_ail_min(ailp) &&
>   ailp->xa_target == ailp->xa_target_prev) {
>   spin_unlock(&ailp->xa_lock);
> - freezable_schedule();
> + schedule();
>   tout = 0;
>   continue;
>   }
>   spin_unlock(&ailp->xa_lock);
>  
>   if (tout)
> - freezable_schedule_timeout(msecs_to_jiffies(tout));
> + schedule_timeout(msecs_to_jiffies(tout));
>  
>   __set_current_state(TASK_RUNNING);
>  
> - try_to_freeze();
> -
>   tout = xfsaild_push(ailp);
>   }
>  
> -- 
> 2.15.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 05/11] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:50, Luis R. Rodriguez wrote:
> There are use cases where we wish to traverse the superblock list
> and also capture errors, in which case we want to avoid having
> our callers take a lock themselves since we can do the locking for
> them. Provide an iterate_supers_excl() which calls a function
> with the write lock held. If an error occurs we capture it and
> propagate it.
> 
> Likewise there are use cases where we wish to traverse the superblock
> list but in reverse order. The new iterate_supers_reverse_excl() helper
> does this and also captures any errors encountered.
> 
> Signed-off-by: Luis R. Rodriguez 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 
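
(A caller would then look something like this - a hypothetical user, not
part of this patch:)

static int sync_one_sb(struct super_block *sb, void *arg)
{
	/* s_umount is held for writing here by the iterator */
	return sync_filesystem(sb);
}

	err = iterate_supers_excl(sync_one_sb, NULL);

with a non-zero return value stopping the walk and getting propagated back.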

Honza


> ---
>  fs/super.c | 91 
> ++
>  include/linux/fs.h |  2 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/fs/super.c b/fs/super.c
> index a63513d187e8..885711c1d35b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -605,6 +605,97 @@ void iterate_supers(void (*f)(struct super_block *, void 
> *), void *arg)
>   spin_unlock(&sb_lock);
>  }
>  
> +/**
> + *   iterate_supers_excl - exclusively call func for all active superblocks
> + *   @f: function to call
> + *   @arg: argument to pass to it
> + *
> + *   Scans the superblock list and calls given function, passing it
> + *   locked superblock and given argument. Returns 0 unless an error
> + *   occurred on calling the function on any superblock.
> + */
> +int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
> +{
> + struct super_block *sb, *p = NULL;
> + int error = 0;
> +
> + spin_lock(&sb_lock);
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (hlist_unhashed(&sb->s_instances))
> + continue;
> + sb->s_count++;
> + spin_unlock(&sb_lock);
> +
> + down_write(&sb->s_umount);
> + if (sb->s_root && (sb->s_flags & SB_BORN)) {
> + error = f(sb, arg);
> + if (error) {
> + up_write(&sb->s_umount);
> + spin_lock(&sb_lock);
> + __put_super(sb);
> + break;
> + }
> + }
> + up_write(&sb->s_umount);
> +
> + spin_lock(&sb_lock);
> + if (p)
> + __put_super(p);
> + p = sb;
> + }
> + if (p)
> + __put_super(p);
> + spin_unlock(&sb_lock);
> +
> + return error;
> +}
> +
> +/**
> + *   iterate_supers_reverse_excl - exclusively calls func in reverse order
> + *   @f: function to call
> + *   @arg: argument to pass to it
> + *
> + *   Scans the superblock list and calls given function, passing it
> + *   locked superblock and given argument, in reverse order, and holding
> + *   the s_umount write lock. Returns if an error occurred.
> + */
> +int iterate_supers_reverse_excl(int (*f)(struct super_block *, void *),
> +  void *arg)
> +{
> + struct super_block *sb, *p = NULL;
> + int error = 0;
> +
> + spin_lock(&sb_lock);
> + list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> + if (hlist_unhashed(&sb->s_instances))
> + continue;
> + sb->s_count++;
> + spin_unlock(&sb_lock);
> +
> + down_write(&sb->s_umount);
> + if (sb->s_root && (sb->s_flags & SB_BORN)) {
> + error = f(sb, arg);
> + if (error) {
> + up_write(&sb->s_umount);
> + spin_lock(&sb_lock);
> + __put_super(sb);
> + break;
> + }
> + }
> + up_write(&sb->s_umount);
> +
> + spin_lock(&sb_lock);
> + if (p)
> + __put_super(p);
> + p = sb;
> + }
> + if (p)
> + __put_super(p);
> + spin_unlock(&sb_lock);
> +
> + return error;
> +}
> +
>  /**
>   *   iterate_supers_type - call function for superblocks of given type
>   *   @type: fs type
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 107725b20fad..fe90b6542697 100644
> --- a/include/li

Re: [PATCH 01/11] fs: provide unlocked helper for freeze_super()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:46, Luis R. Rodriguez wrote:
> freeze_super() takes the write lock itself, however we wish to also enable
> callers which already hold the write lock. To do this, provide a helper
> and make freeze_super() use it. This way, all that freeze_super() does
> now is lock handling and active count management.
> 
> This patch introduces no functional changes.
> 
> Suggested-by: Dave Chinner 
> Signed-off-by: Luis R. Rodriguez 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/super.c | 100 
> +
>  1 file changed, 55 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d4e33e8f1e6f..a7650ff22f0e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1387,59 +1387,20 @@ static void sb_freeze_unlock(struct super_block *sb)
>   percpu_up_write(sb->s_writers.rw_sem + level);
>  }
>  
> -/**
> - * freeze_super - lock the filesystem and force it into a consistent state
> - * @sb: the super to lock
> - *
> - * Syncs the super to make sure the filesystem is consistent and calls the 
> fs's
> - * freeze_fs.  Subsequent calls to this without first thawing the fs will 
> return
> - * -EBUSY.
> - *
> - * During this function, sb->s_writers.frozen goes through these values:
> - *
> - * SB_UNFROZEN: File system is normal, all writes progress as usual.
> - *
> - * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
> - * writes should be blocked, though page faults are still allowed. We wait 
> for
> - * all writes to complete and then proceed to the next stage.
> - *
> - * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
> - * but internal fs threads can still modify the filesystem (although they
> - * should not dirty new pages or inodes), writeback can run etc. After 
> waiting
> - * for all running page faults we sync the filesystem which will clean all
> - * dirty pages and inodes (no new dirty pages or inodes can be created when
> - * sync is running).
> - *
> - * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
> - * modification are blocked (e.g. XFS preallocation truncation on inode
> - * reclaim). This is usually implemented by blocking new transactions for
> - * filesystems that have them and need this additional guard. After all
> - * internal writers are finished we call ->freeze_fs() to finish filesystem
> - * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
> - * mostly auxiliary for filesystems to verify they do not modify frozen fs.
> - *
> - * sb->s_writers.frozen is protected by sb->s_umount.
> - */
> -int freeze_super(struct super_block *sb)
> +/* Caller takes lock and handles active count */
> +static int freeze_locked_super(struct super_block *sb)
>  {
>   int ret;
>  
> - atomic_inc(&sb->s_active);
> - down_write(&sb->s_umount);
> - if (sb->s_writers.frozen != SB_UNFROZEN) {
> - deactivate_locked_super(sb);
> + if (sb->s_writers.frozen != SB_UNFROZEN)
>   return -EBUSY;
> - }
>  
> - if (!(sb->s_flags & SB_BORN)) {
> - up_write(&sb->s_umount);
> + if (!(sb->s_flags & SB_BORN))
>   return 0;   /* sic - it's "nothing to do" */
> - }
>  
>   if (sb_rdonly(sb)) {
>   /* Nothing to do really... */
>   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> - up_write(&sb->s_umount);
>   return 0;
>   }
>  
> @@ -1468,7 +1429,6 @@ int freeze_super(struct super_block *sb)
>   sb->s_writers.frozen = SB_UNFROZEN;
>   sb_freeze_unlock(sb);
>   wake_up(&sb->s_writers.wait_unfrozen);
> - deactivate_locked_super(sb);
>   return ret;
>   }
>   }
> @@ -1478,9 +1438,59 @@ int freeze_super(struct super_block *sb)
>*/
>   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
>   lockdep_sb_freeze_release(sb);
> - up_write(&sb->s_umount);
>   return 0;
>  }
> +
> +/**
> + * freeze_super - lock the filesystem and force it into a consistent state
> + * @sb: the super to lock
> + *
> + * Syncs the super to make sure the filesystem is consistent and calls the fs's
> + * freeze_fs.  Subsequent calls to this without first thawing the fs will return
> + * -EBUSY.
> + *
> + * During this function, sb->s_writers.frozen goes through these valu

Re: [PATCH 02/11] fs: provide unlocked helper thaw_super()

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:47, Luis R. Rodriguez wrote:
> thaw_super() holds a write lock; however, we wish to also enable
> callers which already hold the write lock. To do this, provide a helper
> and make thaw_super() use it. This way, all that thaw_super() does
> now is lock handling and active count management.
> 
> This change introduces no functional changes.
> 
> Suggested-by: Dave Chinner 
> Signed-off-by: Luis R. Rodriguez 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/super.c | 39 ++-
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index a7650ff22f0e..cecc279beecd 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1493,21 +1493,13 @@ int freeze_super(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(freeze_super);
>  
> -/**
> - * thaw_super -- unlock filesystem
> - * @sb: the super to thaw
> - *
> - * Unlocks the filesystem and marks it writeable again after freeze_super().
> - */
> -int thaw_super(struct super_block *sb)
> +/* Caller takes lock and handles active count */
> +static int thaw_locked_super(struct super_block *sb)
>  {
>   int error;
>  
> - down_write(&sb->s_umount);
> - if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
> - up_write(&sb->s_umount);
> + if (sb->s_writers.frozen != SB_FREEZE_COMPLETE)
>   return -EINVAL;
> - }
>  
>   if (sb_rdonly(sb)) {
>   sb->s_writers.frozen = SB_UNFROZEN;
> @@ -1522,7 +1514,6 @@ int thaw_super(struct super_block *sb)
>   printk(KERN_ERR
>   "VFS:Filesystem thaw failed\n");
>   lockdep_sb_freeze_release(sb);
> - up_write(&sb->s_umount);
>   return error;
>   }
>   }
> @@ -1531,7 +1522,29 @@ int thaw_super(struct super_block *sb)
>   sb_freeze_unlock(sb);
>  out:
>   wake_up(&sb->s_writers.wait_unfrozen);
> - deactivate_locked_super(sb);
>   return 0;
>  }
> +
> +/**
> + * thaw_super -- unlock filesystem
> + * @sb: the super to thaw
> + *
> + * Unlocks the filesystem and marks it writeable again after freeze_super().
> + */
> +int thaw_super(struct super_block *sb)
> +{
> + int error;
> +
> + down_write(&sb->s_umount);
> +     error = thaw_locked_super(sb);
> + if (error) {
> + up_write(&sb->s_umount);
> + goto out;
> + }
> +
> + deactivate_locked_super(sb);
> +
> +out:
> + return error;
> +}
>  EXPORT_SYMBOL(thaw_super);
> -- 
> 2.15.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-11-30 Thread Jan Kara
On Wed 29-11-17 15:23:48, Luis R. Rodriguez wrote:
> The question of whether or not a superblock is frozen needs to be
> augmented in the future to account for differences between a
> user-initiated freeze and a kernel-initiated freeze done automatically
> on behalf of the kernel.
> 
> Provide helpers that can be used instead, so that we don't have to
> expand the checks later in these same call sites as we expand the
> definition of a frozen superblock.
> 
> Signed-off-by: Luis R. Rodriguez 

So helpers are fine but...

> +/**
> + * sb_is_frozen_by_user - is superblock frozen by a user call
> + * @sb: the super to check
> + *
> + * Returns true if the super freeze was initiated by userspace, for instance,
> + * an ioctl call.
> + */
> +static inline bool sb_is_frozen_by_user(struct super_block *sb)
> +{
> + return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
> +}

... I dislike the _by_user() suffix as there may be different places that
call freeze_super() (e.g. device mapper does this during some operations).
Clearly we need to distinguish between the "by system suspend" and "the
other" cases. So please make this clear in the naming.

In fact, what might be a cleaner solution is to introduce a 'freeze_count'
for superblock freezing (we already do have this for block devices). Then
you don't need to differentiate these two cases - but you'd still need to
properly handle cleanup if freezing of all superblocks fails in the middle.
So I'm not 100% sure this works out nicely in the end. But it's certainly
worth considering.
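
For reference, the block device side counting works roughly like this (a
heavily simplified sketch of freeze_bdev(), not the actual code in
fs/block_dev.c; error handling omitted):

	mutex_lock(&bdev->bd_fsfreeze_mutex);
	if (++bdev->bd_fsfreeze_count > 1) {
		/* already frozen - just take another reference */
		sb = get_super(bdev);
		mutex_unlock(&bdev->bd_fsfreeze_mutex);
		return sb;
	}
	/* first freezer does the real work */
	sb = get_active_super(bdev);
	if (sb)
		error = freeze_super(sb);
	mutex_unlock(&bdev->bd_fsfreeze_mutex);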

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-12-01 Thread Jan Kara
On Thu 30-11-17 20:05:48, Luis R. Rodriguez wrote:
> On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> > ... I dislike the _by_user() suffix as there may be different places that
> > call freeze_super() (e.g. device mapper does this during some operations).
> > Clearly we need to distinguish "by system suspend" and "the other" cases.
> > So please make this clear in the naming.
> 
> Ah. How about sb_frozen_by_cb() ?

And what does 'cb' stand for? :)

> > In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> > for superblock freezing (we already do have this for block devices). Then
> > you don't need to differentiate these two cases - but you'd still need to
> > properly handle cleanup if freezing of all superblocks fails in the middle.
> > So I'm not 100% this works out nicely in the end. But it's certainly worth
> > a consideration.
> 
> Ah, there are three important reasons for doing it the way I did it which are
> easy to miss, unless you read the commit log message very carefully.
> 
> 0) The ioctl interface causes a failure to be sent back to userspace if
> you issue two consecutive freezes, or two thaws. I.e., once a filesystem is
> frozen, a second call will result in an error. Likewise for thaw.

Yep. But also note that there's *another* interface to filesystem freezing
which behaves differently - freeze_bdev() (used internally by dm). That
interface uses a counter and freezing of an already frozen device succeeds.
IOW it is a mess. We cannot change the behavior of the ioctl but we could
still provide an in-kernel interface to freeze_super() with the same
semantics as freeze_bdev(), which might be easier for suspend to use - maybe
we could call it 'exclusive' (for the current freeze_super() semantics) and
'non-exclusive' (for the freeze_bdev() semantics) since this is very much
like O_EXCL open of block devices...

> 1) The new iterate supers stuff I added bails on the first error and
> returns that error. If we kept the ioctl() interface error scheme we'd be
> erroring out on suspend if userspace had already frozen a filesystem.
> Clearly that'd be silly, so we need to distinguish between the automatic
> kernel freezing and the old userspace ioctl-initiated interface, so that
> we can keep the old behaviour but allow in-kernel auto freeze on suspend
> to work properly.

This would work fine with the non-exclusive semantics I believe.

> 2) If we fail to suspend we then need to thaw all filesystems. The order
> in which we try to freeze is the reverse of the order on the super_block
> list. If we fail, though, we iterate in proper order on the super_block
> list and thaw. If you had two filesystems this means that if a failure
> happened on freezing the first filesystem, we'd first thaw the other
> filesystem -- and because of 0), if we don't distinguish between the ioctl
> interface and auto freezing, we'd also fail on thawing given the other
> superblock wouldn't have been frozen.
> 
> So we need to keep two separate approaches. The count stuff would not
> suffice to distinguish the origin of the freeze call.
> 
> Come to think of it, I think I forgot to avoid thaw if the freeze was
> ioctl-initiated...
> 
> thaw_unlocked(bool cb_call)
> {
>         if (sb_frozen_by_cb(sb) && !cb_call)
>                 return 0; /* skip as the user wanted to keep this fs frozen */
>         ...
> }
> 
> Even though the kernel auto call is new, I think we need to keep
> ioctl-initiated frozen filesystems frozen so as to not break old userspace
> assumptions.
> 
> So, keeping all this in mind, does a count method still suffice?

The count method would need a different error recovery method - i.e., if
you fail freezing filesystems somewhere in the middle of the iteration
through the superblock list, you'd need to iterate from that point back to
the superblock where you started. This is somewhat more complicated than
your approach but it seems cleaner to me:

1) The function freezing all superblocks either succeeds and all
superblocks are frozen, or fails and no superblocks are (additionally)
frozen.

2) It avoids the "normal users + one special user (who owns the 'flag' in
the superblock in the form of a special freeze state)" setup. We'd simply
have exclusive and non-exclusive users of superblock freezing, and there
can be arbitrary numbers of them.
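
To sketch the error recovery I have in mind (untested and illustrative
only - freeze_super_nonexcl() / thaw_super_nonexcl() are hypothetical
counter-based variants, and sb_lock locking plus superblock reference
counting are omitted for brevity):

	static int freeze_all_supers(void)
	{
		struct super_block *sb, *failed = NULL;
		int err = 0;

		list_for_each_entry(sb, &super_blocks, s_list) {
			err = freeze_super_nonexcl(sb);
			if (err) {
				failed = sb;
				break;
			}
		}
		if (!err)
			return 0;

		/* Unwind: thaw exactly the superblocks we froze above. */
		list_for_each_entry(sb, &super_blocks, s_list) {
			if (sb == failed)
				break;
			thaw_super_nonexcl(sb);
		}
		return err;
	}

This way the function either freezes everything or leaves the system as
it found it.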

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 03/11] fs: add frozen sb state helpers

2017-12-21 Thread Jan Kara
Hello,

I think I owe you a reply here... Sorry that it took so long.

On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> On Fri, Dec 01, 2017 at 12:47:24PM +0100, Jan Kara wrote:
> > On Thu 30-11-17 20:05:48, Luis R. Rodriguez wrote:
> > > > In fact, what might be a cleaner solution is to introduce a 
> > > > 'freeze_count'
> > > > for superblock freezing (we already do have this for block devices). 
> > > > Then
> > > > you don't need to differentiate these two cases - but you'd still need 
> > > > to
> > > > properly handle cleanup if freezing of all superblocks fails in the 
> > > > middle.
> > > > So I'm not 100% this works out nicely in the end. But it's certainly 
> > > > worth
> > > > a consideration.
> > > 
> > > Ah, there are three important reasons for doing it the way I did it which 
> > > are
> > > easy to miss, unless you read the commit log message very carefully.
> > > 
> > > 0) The ioctl interface causes a failure to be sent back to userspace if
> > > you issue two consecutive freezes, or two thaws. Ie, once a filesystem is
> > > frozen, a secondary call will result in an error. Likewise for thaw.
> > 
> > Yep. But also note that there's *another* interface to filesystem freezing
> > which behaves differently - freeze_bdev() (used internally by dm). That
> > interface uses the counter and freezing of already frozen device succeeds.
> 
> Ah... so freeze_bdev() semantics match the semantics I was looking for.
> 
> > IOW it is a mess.
> 
> To say the least.
> 
> > We cannot change the behavior of the ioctl but we could
> > still provide an in-kernel interface to freeze_super() with the same
> > semantics as freeze_bdev() which might be easier to use by suspend - maybe
> > we could call it 'exclusive' (for the current freeze_super() semantics) and
> > 'non-exclusive' (for the freeze_bdev() semantics) since this is very much
> > like O_EXCL open of block devices...
> 
> Sure. Typically I see we make exclusive calls with the postfix _excl(), so
> I take it you'd be OK with eventually renaming freeze_super() to
> freeze_super_excl()?

In principle yes, but let's leave the naming disputes to a later time when
it is clear what API we actually want to provide.

> I totally missed freeze_bdev() otherwise I think I would have picked up on the
> shared semantics stuff and I would have just made a helper out of what
> freeze_bdev() uses, and then have both in-kernel and freeze_bdev() use it.
> 
> I'll note that it's still not perfectly clear if the semantics behind
> freeze_bdev() fully match what I described above. That still needs to be
> vetted. For instance, does thaw_bdev() keep a superblock frozen if an
> ioctl-initiated freeze had occurred before? If so then great. Otherwise
> I think we'll need to distinguish the ioctl interface. The worst possible
> case is that bdev semantics and in-kernel semantics differ somehow; that
> would really create a holy fucking mess.

I believe nobody really thought about mixing those two interfaces to fs
freezing and so the behavior is basically defined by the implementation.
That is:

freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
freeze_bdev() on sb frozen by freeze_bdev() -> success
ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY

thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
ioctl_fsthaw() on sb frozen by freeze_bdev() -> success
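
(For context, the ioctl_fsfreeze() / ioctl_fsthaw() path above is what
userspace drives via the FIFREEZE / FITHAW ioctls; util-linux fsfreeze(8)
does essentially this on the mount point fd:

	int fd = open("/mnt", O_RDONLY);

	ioctl(fd, FIFREEZE, 0);	/* fsfreeze -f; a second call -> EBUSY */
	ioctl(fd, FITHAW, 0);	/* fsfreeze -u */

while freeze_bdev() is reachable only from inside the kernel.)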

What I propose is the following API:

freeze_super_excl()
  - freezes superblock, returns EBUSY if the superblock is already frozen
(either by another freeze_super_excl() or by freeze_super())
freeze_super()
  - this function will make sure superblock is frozen when the function
returns with success. It can be nested with other freeze_super() or
freeze_super_excl() calls (this second part is different from how
freeze_bdev() behaves currently but AFAICT this behavior is actually
what all current users of freeze_bdev() really want - just make sure
fs cannot be written to)
thaw_super()
  - counterpart to freeze_super(), would fail with EINVAL if we were to
drop the last "freeze reference" but sb was actually frozen with
freeze_super_excl()
thaw_super_excl()
  - counterpart to freeze_super_excl(). Fails with EINVAL if sb was not
frozen with freeze_super_excl() (this is different from the current behavior
but I don't believe anyone relies on this and doing otherwise is asking
for data corruption).

I'd implement it by a freeze count.
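
Roughly like this - an untested sketch only, where s_freeze_count and
s_freeze_excl are hypothetical new fields in struct super_block, and
do_freeze_fs() / do_thaw_fs() stand for the guts of the current
freeze_super() / thaw_super():

	static int freeze_super_common(struct super_block *sb, bool excl)
	{
		int ret = 0;

		down_write(&sb->s_umount);
		if (excl && sb->s_freeze_count > 0) {
			/* exclusive freeze fails if anyone holds the fs frozen */
			ret = -EBUSY;
		} else if (sb->s_freeze_count++ == 0) {
			/* first freezer does the real work */
			ret = do_freeze_fs(sb);
			if (ret)
				sb->s_freeze_count--;
		}
		if (!ret && excl)
			sb->s_freeze_excl = true;
		up_write(&sb->s_umount);
		return ret;
	}

	static int thaw_super_common(struct super_block *sb, bool excl)
	{
		int ret = 0;

		down_write(&sb->s_umount);
		if (sb->s_freeze_count == 0 ||
		    (excl && !sb->s_freeze_excl) ||
		    (!excl && sb->s_freeze_excl && sb->s_freeze_count == 1)) {
			/* not frozen, or the last reference is owned by _excl() */
			ret = -EINVAL;
		} else {
			if (excl)
				sb->s_freeze_excl = false;
			if (--sb->s_freeze_count == 0)
				ret = do_thaw_fs(sb);
		}
		up_write(&sb->s_umount);
		return ret;
	}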

Re: blkdev loop UAF

2018-01-11 Thread Jan Kara
On Thu 11-01-18 19:22:39, Hou Tao wrote:
> Hi,
> 
> On 2018/1/11 16:24, Dan Carpenter wrote:
> > Thanks for your report and the patch.  I am sending it to the
> > linux-block devs since it's already public.
> > 
> > regards,
> > dan carpenter
> 
> The use-after-free problem is not specific to the loop device, it can also
> be reproduced on scsi devices, and there are more race problems caused by
> the race between bdev open and gendisk shutdown [1].
> 
> The cause of the UAF problem is that there are two instances of gendisk
> which share the same bdev. After the process owning the new gendisk
> increases bdev->bd_openers, the other process which owns the older gendisk
> will find bdev->bd_openers is not zero, put the last reference of the
> older gendisk, and cause a use-after-free.
> 
> I had proposed a patch for the problem, but it's still an incomplete fix
> for the race between gendisk shutdown and bdev opening.
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fc..5ecdb9f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1510,6 +1510,11 @@ static int __blkdev_get(struct block_device *bdev, 
> fmode_t mode, int for_part)
> if (bdev->bd_bdi == &noop_backing_dev_info)
> bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> } else {
> +   if (bdev->bd_disk != disk) {
> +   ret = -ENXIO;
> +   goto out_unlock_bdev;
> +   }
> +
>     if (bdev->bd_contains == bdev) {
> ret = 0;
> if (bdev->bd_disk->fops->open)
> 
> 
> As far as I know, Jan Kara is working on these problems. So, Jan, any
> suggestions?

Yeah, thanks for the ping. I was working today to get this fixed. I have
about 6 patches (attached) which should fix all the outstanding issues I'm
aware of.  So far they are only boot tested. I will give them more
stress-testing next week and then post them officially but if you have time,
feel free to try them out as well. Thanks!

Honza


-- 
Jan Kara 
SUSE Labs, CR
>From 7acbaaf7925cfd5440b7f7bf9902d84136624007 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Thu, 11 Jan 2018 15:11:03 +0100
Subject: [PATCH 1/6] genhd: Fix leaked module reference for NVME devices

Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop the module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk(), but let's fix that in the follow-up
commit since that will be more intrusive.

Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..22f4fc340a3a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -802,7 +802,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 	}
 
 	if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+		struct module *owner = disk->fops->owner;
+
 		put_disk(disk);
+		module_put(owner);
 		disk = NULL;
 	}
 	return disk;
-- 
2.13.6

>From d660ec31e06f683f53c37b9195b85d363fb255ea Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Thu, 11 Jan 2018 15:27:38 +0100
Subject: [PATCH 2/6] genhd: Rename get_disk() to get_disk_and_module()

Rename get_disk() to get_disk_and_module() to make clear what the
function does. It's not a great name but at least it is now clear that
put_disk() is not its counterpart.

Signed-off-by: Jan Kara 
---
 block/genhd.c   | 10 --
 drivers/block/amiflop.c |  2 +-
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c|  2 +-
 drivers/block/swim.c|  2 +-
 drivers/block/z2ram.c   |  2 +-
 drivers/ide/ide-probe.c |  2 +-
 include/linux/genhd.h   |  2 +-
 10 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 22f4fc340a3a..058cf91f8d32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -547,7 +547,7 @@ static int exact_lock(dev_t devt, void *data)
 {
 	struct gendisk *p = data;
 
-	if (!get_disk(p))
+	if (!get_disk_and_module(p))
 		return -1;
 	return 0;
 }
@@ -794,7 +794,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 
 		spin_lock_bh(&ext_devt_lock);
 		part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
-		if (part && get_disk(part_to_disk(part))) {
+		if (part && get_disk_and_module(part_to_disk(part))) {
 			*p

Re: Panic with ext4,nbd,qemu-img,block

2018-01-25 Thread Jan Kara
On Sat 20-01-18 18:40:28, Theodore Ts'o wrote:
> On Sat, Jan 20, 2018 at 04:41:00PM +0800, Hongzhi, Song wrote:
> > 
> > 329.11 EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts:
> > (null)
> > 329.12 block nbd0: Connection timed out
> > 329.13 block nbd0: shutting down sockets
> 
> That's your problem.  I'm guessing qemu-nbd is dying for some reason,
> which is causing the nbd device to fail (hence the "connection timed
> out").

All those filesystem errors are obviously his problem but:

329.45 [ cut here ]
329.46 kernel BUG at /kernel-source//fs/buffer.c:3091!
329.47 invalid opcode:  [#1] PREEMPT SMP
...

is a kernel problem. Hongzhi, can you please track down which line in
submit_bh_wbc() is line 3091 in your sources? I'd be interested in which
assertion actually failed. My best bet would be
BUG_ON(!buffer_mapped(bh));
but I'm not sure...

    Honza

-- 
Jan Kara 
SUSE Labs, CR


[LSF/MM TOPIC] get_user_pages() and filesystems

2018-01-25 Thread Jan Kara
Hello,

this is about a problem I identified last month and for which I still
don't have a good solution. Some discussion of the problem happened here
[1], where technical details are also posted, but the culprit of the
problem is relatively simple: lots of places in the kernel (fs code,
writeback logic, the stable-pages framework for DIF/DIX) assume that file
pages in the page cache can be modified either via write(2), truncate(2),
fallocate(2) or similar code paths explicitly manipulating file space, or
via a writeable mapping into page tables. In particular we assume that if
we block all the above paths by taking proper locks, block page faults,
and unmap (/ map read-only) the page, it cannot be modified. But this
assumption is violated by get_user_pages() users (such as direct IO or
RDMA drivers - and we've got reports from such users of weird things
happening).

The problem with GUP users is that they acquire a page reference (at that
point the page is writeably mapped into page tables) and some time in the
future (which can be quite far off in the case of RDMA) the page contents
get modified and the page is marked dirty.
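
Schematically the pattern looks like this (a driver-side illustration,
not code from any particular driver; error handling omitted):

	struct page *pages[16];
	int i, nr;

	/* pin pages backing a writably mapped file range */
	nr = get_user_pages_fast(user_addr, 16, 1 /* write */, pages);

	/*
	 * hand the pages to hardware (RDMA, direct IO, ...) - hours may
	 * pass, during which writeback can clean and write-protect the
	 * pages while the filesystem assumes they can no longer change
	 */

	/* much later: dirty and release the pages */
	for (i = 0; i < nr; i++) {
		set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}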

The question is how to properly solve this problem. One obvious way is to
indicate that a page has a GUP reference and block its unmapping /
remapping RO until that reference is dropped. But this has a technical
problem (how to find space in struct page for such tracking) and a design
problem (blocking e.g. writeback for hours because some RDMA app used a
file mapping as a buffer simply is not acceptable). There are also various
modifications to this solution, like refusing to use file pages for RDMA
(get_user_pages_longterm()) and blocking to wait for users like direct IO,
or requiring that RDMA users provide a way to revoke access to GUPed pages.

Another obvious solution is to try to remove the assumption from all those
places - i.e., use bounce buffers for DIF/DIX, make sure filesystems are
prepared for dirty pages suddenly appearing in files, and handle that as
well as they can. They really need to sensibly handle only the case when
the underlying storage is already allocated / reserved; in all other cases
I believe they are fine just discarding the data. This would be very
tedious but I believe it could be done. But the overall long-term
maintenance burden of this solution just doesn't seem worth it to me.

Another possible solution might be that GUP users (at least the long-term
ones) won't get references directly to page cache pages but only to some
bounce pages (something like CoW on private file mappings) and the data
would just be copied to the page cache pages at set_page_dirty_lock() time
(we would probably have to move these users to a completely new API to
keep our sanity). This would have userspace-visible impacts (data won't be
visible in the file until the GUP user is done with it) but maybe it would
be acceptable. Also how to keep the association with the original
pagecache page (and how it should be handled when the underlying file just
goes away) is unclear.

So clever ideas are needed and possibly some input from FS / MM / RDMA
folks about what might be acceptable.

Honza

[1] https://www.spinics.net/lists/linux-xfs/msg14468.html

-- 
Jan Kara 
SUSE Labs, CR


Re: blkdev loop UAF

2018-02-05 Thread Jan Kara
On Thu 01-02-18 19:58:45, Eric Biggers wrote:
> On Thu, Jan 11, 2018 at 06:00:08PM +0100, Jan Kara wrote:
> > On Thu 11-01-18 19:22:39, Hou Tao wrote:
> > > Hi,
> > > 
> > > On 2018/1/11 16:24, Dan Carpenter wrote:
> > > > Thanks for your report and the patch.  I am sending it to the
> > > > linux-block devs since it's already public.
> > > > 
> > > > regards,
> > > > dan carpenter
> > > 
> > > The use-after-free problem is not specific to the loop device, it can
> > > also be reproduced on scsi devices, and there are more race problems
> > > caused by the race between bdev open and gendisk shutdown [1].
> > > 
> > > The cause of the UAF problem is that there are two instances of gendisk
> > > which share the same bdev. After the process owning the new gendisk
> > > increases bdev->bd_openers, the other process which owns the older
> > > gendisk will find bdev->bd_openers is not zero, put the last reference
> > > of the older gendisk, and cause a use-after-free.
> > > 
> > > I had proposed a patch for the problem, but it's still an incomplete
> > > fix for the race between gendisk shutdown and bdev opening.
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 4a181fc..5ecdb9f 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1510,6 +1510,11 @@ static int __blkdev_get(struct block_device *bdev, 
> > > fmode_t mode, int for_part)
> > > if (bdev->bd_bdi == &noop_backing_dev_info)
> > > bdev->bd_bdi = 
> > > bdi_get(disk->queue->backing_dev_info);
> > > } else {
> > > +   if (bdev->bd_disk != disk) {
> > > +   ret = -ENXIO;
> > > +   goto out_unlock_bdev;
> > > +   }
> > > +
> > > if (bdev->bd_contains == bdev) {
> > > ret = 0;
> > > if (bdev->bd_disk->fops->open)
> > > 
> > > 
> > > As far as I know, Jan Kara is working on these problems. So, Jan, any 
> > > suggestions ?
> > 
> > Yeah, thanks for the ping. I was working today to get this fixed. I have
> > about 6 patches (attached) which should fix all the outstanding issues I'm
> > aware of.  So far they are only boot tested. I will give them more
> > stress-testing next week and then post them officially but if you have time,
> > feel free to try them out as well. Thanks!
> > 
> 
> Jan, are you still planning to fix this?  This was also reported by syzbot here:
> https://groups.google.com/d/msg/syzkaller-bugs/9wXx4YULITQ/0SK8vqxhAgAJ.  Note
> that the C reproducer attached to that report doesn't work for me anymore
> following ae6650163c6 ("loop: fix concurrent lo_open/lo_release").  However,
> syzbot has still hit this on more recent kernels.  Here's a C reproducer that
> still works for me on Linus' tree as of today, though it can take 5-10 seconds:

Yes, I do plan to post the patches. I was just struggling with getting the
original failure reproduced (plus I had quite some other work, vacation) so
I couldn't really test whether my patches fix the reported problem. So
thanks for your reproducer, I'll try it out and see whether my patches fix
the problem.

Honza

> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/loop.h>
> 
> int main()
> {
>   int fd = open("/dev/loop-control", 0);
> 
>   if (!fork()) {
>   for (;;)
>   ioctl(fd, LOOP_CTL_ADD, 0);
>   }
> 
>   if (!fork()) {
>   for (;;)
>   ioctl(fd, LOOP_CTL_REMOVE, 0);
>   }
> 
>   fork();
>   for (;;) {
>   fd = open("/dev/loop0", O_EXCL);
>   close(fd);
>   }
> }
> 
> This is the KASAN splat I get from the above:
> 
> BUG: KASAN: use-after-free in disk_unblock_events+0x3e/0x40 block/genhd.c:1672
> Read of size 8 at addr 880030743670 by task systemd-udevd/165
> 
> CPU: 1 PID: 165 Comm: systemd-udevd Not tainted 4.15.0-08279-gfe53d1443a146 
> #38
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.0-20171110_100015-anatol 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0xb3/0x145 lib/dump_st

[PATCH 6/6] blockdev: Avoid two active bdev inodes for one device

2018-02-06 Thread Jan Kara
When blkdev_open() races with device removal and creation, it can happen
that an unhashed bdev inode gets associated with a newly created gendisk,
like this:

CPU0                            CPU1
blkdev_open()
  bdev = bd_acquire()
                                del_gendisk()
                                  bdev_unhash_inode(bdev);
                                remove device
                                create new device with the same number
  __blkdev_get()
    disk = get_gendisk()
      - gets reference to gendisk of the new device

Now another blkdev_open() will not find the original 'bdev' as it got
unhashed, will create a new one and associate it with the same 'disk', at
which point problems start as we have two independent page caches for one
device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired the gendisk reference. That way we make sure a gendisk
can get associated only with visible bdev inodes.

Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe41a76769fa..fe09ef9c21f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1058,6 +1058,27 @@ static int bd_prepare_to_claim(struct block_device *bdev,
return 0;
 }
 
+static struct gendisk *bdev_get_gendisk(struct block_device *bdev, int *partno)
+{
+   struct gendisk *disk = get_gendisk(bdev->bd_dev, partno);
+
+   if (!disk)
+   return NULL;
+   /*
+* Now that we hold a gendisk reference, make sure the bdev we looked
+* up is not stale. If it is, the device got removed and recreated
+* before we looked up the gendisk, so we fail the open in that case.
+* Associating an unhashed bdev with a newly created gendisk could
+* lead to two bdevs (and thus two independent page caches) being
+* associated with one device, which is bad.
+*/
+   if (inode_unhashed(bdev->bd_inode)) {
+   put_disk_and_module(disk);
+   return NULL;
+   }
+   return disk;
+}
+
 /**
  * bd_start_claiming - start claiming a block device
  * @bdev: block device of interest
@@ -1094,7 +1115,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
 * @bdev might not have been initialized properly yet, look up
 * and grab the outer block device the hard way.
 */
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
return ERR_PTR(-ENXIO);
 
@@ -1429,7 +1450,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  restart:
 
ret = -ENXIO;
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
goto out;
 
-- 
2.13.6



[PATCH 3/6] genhd: Add helper put_disk_and_module()

2018-02-06 Thread Jan Kara
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara 
---
 block/blk-cgroup.c| 11 ++-
 block/genhd.c | 20 
 fs/block_dev.c| 19 +--
 include/linux/genhd.h |  1 +
 4 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524ca45b..c2033a232a44 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -812,7 +812,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
struct gendisk *disk;
struct request_queue *q;
struct blkcg_gq *blkg;
-   struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
char *body;
@@ -904,9 +903,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
 fail:
-   owner = disk->fops->owner;
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
/*
 * If queue was bypassing, we should retry.  Do so after a
 * short msleep().  It isn't strictly necessary but queue
@@ -931,13 +928,9 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(ctx->disk->queue->queue_lock) __releases(rcu)
 {
-   struct module *owner;
-
spin_unlock_irq(ctx->disk->queue->queue_lock);
rcu_read_unlock();
-   owner = ctx->disk->fops->owner;
-   put_disk(ctx->disk);
-   module_put(owner);
+   put_disk_and_module(ctx->disk);
 }
 EXPORT_SYMBOL_GPL(blkg_conf_finish);
 
diff --git a/block/genhd.c b/block/genhd.c
index 058cf91f8d32..64c323549a22 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -802,10 +802,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
-   struct module *owner = disk->fops->owner;
-
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
disk = NULL;
}
return disk;
@@ -1468,6 +1465,21 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
+/*
+ * This is a counterpart of get_disk_and_module() and thus also of
+ * get_gendisk().
+ */
+void put_disk_and_module(struct gendisk *disk)
+{
+   if (disk) {
+   struct module *owner = disk->fops->owner;
+
+   put_disk(disk);
+   module_put(owner);
+   }
+}
+EXPORT_SYMBOL(put_disk_and_module);
+
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
 {
char event[] = "DISK_RO=1";
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..1dbbf847911a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -,8 +,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
else
whole = bdgrab(bdev);
 
-   module_put(disk->fops->owner);
-   put_disk(disk);
+   put_disk_and_module(disk);
if (!whole)
return ERR_PTR(-ENOMEM);
 
@@ -1407,7 +1406,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part);
 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 {
struct gendisk *disk;
-   struct module *owner;
int ret;
int partno;
int perm = 0;
@@ -1433,7 +1431,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
-   owner = disk->fops->owner;
 
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
@@ -1463,8 +1460,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
goto restart;
}
}
@@ -1525,8 +1521,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
goto out_unlock_bdev;
}
/* only one opener holds refs to the module and disk */
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
@@ -1546,8 +1541,7 @@ static int _

[PATCH 2/6] genhd: Rename get_disk() to get_disk_and_module()

2018-02-06 Thread Jan Kara
Rename get_disk() to get_disk_and_module() to make clear what the
function does. It's not a great name but at least it is now clear that
put_disk() is not its counterpart.

Signed-off-by: Jan Kara 
---
 block/genhd.c   | 10 --
 drivers/block/amiflop.c |  2 +-
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c|  2 +-
 drivers/block/swim.c|  2 +-
 drivers/block/z2ram.c   |  2 +-
 drivers/ide/ide-probe.c |  2 +-
 include/linux/genhd.h   |  2 +-
 10 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 22f4fc340a3a..058cf91f8d32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -547,7 +547,7 @@ static int exact_lock(dev_t devt, void *data)
 {
struct gendisk *p = data;
 
-   if (!get_disk(p))
+   if (!get_disk_and_module(p))
return -1;
return 0;
 }
@@ -794,7 +794,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 
spin_lock_bh(&ext_devt_lock);
part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
-   if (part && get_disk(part_to_disk(part))) {
+   if (part && get_disk_and_module(part_to_disk(part))) {
*partno = part->partno;
disk = part_to_disk(part);
}
@@ -1441,7 +1441,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
-struct kobject *get_disk(struct gendisk *disk)
+struct kobject *get_disk_and_module(struct gendisk *disk)
 {
struct module *owner;
struct kobject *kobj;
@@ -1459,15 +1459,13 @@ struct kobject *get_disk(struct gendisk *disk)
return kobj;
 
 }
-
-EXPORT_SYMBOL(get_disk);
+EXPORT_SYMBOL(get_disk_and_module);
 
 void put_disk(struct gendisk *disk)
 {
if (disk)
kobject_put(&disk_to_dev(disk)->kobj);
 }
-
 EXPORT_SYMBOL(put_disk);
 
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index e5aa62fcf5a8..3aaf6af3ec23 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1758,7 +1758,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (unit[drive].type->code == FD_NODRIVE)
return NULL;
*part = 0;
-   return get_disk(unit[drive].gendisk);
+   return get_disk_and_module(unit[drive].gendisk);
 }
 
 static int __init amiga_floppy_probe(struct platform_device *pdev)
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 8bc3b9fd8dd2..dfb2c2622e5a 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -1917,7 +1917,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (drive >= FD_MAX_UNITS || type > NUM_DISK_MINORS)
return NULL;
*part = 0;
-   return get_disk(unit[drive].disk);
+   return get_disk_and_module(unit[drive].disk);
 }
 
 static int __init atari_floppy_init (void)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8028a3a7e7fd..deea78e485da 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -456,7 +456,7 @@ static struct kobject *brd_probe(dev_t dev, int *part, void 
*data)
 
mutex_lock(&brd_devices_mutex);
brd = brd_init_one(MINOR(dev) / max_part, &new);
-   kobj = brd ? get_disk(brd->brd_disk) : NULL;
+   kobj = brd ? get_disk_and_module(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);
 
if (new)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index eae484acfbbc..8ec7235fc93b 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4505,7 +4505,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (((*part >> 2) & 0x1f) >= ARRAY_SIZE(floppy_type))
return NULL;
*part = 0;
-   return get_disk(disks[drive]);
+   return get_disk_and_module(disks[drive]);
 }
 
 static int __init do_floppy_init(void)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..87855b5123a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1922,7 +1922,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
if (err < 0)
kobj = NULL;
else
-   kobj = get_disk(lo->lo_disk);
+   kobj = get_disk_and_module(lo->lo_disk);
mutex_unlock(&loop_index_mutex);
 
*part = 0;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 84434d3ea19b..64e066eba72e 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -799,7 +799,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
return NULL;
 
*part = 0;
-   return get_disk(swd->unit[drive].disk)

[PATCH 0/6] block: Fix races in bdev - gendisk handling

2018-02-06 Thread Jan Kara
Hello,

these patches fix races in the association of the block device inode with
the corresponding gendisk which happen when devices are frequently
destroyed and recreated. Generally when such a race happens, it results in
use-after-free issues, block device page cache inconsistencies, or other
problems. I have verified these patches fix the use-after-free issues that
could be reproduced by frequent creation and destruction of a loop device.
I was not able to reproduce the races reported by Hou Tao [1] related to
the gendisk-blkdev association (at least I was not able to hit any issues
when stressing scsi_debug in various ways, with or without the patches).
My patches should fix those as well, but it would be great if Tao could
verify that. Any comments and review welcome!

Honza

[1] https://www.spinics.net/lists/linux-block/msg20015.html


[PATCH 4/6] genhd: Fix use after free in __blkdev_get()

2018-02-06 Thread Jan Kara
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use the looked-up gendisk after it is freed:

CPU0                        CPU1                        CPU2
                                                        del_gendisk(disk);
                                                          bdev_unhash_inode(inode);
                                                        <device removed>
blkdev_open()               blkdev_open()
  bdev = bd_acquire(inode);
    - creates and returns new inode
                              bdev = bd_acquire(inode);
                                - returns the same inode
  __blkdev_get(devt)          __blkdev_get(devt)
    disk = get_gendisk(devt);
      - got structure of device going away
                                                        <new device with the same
                                                         number created>
                              disk = get_gendisk(devt);
                                - got new device structure
    if (!bdev->bd_openers) {
      does the first open
    }
                              if (!bdev->bd_openers)
                                - false
                              } else {
                                put_disk_and_module(disk)
                                  - remember this was old device
                                  - this was last ref and disk is
                                    now freed
                              }
    disk_unblock_events(disk); -> oops

Fix the problem by making sure we drop the reference to the disk in
__blkdev_get() only after we are really done with it.

Reported-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1dbbf847911a..fe41a76769fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1409,6 +1409,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
int ret;
int partno;
int perm = 0;
+   bool first_open = false;
 
if (mode & FMODE_READ)
perm |= MAY_READ;
@@ -1435,6 +1436,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
+   first_open = true;
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
@@ -1520,14 +1522,15 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
if (ret)
goto out_unlock_bdev;
}
-   /* only one opener holds refs to the module and disk */
-   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
bdev->bd_part_count++;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
+   /* only one opener holds refs to the module and disk */
+   if (!first_open)
+   put_disk_and_module(disk);
return 0;
 
  out_clear:
-- 
2.13.6



[PATCH 1/6] genhd: Fix leaked module reference for NVME devices

2018-02-06 Thread Jan Kara
Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop the module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk(), but let's fix that in the follow-up
commit since that will be more intrusive.

Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..22f4fc340a3a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -802,7 +802,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   struct module *owner = disk->fops->owner;
+
put_disk(disk);
+   module_put(owner);
disk = NULL;
}
return disk;
-- 
2.13.6



[PATCH 5/6] genhd: Fix BUG in blkdev_open()

2018-02-06 Thread Jan Kara
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0                            CPU1                            CPU2
                                                                del_gendisk()
                                                                  bdev_unhash_inode(part1);
                                                                <device removed>
blkdev_open(part1, O_EXCL)      blkdev_open(part1, O_EXCL)
  bdev = bd_acquire()             bdev = bd_acquire()
  blkdev_get(bdev)
    bd_start_claiming(bdev)
      - finds old inode 'whole'
      bd_prepare_to_claim() -> 0
                                                                  bdev_unhash_inode(whole);
                                                                <device removed>
                                                                <new device under the same
                                                                 number created>
                                  blkdev_get(bdev);
                                    bd_start_claiming(bdev)
                                      - finds new inode 'whole'
                                      bd_prepare_to_claim()
                                        - this also succeeds as we have
                                          different 'whole' here...
                                        - bad things happen now as we
                                          have two exclusive openers of
                                          the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing a new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk(), and furthermore by
making sure that get_gendisk() does not return a gendisk that is being
(or has been) deleted. This makes sure that once we ever manage to look
up a newly created bdev inode, we are also guaranteed that a following
get_gendisk() will either return failure (and we fail the open) or
return the gendisk for the new device, and a following bdget_disk() will
return the new bdev inode (i.e., blkdev_open() follows the path as if it
were run completely after the new device is created).

Reported-and-analyzed-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 21 -
 include/linux/genhd.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 64c323549a22..c6f68c332bfe 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -703,6 +703,11 @@ void del_gendisk(struct gendisk *disk)
blk_integrity_del(disk);
disk_del_events(disk);
 
+   /*
+* Block lookups of the disk until all bdevs are unhashed and the
+* disk is marked as dead (GENHD_FL_UP cleared).
+*/
+   down_write(&disk->lookup_sem);
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
@@ -717,6 +722,7 @@ void del_gendisk(struct gendisk *disk)
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
+   up_write(&disk->lookup_sem);
 
if (!(disk->flags & GENHD_FL_HIDDEN))
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -801,9 +807,21 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
spin_unlock_bh(&ext_devt_lock);
}
 
-   if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   if (!disk)
+   return NULL;
+
+   /*
+* Synchronize with del_gendisk() to not return disk that is being
+* destroyed.
+*/
+   down_read(&disk->lookup_sem);
+   if (unlikely((disk->flags & GENHD_FL_HIDDEN) ||
+!(disk->flags & GENHD_FL_UP))) {
+   up_read(&disk->lookup_sem);
put_disk_and_module(disk);
disk = NULL;
+   } else {
+   up_read(&disk->lookup_sem);
}
return disk;
 }
@@ -1403,6 +1421,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
kfree(disk);
return NULL;
}
+   init_rwsem(&disk->lookup_sem);
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 07b715cdeb93..7b548253eaef 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -198,6 +198,7 @@ struct gendisk {
void *private_data;
 
int flags;
+   struct rw_semaphore lookup_sem;
struct kobject *slave_dir;
 
struct timer_rand_state *random;
-- 
2.13.6



Re: [LSF/MM TOPIC] get_user_pages() and filesystems

2018-02-06 Thread Jan Kara
Hello,

On Fri 02-02-18 15:04:11, Liu Bo wrote:
> On Thu, Jan 25, 2018 at 12:57:27PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > this is about a problem I have identified last month and for which I still
> > don't have good solution. Some discussion of the problem happened here [1]
> > where also technical details are posted but culprit of the problem is
> > relatively simple: Lots of places in kernel (fs code, writeback logic,
> > stable-pages framework for DIF/DIX) assume that file pages in page cache
> > can be modified either via write(2), truncate(2), fallocate(2) or similar
> > code paths explicitely manipulating with file space or via a writeable
> > mapping into page tables. In particular we assume that if we block all the
> > above paths by taking proper locks, block page faults, and unmap (/ map
> > read-only) the page, it cannot be modified. But this assumption is violated
> > by get_user_pages() users (such as direct IO or RDMA drivers - and we've
> > got reports from such users of weird things happening).
> > 
> > The problem with GUP users is that they acquire page reference (at that
> > point page is writeably mapped into page tables) and some time in future
> > (which can be quite far in case of RDMA) page contents gets modified and
> > page marked dirty.
> 
> I got a question here, when you say 'page contents gets modified', do
> you mean that GUP users modify the page content?

Yes.

> I have another story about GUP users who use direct IO: qemu sometimes
> doesn't work well with btrfs when checksums are enabled and reports
> checksum failures when the guest OS doesn't use stable pages; there it is
> not the GUP users but the original file/mapping that may be changing the
> page content in flight.

OK, but that is kind of expected, isn't it? The whole purpose of 'stable
pages' is exactly to prevent pages from being modified while IO is in
flight. So if a device image is backed by storage (a filesystem in this
case) which checksums data, qemu should present it to the guest as a block
device supporting DIF/DIX and thus requiring stable pages...
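
Schematically (illustration only - compute_checksum() stands in for the
filesystem's data checksumming):

	csum = compute_checksum(page);	/* fs checksums the data... */
	submit_bio(bio);		/* ...and the write DMA starts */
	/* without stable pages the guest rewrites the page right here, */
	/* so the device stores new data with the old checksum and a    */
	/* later read of that block fails checksum verification         */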

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-07 Thread Jan Kara
On Mon 05-02-18 17:58:12, Bart Van Assche wrote:
> On Sat, 2018-02-03 at 10:51 +0800, Joseph Qi wrote:
> > Hi Bart,
> > 
> > On 18/2/3 00:21, Bart Van Assche wrote:
> > > On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
> > > > We triggered this race when using single queue. I'm not sure if it
> > > > exists in multi-queue.
> > > 
> > > Regarding the races between modifying the queue_lock pointer and the code 
> > > that
> > > uses that pointer, I think the following construct in blk_cleanup_queue() 
> > > is
> > > sufficient to avoid races between the queue_lock pointer assignment and 
> > > the code
> > > that executes concurrently with blk_cleanup_queue():
> > > 
> > >   spin_lock_irq(lock);
> > >   if (q->queue_lock != &q->__queue_lock)
> > >   q->queue_lock = &q->__queue_lock;
> > >   spin_unlock_irq(lock);
> > > 
> > 
> > IMO, the race also exists.
> > 
> > blk_cleanup_queue   blkcg_print_blkgs
> >   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> > q->queue_lock = &q->__queue_lock (3)
> >   spin_unlock_irq(lock) (4)
> > spin_unlock_irq(blkg->q->queue_lock) (6)
> > 
> > (1) take driver lock;
> > (2) busy loop for driver lock;
> > (3) override driver lock with internal lock;
> > (4) unlock driver lock; 
> > (5) can take driver lock now;
> > (6) but unlock internal lock.
> > 
> > If we get blkg->q->queue_lock to local first like blk_cleanup_queue,
> > it indeed can fix the different lock use in lock/unlock. But since
> > blk_cleanup_queue has overridden queue lock to internal lock now, I'm
> > afraid we couldn't still use driver lock in blkcg_print_blkgs.
> 
> (+ Jan Kara)
> 
> Hello Joseph,
> 
> That's a good catch. Since modifying all code that accesses the queue_lock
> pointer and that can race with blk_cleanup_queue() would be too cumbersome I
> see only one solution, namely making the request queue cgroup and sysfs
> attributes invisible before the queue_lock pointer is restored. Leaving the
> debugfs attributes visible while blk_cleanup_queue() is in progress should
> be fine if the request queue initialization code is modified such that it
> only modifies the queue_lock pointer for legacy queues. Jan, I think some
> time ago you objected when I proposed to move code from __blk_release_queue()
> into blk_cleanup_queue(). Would you be fine with a slightly different
> approach, namely making block cgroup and sysfs attributes invisible earlier,
> namely from inside blk_cleanup_queue() instead of from inside
> __blk_release_queue()?

Making attributes invisible earlier should be fine. But this whole
switching of queue_lock in blk_cleanup_queue() looks error-prone to me.
Generally anyone having access to the request_queue can have an old value
of q->queue_lock in its CPU caches and can happily use that value after
blk_cleanup_queue() finishes and the driver-specific structure storing the
lock is freed. blkcg_print_blkgs() is one such example but I wouldn't bet
a penny that there are no other paths with a similar problem.

Logically, the lifetime of the storage for q->queue_lock should be at
least as long as that of the request_queue itself - i.e., released only
after __blk_release_queue(). Otherwise all users of q->queue_lock need
proper synchronization against the lock switch in blk_cleanup_queue().
Either of these looks like a lot of work. I guess since this involves only
a legacy path, your approach of removing the sysfs attributes earlier
might be a reasonable band-aid.
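
The fragile pattern is simply (an illustration, not actual block layer
code - do_stuff() stands for whatever runs under the lock):

	spin_lock_irq(q->queue_lock);	/* takes the driver's lock */
	/*
	 * blk_cleanup_queue() can switch q->queue_lock to
	 * &q->__queue_lock here and the driver may then free the
	 * structure containing the original lock...
	 */
	do_stuff(q);
	spin_unlock_irq(q->queue_lock);	/* ...so this unlocks a different
					   lock, or touches freed memory */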

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 5/6] genhd: Fix BUG in blkdev_open()

2018-02-14 Thread Jan Kara
On Tue 13-02-18 10:11:37, Hou Tao wrote:
> Hi Jan,
> 
> On 2018/2/7 0:05, Jan Kara wrote:
> > When two blkdev_open() calls for a partition race with device removal
> > and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
> > blkdev_open(). The race can happen as follows:
> > 
> > CPU0                            CPU1                            CPU2
> >                                                                 del_gendisk()
> >                                                                   bdev_unhash_inode(part1);
> >                                                                 <device removed>
> > blkdev_open(part1, O_EXCL)      blkdev_open(part1, O_EXCL)
> >   bdev = bd_acquire()             bdev = bd_acquire()
> >   blkdev_get(bdev)
> >     bd_start_claiming(bdev)
> >       - finds old inode 'whole'
> >       bd_prepare_to_claim() -> 0
> >                                                                   bdev_unhash_inode(whole);
> >                                                                 <device removed>
> >                                                                 <new device under the same
> >                                                                  number created>
> >                                   blkdev_get(bdev);
> >                                     bd_start_claiming(bdev)
> >                                       - finds new inode 'whole'
> >                                       bd_prepare_to_claim()
> >                                         - this also succeeds as we have
> >                                           different 'whole' here...
> >                                         - bad things happen now as we
> >                                           have two exclusive openers of
> >                                           the same bdev
> > 
> > The problem here is that block device opens can see various intermediate
> > states while gendisk is shutting down and then being recreated.
> > 
> > We fix the problem by introducing new lookup_sem in gendisk that
> > synchronizes gendisk deletion with get_gendisk() and furthermore by
> > making sure that get_gendisk() does not return gendisk that is being (or
> > has been) deleted. This makes sure that once we ever manage to look up
> > newly created bdev inode, we are also guaranteed that following
> > get_gendisk() will either return failure (and we fail open) or it
> > returns gendisk for the new device and following bdget_disk() will
> > return new bdev inode (i.e., blkdev_open() follows the path as if it is
> > completely run after new device is created).
> > 
> > Reported-and-analyzed-by: Hou Tao 
> > Signed-off-by: Jan Kara 
> > ---
> >  block/genhd.c | 21 -
> >  include/linux/genhd.h |  1 +
> >  2 files changed, 21 insertions(+), 1 deletion(-)
> > 
> 
> Before applying the patch set, the BUG_ON in blkdev_open() would reproduce
> in about 10 minutes or less. Now after applying the patch set and running
> for about 8 hours or more, the bug is no longer reproducible, so
> 
> Tested-by: Hou Tao 
> 
> Based on the test result, it seems that this patch alone cannot fix the
> BUG_ON in blkdev_open(). Patch 6 is also needed to fix the BUG_ON problem.

Thanks a lot for the testing! Jens, can you please pick up these fixes?
Thanks!

Honza

> > diff --git a/block/genhd.c b/block/genhd.c
> > index 64c323549a22..c6f68c332bfe 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -703,6 +703,11 @@ void del_gendisk(struct gendisk *disk)
> > blk_integrity_del(disk);
> > disk_del_events(disk);
> >  
> > +   /*
> > +* Block lookups of the disk until all bdevs are unhashed and the
> > +* disk is marked as dead (GENHD_FL_UP cleared).
> > +*/
> > +   down_write(&disk->lookup_sem);
> > /* invalidate stuff */
> > disk_part_iter_init(&piter, disk,
> >  DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
> > @@ -717,6 +722,7 @@ void del_gendisk(struct gendisk *disk)
> > bdev_unhash_inode(disk_devt(disk));
> > set_capacity(disk, 0);
> > disk->flags &= ~GENHD_FL_UP;
> > +   up_write(&disk->lookup_sem);
> >  
> > if (!(disk->flags & GENHD_FL_HIDDEN))
> > sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
> > @@ -801,9 +807,21 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
> > spin_unlock_bh(&ext_devt_lock);
> > }
> >  
> > -   if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
> 

Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-26 Thread Jan Kara
48024]  __handle_domain_irq+0x84/0xf0
> [  162.650304]  gic_handle_irq+0x58/0xa8
> [  162.652917]  el1_irq+0xb4/0x130
> [  162.656972]  arch_cpu_idle+0x18/0x28
> [  162.658445]  default_idle_call+0x1c/0x34
> [  162.660723]  do_idle+0x17c/0x1f0
> [  162.664958]  cpu_startup_entry+0x20/0x28
> [  162.670009]  rest_init+0x250/0x260
> [  162.672763]  start_kernel+0x3f0/0x41c
> [  162.677064] WARNING: CPU: 0 PID: 0 at block/blk.h:297 
> generic_make_request_checks+0x670/0x750
> [  162.682622] Kernel panic - not syncing: panic_on_warn set ...
> [  162.682622]
> [  162.695097] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW
> 4.16.0-rc2 #1
> [  162.700355] Hardware name: linux,dummy-virt (DT)
> [  162.713462] Call trace:
> [  162.714859]  dump_backtrace+0x0/0x1c8
> [  162.722935]  show_stack+0x14/0x20
> [  162.725392]  dump_stack+0xac/0xe4
> [  162.727719]  panic+0x13c/0x2b8
> [  162.732313]  panic+0x0/0x2b8
> [  162.735233]  report_bug+0xb4/0x158
> [  162.738377]  bug_handler+0x30/0x88
> [  162.740810]  brk_handler+0xd8/0x198
> [  162.747677]  do_debug_exception+0x9c/0x190
> [  162.752011]  el1_dbg+0x18/0x78
> [  162.755197]  generic_make_request_checks+0x670/0x750
> [  162.761260]  generic_make_request+0x2c/0x278
> [  162.764366]  submit_bio+0x54/0x208
> [  162.771402]  submit_bio_wait+0x68/0xa0
> [  162.775053]  blkdev_issue_flush+0x8c/0xc8
> [  162.780520]  ext4_sync_file+0x220/0x330
> [  162.784815]  vfs_fsync_range+0x48/0xc0
> [  162.788243]  dio_complete+0x1fc/0x220
> [  162.792071]  dio_bio_end_aio+0xf0/0x130
> [  162.794820]  bio_endio+0xe8/0xf8
> [  162.796203]  blk_update_request+0x80/0x2e8
> [  162.800883]  blk_mq_end_request+0x20/0x70
> [  162.803072]  virtblk_request_done+0x24/0x30
> [  162.806877]  __blk_mq_complete_request+0x100/0x1b0
> [  162.809298]  blk_mq_complete_request+0x60/0x98
> [  162.811408]  virtblk_done+0x70/0xf8
> [  162.812461]  vring_interrupt+0x38/0x50
> [  162.813467]  __handle_irq_event_percpu+0x5c/0x148
> [  162.814657]  handle_irq_event_percpu+0x34/0x88
> [  162.815730]  handle_irq_event+0x48/0x78
> [  162.816856]  handle_fasteoi_irq+0xc0/0x198
> [  162.817946]  generic_handle_irq+0x24/0x38
> [  162.818948]  __handle_domain_irq+0x84/0xf0
> [  162.820056]  gic_handle_irq+0x58/0xa8
> [  162.821013]  el1_irq+0xb4/0x130
> [  162.826572]  arch_cpu_idle+0x18/0x28
> [  162.827304]  default_idle_call+0x1c/0x34
> [  162.828315]  do_idle+0x17c/0x1f0
> [  162.829174]  cpu_startup_entry+0x20/0x28
> [  162.830072]  rest_init+0x250/0x260
> [  162.830872]  start_kernel+0x3f0/0x41c
> [  162.831627] SMP: stopping secondary CPUs
> [  165.929235] SMP: failed to stop secondary CPUs 0,2
> [  165.949001] Kernel Offset: disabled
> [  165.963649] CPU features: 0x1802082
> [  165.969405] Memory Limit: none
> [  165.974480] Rebooting in 86400 seconds..
-- 
Jan Kara 
SUSE Labs, CR
From 501d97ed88f5020a55a0de4d546df5ad11461cea Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 26 Feb 2018 11:36:52 +0100
Subject: [PATCH] direct-io: Fix sleep in atomic due to sync AIO

Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added an additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be deferred to a workqueue, so AIO DIO with
RWF_[D]SYNC set will call fsync() from IRQ context, resulting in a sleep
in atomic context.

Fix the problem by checking the iocb flags directly (the same way as is
done in dio_complete()) instead of checking all the conditions that could
lead to the IO being synchronous.

CC: Christoph Hellwig 
CC: Goldwyn Rodrigues 
CC: sta...@vger.kernel.org
Reported-by: Mark Rutland 
Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
Signed-off-by: Jan Kara 
---
 fs/direct-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index a0ca9e48e993..1357ef563893 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1274,8 +1274,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
 		retval = 0;
-		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
-		IS_SYNC(iocb->ki_filp->f_mapping->host))
+		if (iocb->ki_flags & IOCB_DSYNC)
 			retval = dio_set_defer_completion(dio);
 		else if (!dio->inode->i_sb->s_dio_done_wq) {
 			/*
-- 
2.13.6
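
For reference, a minimal userspace trigger for this bug can be sketched as
follows. This is a hedged sketch, not the actual syzkaller program: the file
path is arbitrary, and a 4.13+ kernel (which has the aio_rw_flags iocb
field) plus an O_DIRECT-capable filesystem such as ext4 on virtio-blk are
assumed.

/*
 * Sketch: submit an O_DIRECT AIO write with RWF_DSYNC. On kernels
 * without the fix above, dio_bio_end_aio() -> dio_complete() runs the
 * fsync work from the block IRQ handler, triggering sleep-in-atomic.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#ifndef RWF_DSYNC
#define RWF_DSYNC 0x00000002	/* from <linux/fs.h> */
#endif

int main(void)
{
	aio_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("/tmp/aio-dsync-test", O_RDWR | O_CREAT | O_DIRECT, 0600);

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 'x', 4096);
	if (syscall(SYS_io_setup, 16, &ctx))
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_fildes = fd;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = 4096;
	cb.aio_rw_flags = RWF_DSYNC;	/* the flag 9830f4be159b allowed for AIO */

	if (syscall(SYS_io_submit, ctx, 1, cbs) != 1)
		return 1;
	/* completion (and, on broken kernels, the fsync) happens in IRQ context */
	return syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1;
}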



[PATCH 3/6] genhd: Add helper put_disk_and_module()

2018-02-26 Thread Jan Kara
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is open-coded in several places.

Signed-off-by: Jan Kara 
---
 block/blk-cgroup.c    | 11 ++---------
 block/genhd.c         | 20 ++++++++++++++++----
 fs/block_dev.c        | 19 +++++--------------
 include/linux/genhd.h |  1 +
 4 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524ca45b..c2033a232a44 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -812,7 +812,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
struct gendisk *disk;
struct request_queue *q;
struct blkcg_gq *blkg;
-   struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
char *body;
@@ -904,9 +903,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
 fail:
-   owner = disk->fops->owner;
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
/*
 * If queue was bypassing, we should retry.  Do so after a
 * short msleep().  It isn't strictly necessary but queue
@@ -931,13 +928,9 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(ctx->disk->queue->queue_lock) __releases(rcu)
 {
-   struct module *owner;
-
spin_unlock_irq(ctx->disk->queue->queue_lock);
rcu_read_unlock();
-   owner = ctx->disk->fops->owner;
-   put_disk(ctx->disk);
-   module_put(owner);
+   put_disk_and_module(ctx->disk);
 }
 EXPORT_SYMBOL_GPL(blkg_conf_finish);
 
diff --git a/block/genhd.c b/block/genhd.c
index 21b2843b27d0..4c0590434591 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -817,10 +817,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
-   struct module *owner = disk->fops->owner;
-
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
disk = NULL;
}
return disk;
@@ -1483,6 +1480,21 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
+/*
+ * This is a counterpart of get_disk_and_module() and thus also of
+ * get_gendisk().
+ */
+void put_disk_and_module(struct gendisk *disk)
+{
+   if (disk) {
+   struct module *owner = disk->fops->owner;
+
+   put_disk(disk);
+   module_put(owner);
+   }
+}
+EXPORT_SYMBOL(put_disk_and_module);
+
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
 {
char event[] = "DISK_RO=1";
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..1dbbf847911a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -,8 +,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
else
whole = bdgrab(bdev);
 
-   module_put(disk->fops->owner);
-   put_disk(disk);
+   put_disk_and_module(disk);
if (!whole)
return ERR_PTR(-ENOMEM);
 
@@ -1407,7 +1406,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part);
 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 {
struct gendisk *disk;
-   struct module *owner;
int ret;
int partno;
int perm = 0;
@@ -1433,7 +1431,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
-   owner = disk->fops->owner;
 
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
@@ -1463,8 +1460,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
goto restart;
}
}
@@ -1525,8 +1521,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
goto out_unlock_bdev;
}
/* only one opener holds refs to the module and disk */
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
@@ -1546,8 +1541,7 @@ static int _

[PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jan Kara
Hello,

these patches fix races in the association of a block device inode with the
corresponding gendisk that happen when devices are frequently destroyed and
recreated. Generally, when such a race happens, it results in use-after-free
issues, block device page cache inconsistencies, or other problems. I have
verified that these patches fix use-after-free issues that could be
reproduced by frequent creation and destruction of a loop device. Hou Tao
has verified that the races reported by him in [1] related to the
gendisk-blkdev association were also fixed. Jens, can you please merge
these patches? Thanks!

Changes since v1:
* Added tested-by tags

Honza

[1] https://www.spinics.net/lists/linux-block/msg20015.html
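
For reference, the loop device stress mentioned above can be sketched
roughly as follows. This is a hedged illustration only: the device index
and the absence of error handling are arbitrary, not the exact test that
was run.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
	int ctl = open("/dev/loop-control", O_RDWR);

	if (ctl < 0)
		return 1;
	if (fork() == 0) {
		/* churn: repeatedly create and destroy /dev/loop7 */
		for (;;) {
			ioctl(ctl, LOOP_CTL_ADD, 7);
			ioctl(ctl, LOOP_CTL_REMOVE, 7);
		}
	}
	/* race exclusive opens against the churn above */
	for (;;) {
		int fd = open("/dev/loop7", O_RDWR | O_EXCL);

		if (fd >= 0)
			close(fd);
	}
}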


[PATCH 5/6] genhd: Fix BUG in blkdev_open()

2018-02-26 Thread Jan Kara
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0                            CPU1                    CPU2
                                                        del_gendisk()
                                                          bdev_unhash_inode(part1);

blkdev_open(part1, O_EXCL)      blkdev_open(part1, O_EXCL)
  bdev = bd_acquire()             bdev = bd_acquire()
  blkdev_get(bdev)
    bd_start_claiming(bdev)
      - finds old inode 'whole'
      bd_prepare_to_claim() -> 0
                                                          bdev_unhash_inode(whole);
                                                        <device removed>
                                                        <new device under the
                                                         same number created>
                                  blkdev_get(bdev);
                                    bd_start_claiming(bdev)
                                      - finds new inode 'whole'
                                      bd_prepare_to_claim()
                                        - this also succeeds as we have
                                          different 'whole' here...
                                        - bad things happen now as we
                                          have two exclusive openers of
                                          the same bdev

The problem here is that block device opens can see various intermediate
states while the gendisk is shutting down and then being recreated.

We fix the problem by introducing a new lookup_sem in the gendisk that
synchronizes gendisk deletion with get_gendisk(), and furthermore by
making sure that get_gendisk() does not return a gendisk that is being
(or has been) deleted. This guarantees that once we manage to look up the
newly created bdev inode, the following get_gendisk() will either return
failure (and we fail the open) or return the gendisk for the new device,
and the following bdget_disk() will return the new bdev inode (i.e.,
blkdev_open() follows the same path as if it were run entirely after the
new device was created).

Reported-and-analyzed-by: Hou Tao 
Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 21 ++++++++++++++++++++-
 include/linux/genhd.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 4c0590434591..9656f9e9f99e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -717,6 +717,11 @@ void del_gendisk(struct gendisk *disk)
blk_integrity_del(disk);
disk_del_events(disk);
 
+   /*
+* Block lookups of the disk until all bdevs are unhashed and the
+* disk is marked as dead (GENHD_FL_UP cleared).
+*/
+   down_write(&disk->lookup_sem);
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
@@ -731,6 +736,7 @@ void del_gendisk(struct gendisk *disk)
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
+   up_write(&disk->lookup_sem);
 
if (!(disk->flags & GENHD_FL_HIDDEN))
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -816,9 +822,21 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
spin_unlock_bh(&ext_devt_lock);
}
 
-   if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   if (!disk)
+   return NULL;
+
+   /*
+* Synchronize with del_gendisk() to not return disk that is being
+* destroyed.
+*/
+   down_read(&disk->lookup_sem);
+   if (unlikely((disk->flags & GENHD_FL_HIDDEN) ||
+!(disk->flags & GENHD_FL_UP))) {
+   up_read(&disk->lookup_sem);
put_disk_and_module(disk);
disk = NULL;
+   } else {
+   up_read(&disk->lookup_sem);
}
return disk;
 }
@@ -1418,6 +1436,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
kfree(disk);
return NULL;
}
+   init_rwsem(&disk->lookup_sem);
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7f5906fe1b70..c826b0b5232a 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -198,6 +198,7 @@ struct gendisk {
void *private_data;
 
int flags;
+   struct rw_semaphore lookup_sem;
struct kobject *slave_dir;
 
struct timer_rand_state *random;
-- 
2.13.6



[PATCH 2/6] genhd: Rename get_disk() to get_disk_and_module()

2018-02-26 Thread Jan Kara
Rename get_disk() to get_disk_and_module() to make clear what the
function does. It's not a great name, but at least it is now clear that
put_disk() is not its counterpart.

Signed-off-by: Jan Kara 
---
 block/genhd.c           | 10 ++++------
 drivers/block/amiflop.c |  2 +-
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c|  2 +-
 drivers/block/swim.c|  2 +-
 drivers/block/z2ram.c   |  2 +-
 drivers/ide/ide-probe.c |  2 +-
 include/linux/genhd.h   |  2 +-
 10 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 5098bffe6ba6..21b2843b27d0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -547,7 +547,7 @@ static int exact_lock(dev_t devt, void *data)
 {
struct gendisk *p = data;
 
-   if (!get_disk(p))
+   if (!get_disk_and_module(p))
return -1;
return 0;
 }
@@ -809,7 +809,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 
spin_lock_bh(&ext_devt_lock);
part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
-   if (part && get_disk(part_to_disk(part))) {
+   if (part && get_disk_and_module(part_to_disk(part))) {
*partno = part->partno;
disk = part_to_disk(part);
}
@@ -1456,7 +1456,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
-struct kobject *get_disk(struct gendisk *disk)
+struct kobject *get_disk_and_module(struct gendisk *disk)
 {
struct module *owner;
struct kobject *kobj;
@@ -1474,15 +1474,13 @@ struct kobject *get_disk(struct gendisk *disk)
return kobj;
 
 }
-
-EXPORT_SYMBOL(get_disk);
+EXPORT_SYMBOL(get_disk_and_module);
 
 void put_disk(struct gendisk *disk)
 {
if (disk)
kobject_put(&disk_to_dev(disk)->kobj);
 }
-
 EXPORT_SYMBOL(put_disk);
 
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index e5aa62fcf5a8..3aaf6af3ec23 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1758,7 +1758,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (unit[drive].type->code == FD_NODRIVE)
return NULL;
*part = 0;
-   return get_disk(unit[drive].gendisk);
+   return get_disk_and_module(unit[drive].gendisk);
 }
 
 static int __init amiga_floppy_probe(struct platform_device *pdev)
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 8bc3b9fd8dd2..dfb2c2622e5a 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -1917,7 +1917,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (drive >= FD_MAX_UNITS || type > NUM_DISK_MINORS)
return NULL;
*part = 0;
-   return get_disk(unit[drive].disk);
+   return get_disk_and_module(unit[drive].disk);
 }
 
 static int __init atari_floppy_init (void)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8028a3a7e7fd..deea78e485da 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -456,7 +456,7 @@ static struct kobject *brd_probe(dev_t dev, int *part, void 
*data)
 
mutex_lock(&brd_devices_mutex);
brd = brd_init_one(MINOR(dev) / max_part, &new);
-   kobj = brd ? get_disk(brd->brd_disk) : NULL;
+   kobj = brd ? get_disk_and_module(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);
 
if (new)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index eae484acfbbc..8ec7235fc93b 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4505,7 +4505,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (((*part >> 2) & 0x1f) >= ARRAY_SIZE(floppy_type))
return NULL;
*part = 0;
-   return get_disk(disks[drive]);
+   return get_disk_and_module(disks[drive]);
 }
 
 static int __init do_floppy_init(void)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..87855b5123a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1922,7 +1922,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
if (err < 0)
kobj = NULL;
else
-   kobj = get_disk(lo->lo_disk);
+   kobj = get_disk_and_module(lo->lo_disk);
mutex_unlock(&loop_index_mutex);
 
*part = 0;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 84434d3ea19b..64e066eba72e 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -799,7 +799,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
return NULL;
 
*part = 0;
-   return get_disk(swd->unit[drive].disk)

[PATCH 1/6] genhd: Fix leaked module reference for NVME devices

2018-02-26 Thread Jan Kara
Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop the module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading, as put_disk() is *not*
the counterpart of get_disk(), but let's fix that in the follow-up
commit since that will be more intrusive.

Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 88a53c188cb7..5098bffe6ba6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -817,7 +817,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   struct module *owner = disk->fops->owner;
+
put_disk(disk);
+   module_put(owner);
disk = NULL;
}
return disk;
-- 
2.13.6



[PATCH 6/6] blockdev: Avoid two active bdev inodes for one device

2018-02-26 Thread Jan Kara
When blkdev_open() races with device removal and creation, it can happen
that an unhashed bdev inode gets associated with a newly created gendisk
like this:

CPU0                            CPU1
blkdev_open()
  bdev = bd_acquire()
                                del_gendisk()
                                  bdev_unhash_inode(bdev);
                                remove device
                                create new device with the same number
  __blkdev_get()
    disk = get_gendisk()
      - gets reference to gendisk of the new device

Now another blkdev_open() will not find the original 'bdev' as it got
unhashed, will create a new one, and associate it with the same 'disk',
at which point problems start as we have two independent page caches for
one device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired the gendisk reference. That way we make sure a gendisk
can get associated only with visible bdev inodes.

Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe41a76769fa..fe09ef9c21f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1058,6 +1058,27 @@ static int bd_prepare_to_claim(struct block_device *bdev,
return 0;
 }
 
+static struct gendisk *bdev_get_gendisk(struct block_device *bdev, int *partno)
+{
+   struct gendisk *disk = get_gendisk(bdev->bd_dev, partno);
+
+   if (!disk)
+   return NULL;
+   /*
+* Now that we hold gendisk reference we make sure bdev we looked up is
+* not stale. If it is, it means device got removed and created before
+* we looked up gendisk and we fail open in such case. Associating
+* unhashed bdev with newly created gendisk could lead to two bdevs
+* (and thus two independent caches) being associated with one device
+* which is bad.
+*/
+   if (inode_unhashed(bdev->bd_inode)) {
+   put_disk_and_module(disk);
+   return NULL;
+   }
+   return disk;
+}
+
 /**
  * bd_start_claiming - start claiming a block device
  * @bdev: block device of interest
@@ -1094,7 +1115,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
 * @bdev might not have been initialized properly yet, look up
 * and grab the outer block device the hard way.
 */
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
return ERR_PTR(-ENXIO);
 
@@ -1429,7 +1450,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  restart:
 
ret = -ENXIO;
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
goto out;
 
-- 
2.13.6



[PATCH 4/6] genhd: Fix use after free in __blkdev_get()

2018-02-26 Thread Jan Kara
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use the looked-up gendisk after it is freed:

CPU0                            CPU1                    CPU2
                                                        del_gendisk(disk);
                                                          bdev_unhash_inode(inode);
blkdev_open()                   blkdev_open()
  bdev = bd_acquire(inode);
    - creates and returns new inode
                                  bdev = bd_acquire(inode);
                                    - returns the same inode
  __blkdev_get(devt)              __blkdev_get(devt)
    disk = get_gendisk(devt);
      - got structure of device going away
                                                        <device removed>
                                                        <new device under the
                                                         same number created>
                                  disk = get_gendisk(devt);
                                    - got new device structure
    if (!bdev->bd_openers) {
      does the first open
    }
                                  if (!bdev->bd_openers)
                                    - false
                                  } else {
                                    put_disk_and_module(disk)
                                      - remember this was old device -
                                        this was last ref and disk is
                                        now freed
                                  }
                                  disk_unblock_events(disk); -> oops

Fix the problem by making sure we drop the reference to the disk in
__blkdev_get() only after we are really done with it.

Reported-by: Hou Tao 
Tested-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1dbbf847911a..fe41a76769fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1409,6 +1409,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
int ret;
int partno;
int perm = 0;
+   bool first_open = false;
 
if (mode & FMODE_READ)
perm |= MAY_READ;
@@ -1435,6 +1436,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
+   first_open = true;
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
@@ -1520,14 +1522,15 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
if (ret)
goto out_unlock_bdev;
}
-   /* only one opener holds refs to the module and disk */
-   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
bdev->bd_part_count++;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
+   /* only one opener holds refs to the module and disk */
+   if (!first_open)
+   put_disk_and_module(disk);
return 0;
 
  out_clear:
-- 
2.13.6



Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context

2018-02-26 Thread Jan Kara
On Mon 26-02-18 11:38:19, Mark Rutland wrote:
> On Mon, Feb 26, 2018 at 11:52:56AM +0100, Jan Kara wrote:
> > On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> > > Hi all,
> > > 
> > > While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> > > number of splats in the block layer:
> > > 
> > > * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
> > >   jbd2_trans_will_send_data_barrier
> > > 
> > > * BUG: sleeping function called from invalid context at mm/mempool.c:320
> > > 
> > > * WARNING: CPU: 0 PID: 0 at block/blk.h:297 
> > > generic_make_request_checks+0x670/0x750
> > > 
> > > ... I've included the full splats at the end of the mail.
> > > 
> > > These all happen in the context of the virtio block IRQ handler, so I
> > > wonder if this calls something that doesn't expect to be called from IRQ
> > > context. Is it valid to call blk_mq_complete_request() or
> > > blk_mq_end_request() from an IRQ handler?
> > 
> > No, it's likely a bug in the detection of whether IO completion should be
> > deferred to a workqueue or not. Does the attached patch fix the problem? I
> > don't see exactly this being triggered by syzkaller, but it's close enough :)
> > 
> > Honza
> 
> That seems to be it!
> 
> With the below patch applied, I can't trigger the bug after ~10 minutes,
> whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
> that running for a while just in case there's another part to the
> problem, but FWIW:
> 
> Tested-by: Mark Rutland 

Thanks for testing! Sent the patch to Jens for inclusion.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/6 v2] block: Fix races in bdev - gendisk handling

2018-02-26 Thread Jan Kara
On Mon 26-02-18 09:04:40, Jens Axboe wrote:
> On 2/26/18 5:01 AM, Jan Kara wrote:
> > Hello,
> > 
> > these patches fix races in the association of a block device inode with the
> > corresponding gendisk that happen when devices are frequently destroyed and
> > recreated. Generally, when such a race happens, it results in use-after-free
> > issues, block device page cache inconsistencies, or other problems. I have
> > verified that these patches fix use-after-free issues that could be
> > reproduced by frequent creation and destruction of a loop device. Hou Tao
> > has verified that the races reported by him in [1] related to the
> > gendisk-blkdev association were also fixed. Jens, can you please merge
> > these patches? Thanks!
> 
> What are you thinking in terms of target? I'd like to get it into 4.16,
> but I'd need to know if you are comfortable with that.

Yeah, 4.16 should be OK. The patches are quite conservative.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-17 Thread Jan Kara
t regular case tool
> but the last resort. If you are going to rely on the fsck tool then
> you can simply forget about using your hardware. Some file systems have no
> fsck tool at all. Some people really believe that a file system has to work
> without the support of an fsck tool. Even if a mature file system has a
> reliable fsck tool, the probability of recovering the file system is very
> low in the case of serious metadata corruption. So you are effectively
> suggesting a technique with which we will lose whole file system volumes
> on a regular basis without any hope of recovering the data. Even if the
> file system has snapshots then, again, we have no hope, because we can
> suffer a read error for operations on the snapshot as well.

I hope I have made clear above that this is not about a higher error rate
of persistent memory. As a side note, the XFS guys are working on automatic
background scrubbing and online filesystem checking. Not specifically for
persistent memory, but simply because with the growing size of filesystems
the likelihood of some problem somewhere is growing.
 
Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX

2017-01-17 Thread Jan Kara
On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases it doesn't need to keep
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.

Well, we still need the radix tree entries for locking. And you still need
to keep track of which file offsets are writeably mapped (which we
currently keep implicitly via dirty radix tree entries) so that you can
write-protect them if needed (during filesystem freezing, for reflink, ...).
So I think what is going to gain the most by far is simply to avoid doing
the writeback at all in such situations.

> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.

I guess this needs measurements. But it is worth a try.
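
For concreteness, the two store strategies being compared can be sketched
in userspace like this. This is a hedged sketch under stated assumptions:
x86-64 with CLFLUSHOPT, dst pointing into a DAX mapping, 8-byte-aligned
buffers, and len a multiple of the cache line size.

#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* today's DAX I/O style: non-temporal stores, fenced at the end */
static void write_nt(char *dst, const char *src, size_t len)
{
	for (size_t i = 0; i < len; i += 8)
		_mm_stream_si64((long long *)(dst + i),
				*(const long long *)(src + i));
	_mm_sfence();
}

/* proposed style: fast cached stores now... */
static void write_cached(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
}

/* ...with the line flushes deferred to fsync/msync time */
static void flush_on_sync(char *dst, size_t len)
{
	for (size_t i = 0; i < len; i += 64)
		_mm_clflushopt(dst + i);
	_mm_sfence();
}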

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Yup, I'm still for it.

> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

So from my POV I'd be OK with removing the warning but still the code is
new so there are clearly bugs lurking ;).

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?

If you do a partial msync(), then you have the problem that msync(0, x)
followed by msync(x, EOF) will not yield a clean file, which may surprise
somebody. So I'm slightly skeptical.
 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

I'm not convinced it is worth it now. Maybe later...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> I do mean that in the filesystem, for every IO, the badblocks will be
> checked. Currently, the pmem driver does this, and the hope is that the
> filesystem can do a better job at it. The driver unconditionally checks
> every IO for badblocks on the whole device. Depending on how the
> badblocks are represented in the filesystem, we might be able to quickly
> tell if a file/range has existing badblocks, and error out the IO
> accordingly.
> 
> At mount, the fs would read the existing badblocks on the block
> device, and build its own representation of them. Then during normal
> use, if the underlying badblocks change, the fs would get a notification
> that would allow it to also update its own representation.

So I believe we have to distinguish three cases so that we are on the same
page.

1) PMEM is exposed only via a block interface for legacy filesystems to
use. Here, all the bad blocks handling IMO must happen in the NVDIMM driver.
Looking from the outside, the IO either returns with EIO or succeeds. As a
result you cannot ever get rid of bad blocks handling in the NVDIMM driver.

2) PMEM is exposed for a DAX-aware filesystem. This seems to be what you
are mostly interested in. We could possibly do something more efficient
than what the NVDIMM driver does, however the complexity would be
relatively high and frankly I'm far from convinced this is really worth
it. If there were so many badblocks that this would matter, the HW has
IMHO bigger problems than performance.

3) PMEM filesystem - there things are even more difficult, as was already
noted elsewhere in the thread. But for now I'd like to leave those aside
so as not to complicate things too much.

Now my question: why do we bother with badblocks at all? In cases 1) and
2), if the platform can recover from an MCE, we can just always access
persistent memory using memcpy_mcsafe() and, if that fails, return -EIO.
Actually that seems to already happen, so we just need to make sure all
places handle the returned errors properly (e.g. fs/dax.c does not seem to)
and we are done. No need for a bad blocks list at all, no slowdown unless
we hit a bad cell, and in that case who cares about performance when the
data is gone...
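
In kernel terms, the read side of this could look roughly like the sketch
below. This is a hedged sketch around the x86 memcpy_mcsafe() helper of
that era, not actual driver code; the function name and parameters are
illustrative.

static ssize_t pmem_read_mcsafe(void *dst, const void *pmem_addr, size_t len)
{
	/* memcpy_mcsafe() returns non-zero if the copy hit poison */
	if (memcpy_mcsafe(dst, pmem_addr, len))
		return -EIO;
	return len;
}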

For platforms that cannot recover from an MCE - just buy better hardware ;).
Seriously, I doubt people can seriously use a machine that will
unavoidably randomly reboot (as there is always a risk you hit an error
that has not been uncovered by the background scrub). But maybe for big
cloud providers the cost savings may offset the inconvenience, I don't
know. But still, for that case, bad blocks handling in the NVDIMM code
like we do now looks good enough?

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Jan Kara
On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> Your note on the online repair does raise another tangentially related
> topic. Currently, if there are badblocks, writes via the bio submission
> path will clear the error (if the hardware is able to remap the bad
> locations). However, if the filesystem is mounted eith DAX, even
> non-mmap operations - read() and write() will go through the dax paths
> (dax_do_io()). We haven't found a good/agreeable way to perform
> error-clearing in this case. So currently, if a dax mounted filesystem
> has badblocks, the only way to clear those badblocks is to mount it
> without DAX, and overwrite/zero the bad locations. This is a pretty
> terrible user experience, and I'm hoping this can be solved in a better
> way.

Please remind me, what is the problem with DAX code doing necessary work to
clear the error when it gets EIO from memcpy on write?

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications

2017-01-18 Thread Jan Kara
On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > I would like to discuss un-addressable device memory in the context of
> > filesystem and block device. Specificaly how to handle write-back, read,
> > ... when a filesystem page is migrated to device memory that CPU can not
> > access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this is
> > worth discussing during summit see how people feels about such plan and
> > if they have better ideas.
> >
> >
> > I also like to join discussions on:
> >   - Peer-to-Peer DMAs between PCIe devices
> >   - CDM coherent device memory
> >   - PMEM
> >   - overall mm discussions
> 
> I would like to attend this discussion. I can talk about coherent device
> memory and how having HMM handle that will make it easy to have one
> interface for device driver. For Coherent device case we definitely need
> page cache migration support.

Aneesh, did you intend this as your request to attend? You posted it as a
reply to another email so it is not really clear. Note that each attend
request should be a separate email so that it does not get lost...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-19 Thread Jan Kara
On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >  wrote:
> > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> > > > Jan Kara  writes:
> > > > 
> > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> > > > > > Your note on the online repair does raise another tangentially
> > > > > > related
> > > > > > topic. Currently, if there are badblocks, writes via the bio
> > > > > > submission
> > > > > > path will clear the error (if the hardware is able to remap
> > > > > > the bad
> > > > > > locations). However, if the filesystem is mounted eith DAX,
> > > > > > even
> > > > > > non-mmap operations - read() and write() will go through the
> > > > > > dax paths
> > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> > > > > > perform
> > > > > > error-clearing in this case. So currently, if a dax mounted
> > > > > > filesystem
> > > > > > has badblocks, the only way to clear those badblocks is to
> > > > > > mount it
> > > > > > without DAX, and overwrite/zero the bad locations. This is a
> > > > > > pretty
> > > > > > terrible user experience, and I'm hoping this can be solved in
> > > > > > a better
> > > > > > way.
> > > > > 
> > > > > Please remind me, what is the problem with DAX code doing
> > > > > necessary work to
> > > > > clear the error when it gets EIO from memcpy on write?
> > > > 
> > > > You won't get an MCE for a store;  only loads generate them.
> > > > 
> > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> > > > -o dax?
> > > 
> > > Not necessarily; XFS usually implements this by punching out the
> > > range
> > > and then reallocating it as unwritten blocks.
> > > 
> > 
> > That does clear the error because the unwritten blocks are zeroed and
> > errors cleared when they become allocated again.
> 
> Yes, the problem was that writes won't clear errors. zeroing through
> either hole-punch, truncate, unlinking the file should all work
> (assuming the hole-punch or truncate ranges wholly contain the
> 'badblock' sector).

Let me repeat my question: you have mentioned that if we do IO through DAX,
writes won't clear errors and we should fall back to the normal block path
to do the write to clear the error. What prevents us from clearing the
error directly from the DAX path?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-20 Thread Jan Kara
On Thu 19-01-17 11:03:12, Dan Williams wrote:
> On Thu, Jan 19, 2017 at 10:59 AM, Vishal Verma  
> wrote:
> > On 01/19, Jan Kara wrote:
> >> On Wed 18-01-17 21:56:58, Verma, Vishal L wrote:
> >> > On Wed, 2017-01-18 at 13:32 -0800, Dan Williams wrote:
> >> > > On Wed, Jan 18, 2017 at 1:02 PM, Darrick J. Wong
> >> > >  wrote:
> >> > > > On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> >> > > > > Jan Kara  writes:
> >> > > > >
> >> > > > > > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> > > > > > > Your note on the online repair does raise another tangentially
> >> > > > > > > related
> >> > > > > > > topic. Currently, if there are badblocks, writes via the bio
> >> > > > > > > submission
> >> > > > > > > path will clear the error (if the hardware is able to remap
> >> > > > > > > the bad
> >> > > > > > > locations). However, if the filesystem is mounted eith DAX,
> >> > > > > > > even
> >> > > > > > > non-mmap operations - read() and write() will go through the
> >> > > > > > > dax paths
> >> > > > > > > (dax_do_io()). We haven't found a good/agreeable way to
> >> > > > > > > perform
> >> > > > > > > error-clearing in this case. So currently, if a dax mounted
> >> > > > > > > filesystem
> >> > > > > > > has badblocks, the only way to clear those badblocks is to
> >> > > > > > > mount it
> >> > > > > > > without DAX, and overwrite/zero the bad locations. This is a
> >> > > > > > > pretty
> >> > > > > > > terrible user experience, and I'm hoping this can be solved in
> >> > > > > > > a better
> >> > > > > > > way.
> >> > > > > >
> >> > > > > > Please remind me, what is the problem with DAX code doing
> >> > > > > > necessary work to
> >> > > > > > clear the error when it gets EIO from memcpy on write?
> >> > > > >
> >> > > > > You won't get an MCE for a store;  only loads generate them.
> >> > > > >
> >> > > > > Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with
> >> > > > > -o dax?
> >> > > >
> >> > > > Not necessarily; XFS usually implements this by punching out the
> >> > > > range
> >> > > > and then reallocating it as unwritten blocks.
> >> > > >
> >> > >
> >> > > That does clear the error because the unwritten blocks are zeroed and
> >> > > errors cleared when they become allocated again.
> >> >
> >> > Yes, the problem was that writes won't clear errors. zeroing through
> >> > either hole-punch, truncate, unlinking the file should all work
> >> > (assuming the hole-punch or truncate ranges wholly contain the
> >> > 'badblock' sector).
> >>
> >> Let me repeat my question: you have mentioned that if we do IO through DAX,
> >> writes won't clear errors and we should fall back to the normal block path
> >> to do the write to clear the error. What prevents us from clearing the
> >> error directly from the DAX path?
> >>
> > With DAX, all IO goes through DAX paths. There are two cases:
> > 1. mmap and loads/stores: Obviously there is no kernel intervention
> > here, and no badblocks handling is possible.
> > 2. read() or write() IO: In the absence of dax, this would go through
> > the bio submission path, through the pmem driver, and that would handle
> > error clearing. With DAX, this goes through dax_iomap_actor, which also
> > doesn't go through the pmem driver (it does a dax mapping, followed by
> > essentially memcpy), and hence cannot handle badblocks.
> 
> Hmm, that may no longer be true after my changes to push dax flushing
> to the driver. I.e. we could have a copy_from_iter() implementation
> that attempts to clear errors... I'll get that series out and we can
> discuss there.

Yeah, that was precisely my point - doing copy_from_iter() that clears
errors should be possible...
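
Such a driver-level copy_from_iter() could be sketched like this. Hedged:
pmem_range_has_badblocks() and pmem_clear_poison_range() are hypothetical
helpers standing in for badblocks lookup and poison clearing; the final
hook signature was not settled in this thread.

static size_t pmem_copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
	/* clear known-bad ranges before (over)writing them */
	if (unlikely(pmem_range_has_badblocks(addr, bytes)))	/* hypothetical */
		pmem_clear_poison_range(addr, bytes);		/* hypothetical */
	return copy_from_iter_nocache(addr, bytes, i);
}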

Honza
-- 
Jan Kara 
SUSE Labs, CR


  1   2   3   4   5   6   >