Re: [bug report] fs/block_dev.c: add bdev_read_page() and bdev_write_page()

2016-09-15 Thread Dan Carpenter
Let's try reporting this again to new email addresses...

Btw, belated thanks for creating a linux-block mailing list, Jens.  :)

regards,
dan carpenter

On Thu, Aug 04, 2016 at 05:02:06PM +0300, Dan Carpenter wrote:
> Hello Matthew Wilcox,
> 
> The patch 47a191fd38eb: "fs/block_dev.c: add bdev_read_page() and
> bdev_write_page()" from Jun 4, 2014, leads to the following static
> checker warning:
> 
>   include/linux/genhd.h:382 part_inc_in_flight()
>   warn: buffer overflow 'part->in_flight' 2 <= 72
> 
> include/linux/genhd.h
>380  static inline void part_inc_in_flight(struct hd_struct *part, int rw)
>381  {
>382  atomic_inc(&part->in_flight[rw]);
> ^^
> This should either be READ or WRITE.
> 
>383  if (part->partno)
>384  atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
>385  }
> 
> 
>444  int bdev_write_page(struct block_device *bdev, sector_t sector,
>445  struct page *page, struct writeback_control *wbc)
>446  {
>447  int result;
>448  int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
>^^
> But here we're setting it to WRITE_SYNC.
> 
>449  const struct block_device_operations *ops = bdev->bd_disk->fops;
>450  
>451  if (!ops->rw_page || bdev_get_integrity(bdev))
>452  return -EOPNOTSUPP;
>453  result = blk_queue_enter(bdev->bd_queue, false);
>454  if (result)
>455  return result;
>456  
>457  set_page_writeback(page);
>458  result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
>459  if (result)
>460  end_page_writeback(page);
>461  else
>462  unlock_page(page);
>463  blk_queue_exit(bdev->bd_queue);
>464  return result;
>465  }
> 
> regards,
> dan carpenter
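
To restate the warning in code: part->in_flight[] has only two slots,
indexed by data direction, while bdev_write_page() passes WRITE_SYNC,
which is WRITE with extra request flags OR'd in and therefore far out of
range for that array. A minimal sketch of one way to clamp the index
before it is used (purely illustrative of the constraint above, not
necessarily how this was resolved upstream):

	static inline void part_inc_in_flight(struct hd_struct *part, int rw)
	{
		/* Collapse WRITE_SYNC and friends to plain WRITE; anything
		 * without the WRITE bit counts as READ. */
		rw = (rw & WRITE) ? WRITE : READ;
		atomic_inc(&part->in_flight[rw]);
		if (part->partno)
			atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
	}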


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
Hi,

On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> I see some practical problems with this:
[...]

One more that I didn't think about earlier:

A while back, we spent quite some time defining the semantics of the
various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
write barriers. At the time, we decided that it would be unreasonable
to expect servers to make these write barriers effective across
different connections.

Since my knowledge of kernel internals is limited, I tried finding some
documentation on this, but I guess that either it doesn't exist or I'm
looking in the wrong place; therefore, am I correct in assuming that
blk-mq knows about such semantics, and will handle them correctly (by
either sending a write barrier to all queues, or not making assumptions
about write barriers that were sent over a different queue)? If not,
this may be something that needs to be taken care of.

Thanks,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Wouter, Josef, (& Eric)

> On 15 Sep 2016, at 11:49, Wouter Verhelst  wrote:
> 
> Hi,
> 
> On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
>> I see some practical problems with this:
> [...]
> 
> One more that I didn't think about earlier:
> 
> A while back, we spent quite some time defining the semantics of the
> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> write barriers. At the time, we decided that it would be unreasonable
> to expect servers to make these write barriers effective across
> different connections.

Actually I wonder if there is a wider problem in that implementations
might mediate access to a device by presence of an extant TCP connection,
i.e. only permit one TCP connection to access a given block device at
once. If you think about (for instance) a forking daemon that does
writeback caching, that would be an entirely reasonable thing to do
for data consistency.

I also wonder whether any servers that can do caching per
connection will always share a consistent cache between 
connections. The one I'm worried about in particular here
is qemu-nbd - Eric Blake CC'd.

A more general point is that with multiple queues requests
may be processed in a different order even by those servers that
currently process the requests in strict order, or in something
similar to strict order. The server is permitted by the spec
(save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
process commands out of order anyway, but I suspect this has
to date been little tested.

Lastly I confess to lack of familiarity with the kernel side
code, but how is NBD_CMD_DISCONNECT synchronised across
each of the connections? Presumably you need to send it on
each channel, but cannot assume the NBD connection as a whole
is dead until the last tcp connection has closed?

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> Wouter, Josef, (& Eric)
> 
> > On 15 Sep 2016, at 11:49, Wouter Verhelst  wrote:
> > 
> > Hi,
> > 
> > On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> >> I see some practical problems with this:
> > [...]
> > 
> > One more that I didn't think about earlier:
> > 
> > A while back, we spent quite some time defining the semantics of the
> > various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> > write barriers. At the time, we decided that it would be unreasonable
> > to expect servers to make these write barriers effective across
> > different connections.
> 
> Actually I wonder if there is a wider problem in that implementations
> might mediate access to a device by presence of an extant TCP connection,
> i.e. only permit one TCP connection to access a given block device at
> once. If you think about (for instance) a forking daemon that does
> writeback caching, that would be an entirely reasonable thing to do
> for data consistency.

Sure. They will have to live with the fact that clients connected to
them will run slower; I don't think that's a problem. In addition,
Josef's client implementation requires the user to explicitly ask for
multiple connections.

There are multiple contexts in which NBD can be used, and in some
performance is more important than in others. I think that is fine.

[...]
> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is permitted by the spec
> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
> process commands out of order anyway, but I suspect this has
> to date been little tested.

Yes, and that is why I was asking about this. If the write barriers
are expected to be shared across connections, we have a problem. If,
however, they are not, then it doesn't matter that the commands may be
processed out of order.

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is permitted by the spec
> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
> process commands out of order anyway, but I suspect this has
> to date been little tested.

The Linux kernel does not assume any synchronization between block I/O
commands.  So any sort of synchronization a protocol does is complete
overkill for us.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
> Yes, and that is why I was asking about this. If the write barriers
> are expected to be shared across connections, we have a problem. If,
> however, they are not, then it doesn't matter that the commands may be
> processed out of order.

There is no such thing as a write barrier in the Linux kernel.  We'd
much prefer protocols not to introduce any pointless synchronization
if we can avoid it.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Christoph,

> On 15 Sep 2016, at 12:38, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
>> A while back, we spent quite some time defining the semantics of the
>> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
>> write barriers. At the time, we decided that it would be unreasonable
>> to expect servers to make these write barriers effective across
>> different connections.
> 
> Do you have a nbd protocol specification?  treating a flush or fua
> as any sort of barrier is incredibly stupid.  Is it really documented
> that way, and if yes, why?

Sure, it's at:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

and that link takes you to the specific section.

The treatment of FLUSH and FUA is meant to mirror exactly the
linux block layer (or rather how the linux block layer was a few
years ago). I even asked on LKML to verify a few points.

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:40, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
>> Yes, and that is why I was asking about this. If the write barriers
>> are expected to be shared across connections, we have a problem. If,
>> however, they are not, then it doesn't matter that the commands may be
>> processed out of order.
> 
> There is no such thing as a write barrier in the Linux kernel.  We'd
> much prefer protocols not to introduce any pointless synchronization
> if we can avoid it.

I suspect the issue is terminological.

Essentially NBD supports FLUSH/FUA like this:

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).

Link to protocol (per last email) here:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
> Sure, it's at:
> 
> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> and that link takes you to the specific section.
> 
> The treatment of FLUSH and FUA is meant to mirror exactly the
> linux block layer (or rather how the linux block layer was a few
> years ago). I even asked on LKML to verify a few points.

Linux never expected ordering on the wire.  Before 2010 we had barriers
in the kernel that provided ordering to the caller, but we never
required it from the protocol / hardware.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
> Essentially NBD supports FLUSH/FUA like this:
> 
> https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> 
> IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).
> 
> Link to protocol (per last email) here:
> 
> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

Flush as defined by the Linux block layer (and supported that way in
SCSI, ATA, NVMe) only requires to flush all already completed writes
to non-volatile media.  It does not impose any ordering unlike the
nbd spec.

FUA as defined by the Linux block layer (and supported that way in SCSI,
ATA, NVMe) only requires the write operation the FUA bit is set on to be
on non-volatile media before completing the write operation.  It does
not impose any ordering, which seems to match the nbd spec.  Unlike the
NBD spec Linux does not allow FUA to be set on anything but WRITE
commands.  Some other storage protocols allow a FUA bit on READ
commands or other commands that write data to the device, though.
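
To put the flush side of that in code, here is a hedged sketch against
the in-kernel API of roughly that era (flag and helper names moved
around 4.8, e.g. REQ_FLUSH becoming REQ_PREFLUSH, so treat this as
illustrative rather than tied to any one release):

	#include <linux/blkdev.h>

	/*
	 * Completion of this call only guarantees that writes which had
	 * already completed before it was issued are on non-volatile
	 * media; it implies nothing about ordering of commands still in
	 * flight.  FUA, by contrast, is a per-request flag that only
	 * promises durability of the one write it is set on.
	 */
	static int make_completed_writes_durable(struct block_device *bdev)
	{
		return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
	}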


[PATCHv3 27/41] truncate: make invalidate_inode_pages2_range() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
For huge pages we need to unmap the whole range covered by the huge page.

Signed-off-by: Kirill A. Shutemov 
---
 mm/truncate.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 9c339e6255f2..6a445278aaaf 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -708,27 +708,34 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
continue;
}
wait_on_page_writeback(page);
+   page = compound_head(page);
+
if (page_mapped(page)) {
+   loff_t begin, len;
+
+   begin = page->index << PAGE_SHIFT;
+
if (!did_range_unmap) {
/*
 * Zap the rest of the file in one hit.
 */
+   len = (loff_t)(1 + end - page->index) <<
+   PAGE_SHIFT;
+   if (len < hpage_size(page))
+   len = hpage_size(page);
unmap_mapping_range(mapping,
-  (loff_t)index << PAGE_SHIFT,
-  (loff_t)(1 + end - index)
-<< PAGE_SHIFT,
-0);
+   begin, len, 0);
did_range_unmap = 1;
} else {
/*
 * Just zap this page
 */
-   unmap_mapping_range(mapping,
-  (loff_t)index << PAGE_SHIFT,
-  PAGE_SIZE, 0);
+   len = hpage_size(page);
+   unmap_mapping_range(mapping, begin,
+   len, 0 );
}
}
-   BUG_ON(page_mapped(page));
+   VM_BUG_ON_PAGE(page_mapped(page), page);
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
@@ -737,6 +744,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
if (ret2 < 0)
ret = ret2;
unlock_page(page);
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR - 1;
+   break;
+   }
}
pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
-- 
2.9.3



[PATCHv3 01/41] tools: Add WARN_ON_ONCE

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

The radix tree uses its own buggy WARN_ON_ONCE.  Replace it with the
definition from asm-generic/bug.h

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 tools/include/asm/bug.h  | 11 +++
 tools/testing/radix-tree/Makefile|  2 +-
 tools/testing/radix-tree/linux/bug.h |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/include/asm/bug.h b/tools/include/asm/bug.h
index 9e5f4846967f..beda1a884b50 100644
--- a/tools/include/asm/bug.h
+++ b/tools/include/asm/bug.h
@@ -12,6 +12,17 @@
unlikely(__ret_warn_on);\
 })
 
+#define WARN_ON_ONCE(condition) ({ \
+   static int __warned;\
+   int __ret_warn_once = !!(condition);\
+   \
+   if (unlikely(__ret_warn_once && !__warned)) {   \
+   __warned = true;\
+   WARN_ON(1); \
+   }   \
+   unlikely(__ret_warn_once);  \
+})
+
 #define WARN_ONCE(condition, format...)({  \
static int __warned;\
int __ret_warn_once = !!(condition);\
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 3b530467148e..20d8bb37017a 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -1,5 +1,5 @@
 
-CFLAGS += -I. -g -Wall -D_LGPL_SOURCE
+CFLAGS += -I. -I../../include -g -Wall -D_LGPL_SOURCE
 LDFLAGS += -lpthread -lurcu
 TARGETS = main
 OFILES = main.o radix-tree.o linux.o test.o tag_check.o find_next_bit.o \
diff --git a/tools/testing/radix-tree/linux/bug.h b/tools/testing/radix-tree/linux/bug.h
index ccbe444977df..23b8ed52f8c8 100644
--- a/tools/testing/radix-tree/linux/bug.h
+++ b/tools/testing/radix-tree/linux/bug.h
@@ -1 +1 @@
-#define WARN_ON_ONCE(x) assert(x)
+#include "asm/bug.h"
-- 
2.9.3



[PATCHv3 08/41] Revert "radix-tree: implement radix_tree_maybe_preload_order()"

2016-09-15 Thread Kirill A. Shutemov
This reverts commit 356e1c23292a4f63cfdf1daf0e0ddada51f32de8.

After conversion of huge tmpfs to multi-order entries, we don't need
this anymore.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/radix-tree.h |  1 -
 lib/radix-tree.c   | 74 --
 2 files changed, 75 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index c4cea311d901..7a271f662320 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -290,7 +290,6 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 57c15c4d0796..307e517b2479 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -38,9 +38,6 @@
 #include  /* in_interrupt() */
 
 
-/* Number of nodes in fully populated tree of given height */
-static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
-
 /*
  * Radix tree node cache.
  */
@@ -433,51 +430,6 @@ int radix_tree_split_preload(unsigned int old_order, unsigned int new_order,
 #endif
 
 /*
- * The same as function above, but preload number of nodes required to insert
- * (1 << order) continuous naturally-aligned elements.
- */
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
-{
-   unsigned long nr_subtrees;
-   int nr_nodes, subtree_height;
-
-   /* Preloading doesn't help anything with this gfp mask, skip it */
-   if (!gfpflags_allow_blocking(gfp_mask)) {
-   preempt_disable();
-   return 0;
-   }
-
-   /*
-* Calculate number and height of fully populated subtrees it takes to
-* store (1 << order) elements.
-*/
-   nr_subtrees = 1 << order;
-   for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
-   subtree_height++)
-   nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
-
-   /*
-* The worst case is zero height tree with a single item at index 0 and
-* then inserting items starting at ULONG_MAX - (1 << order).
-*
-* This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
-* 0-index item.
-*/
-   nr_nodes = RADIX_TREE_MAX_PATH;
-
-   /* Plus branch to fully populated subtrees. */
-   nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
-
-   /* Root node is shared. */
-   nr_nodes--;
-
-   /* Plus nodes required to build subtrees. */
-   nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
-
-   return __radix_tree_preload(gfp_mask, nr_nodes);
-}
-
-/*
  * The maximum index which can be stored in a radix tree
  */
 static inline unsigned long shift_maxindex(unsigned int shift)
@@ -1856,31 +1808,6 @@ radix_tree_node_ctor(void *arg)
INIT_LIST_HEAD(&node->private_list);
 }
 
-static __init unsigned long __maxindex(unsigned int height)
-{
-   unsigned int width = height * RADIX_TREE_MAP_SHIFT;
-   int shift = RADIX_TREE_INDEX_BITS - width;
-
-   if (shift < 0)
-   return ~0UL;
-   if (shift >= BITS_PER_LONG)
-   return 0UL;
-   return ~0UL >> shift;
-}
-
-static __init void radix_tree_init_maxnodes(void)
-{
-   unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1];
-   unsigned int i, j;
-
-   for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
-   height_to_maxindex[i] = __maxindex(i);
-   for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
-   for (j = i; j > 0; j--)
-   height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
-   }
-}
-
 static int radix_tree_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
 {
@@ -1907,6 +1834,5 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
-   radix_tree_init_maxnodes();
hotcpu_notifier(radix_tree_callback, 0);
 }
-- 
2.9.3



[PATCHv3 06/41] radix-tree: Handle multiorder entries being deleted by replace_clear_tags

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

radix_tree_replace_clear_tags() can be called with NULL as the replacement
value; in this case we need to delete sibling entries which point to
the slot.

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 lib/radix-tree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e58855435c01..57c15c4d0796 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1805,17 +1805,23 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 }
 EXPORT_SYMBOL(radix_tree_delete);
 
+/*
+ * If the caller passes NULL for @entry, it must take care to adjust
+ * node->count.  See page_cache_tree_delete() for an example.
+ */
 struct radix_tree_node *radix_tree_replace_clear_tags(
struct radix_tree_root *root,
unsigned long index, void *entry)
 {
struct radix_tree_node *node;
void **slot;
+   unsigned int offset;
 
__radix_tree_lookup(root, index, &node, &slot);
 
if (node) {
-   unsigned int tag, offset = get_slot_offset(node, slot);
+   unsigned int tag;
+   offset = get_slot_offset(node, slot);
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
node_tag_clear(root, node, tag, offset);
} else {
@@ -1824,6 +1830,9 @@ struct radix_tree_node *radix_tree_replace_clear_tags(
}
 
radix_tree_replace_slot(slot, entry);
+   if (!entry && node)
+   delete_sibling_entries(node, node_to_entry(slot), offset);
+
return node;
 }
 
-- 
2.9.3



[PATCHv3 05/41] radix-tree: Add radix_tree_split_preload()

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

Calculate how many nodes we need to allocate to split an old_order entry
into multiple entries, each of size new_order.  The test suite checks that
we allocated exactly the right number of nodes; neither too many (checked
by rtp->nr == 0), nor too few (checked by comparing nr_allocated before
and after the call to radix_tree_split()).
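
A worked example (assuming the usual RADIX_TREE_MAP_SHIFT of 6, i.e.
64-slot nodes): splitting one old_order = 9 entry into new_order = 0
entries gives top = 1 << (9 % 6) = 8 and layers = 9/6 - 0/6 = 1, so nr
ends up as 1 and we preload top * nr = 8 nodes, which is exactly the
eight leaf nodes needed to hold the 512 resulting order-0 slots.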

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 include/linux/radix-tree.h|  1 +
 lib/radix-tree.c  | 22 ++
 tools/testing/radix-tree/multiorder.c | 28 ++--
 tools/testing/radix-tree/test.h   |  9 +
 4 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 459e8a152c8a..c4cea311d901 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -318,6 +318,7 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_split_preload(unsigned old_order, unsigned new_order, gfp_t);
 int radix_tree_split(struct radix_tree_root *, unsigned long index,
unsigned new_order);
 int radix_tree_join(struct radix_tree_root *, unsigned long index,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index ad3116cbe61b..e58855435c01 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -410,6 +410,28 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+/*
+ * Preload with enough objects to ensure that we can split a single entry
+ * of order @old_order into many entries of size @new_order
+ */
+int radix_tree_split_preload(unsigned int old_order, unsigned int new_order,
+   gfp_t gfp_mask)
+{
+   unsigned top = 1 << (old_order % RADIX_TREE_MAP_SHIFT);
+   unsigned layers = (old_order / RADIX_TREE_MAP_SHIFT) -
+   (new_order / RADIX_TREE_MAP_SHIFT);
+   unsigned nr = 0;
+
+   WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
+   BUG_ON(new_order >= old_order);
+
+   while (layers--)
+   nr = nr * RADIX_TREE_MAP_SIZE + 1;
+   return __radix_tree_preload(gfp_mask, top * nr);
+}
+#endif
+
 /*
  * The same as function above, but preload number of nodes required to insert
  * (1 << order) continuous naturally-aligned elements.
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 9d27a4dd7b2a..5eda47dfe818 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -348,18 +348,42 @@ static void multiorder_join(void)
}
 }
 
+static void check_mem(unsigned old_order, unsigned new_order, unsigned alloc)
+{
+   struct radix_tree_preload *rtp = &radix_tree_preloads;
+   if (rtp->nr != 0)
+   printf("split(%u %u) remaining %u\n", old_order, new_order,
+   rtp->nr);
+   /*
+* Can't check for equality here as some nodes may have been
+* RCU-freed while we ran.  But we should never finish with more
+* nodes allocated since they should have all been preloaded.
+*/
+   if (nr_allocated > alloc)
+   printf("split(%u %u) allocated %u %u\n", old_order, new_order,
+   alloc, nr_allocated);
+}
+
 static void __multiorder_split(int old_order, int new_order)
 {
-   RADIX_TREE(tree, GFP_KERNEL);
+   RADIX_TREE(tree, GFP_ATOMIC);
void **slot;
struct radix_tree_iter iter;
+   unsigned alloc;
 
-   item_insert_order(&tree, 0, old_order);
+   radix_tree_preload(GFP_KERNEL);
+   assert(item_insert_order(&tree, 0, old_order) == 0);
+   radix_tree_callback(NULL, CPU_DEAD, NULL);
radix_tree_tag_set(&tree, 0, 2);
+
+   radix_tree_split_preload(old_order, new_order, GFP_KERNEL);
+   alloc = nr_allocated;
radix_tree_split(&tree, 0, new_order);
+   check_mem(old_order, new_order, alloc);
radix_tree_for_each_slot(slot, &tree, &iter, 0) {
radix_tree_replace_slot(slot, item_create(iter.index));
}
+   radix_tree_preload_end();
 
item_kill_tree(&tree);
 }
diff --git a/tools/testing/radix-tree/test.h b/tools/testing/radix-tree/test.h
index e85131369723..55e6d095047b 100644
--- a/tools/testing/radix-tree/test.h
+++ b/tools/testing/radix-tree/test.h
@@ -2,6 +2,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 struct item {
unsigned long index;
@@ -43,3 +45,10 @@ void radix_tree_dump(struct radix_tree_root *root);
 int root_tag_get(struct radix_tree_root *root, unsigned int tag);
 unsigned long node_maxindex(struct radix_tree_node *);
 unsigned long shift_maxindex(unsigned int shift);
+int radix_tree_callback(struct notifier_block *nfb,
+   unsigned lon

[PATCHv3 16/41] filemap: allocate huge page in pagecache_get_page(), if allowed

2016-09-15 Thread Kirill A. Shutemov
The write path allocates pages using pagecache_get_page(). We should be
able to allocate huge pages there, if allowed. As usual, fall back to
small pages if that fails.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b77bcf6843ee..05b42d3e5ed8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1310,13 +1310,16 @@ repeat:
 
 no_page:
if (!page && (fgp_flags & FGP_CREAT)) {
+   pgoff_t hoffset;
int err;
if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
gfp_mask &= ~__GFP_FS;
 
-   page = __page_cache_alloc(gfp_mask);
+   page = page_cache_alloc_huge(mapping, offset, gfp_mask);
+no_huge:   if (!page)
+   page = __page_cache_alloc(gfp_mask);
if (!page)
return NULL;
 
@@ -1327,14 +1330,25 @@ no_page:
if (fgp_flags & FGP_ACCESSED)
__SetPageReferenced(page);
 
-   err = add_to_page_cache_lru(page, mapping, offset,
+   if (PageTransHuge(page))
+   hoffset = round_down(offset, HPAGE_PMD_NR);
+   else
+   hoffset = offset;
+
+   err = add_to_page_cache_lru(page, mapping, hoffset,
gfp_mask & GFP_RECLAIM_MASK);
if (unlikely(err)) {
+   if (PageTransHuge(page)) {
+   put_page(page);
+   page = NULL;
+   goto no_huge;
+   }
put_page(page);
page = NULL;
if (err == -EEXIST)
goto repeat;
}
+   page += offset - hoffset;
}
 
return page;
-- 
2.9.3



[PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

2016-09-15 Thread Kirill A. Shutemov
invalidate_inode_page() has an expectation about the page_count() of the
page -- if it's not 2 (one for the caller, one for the radix-tree), the
page will not be dropped. That condition is almost never met for THPs --
tail pages are pinned to the pagevec.

Let's drop those references before calling invalidate_inode_page().

Signed-off-by: Kirill A. Shutemov 
---
 mm/truncate.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..ce904e4b1708 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
/* 'end' is in the middle of THP */
if (index ==  round_down(end, HPAGE_PMD_NR))
continue;
+   /*
+* invalidate_inode_page() expects
+* page_count(page) == 2 to drop page from page
+* cache -- drop tail pages references.
+*/
+   get_page(page);
+   pagevec_release(&pvec);
}
 
ret = invalidate_inode_page(page);
unlock_page(page);
+
+   if (PageTransHuge(page))
+   put_page(page);
+
/*
 * Invalidation is a hint that the page is no longer
 * of interest and try to speed up its reclaim.
-- 
2.9.3



[PATCHv3 22/41] thp: do not treat slab pages as huge in hpage_{nr_pages,size,mask}

2016-09-15 Thread Kirill A. Shutemov
Slab pages can be compound, but we shouldn't treat them as THP for the
purpose of the hpage_* helpers; otherwise it would lead to confusing
results.

For instance, ext4 uses slab pages for journal pages and we shouldn't
confuse them with THPs. The easiest way is to exclude them in the
hpage_* helpers.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de2789b4402c..5c5466ba37df 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -133,21 +133,21 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 }
 static inline int hpage_nr_pages(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_NR;
return 1;
 }
 
 static inline int hpage_size(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_SIZE;
return PAGE_SIZE;
 }
 
 static inline unsigned long hpage_mask(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_MASK;
return PAGE_MASK;
 }
-- 
2.9.3



[PATCHv3 25/41] fs: make block_page_mkwrite() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
Adjust the check on whether part of the page is beyond the file size, and
apply compound_head() and page_mapping() where appropriate.

Signed-off-by: Kirill A. Shutemov 
---
 fs/buffer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7f50e5a63670..e53808e790e2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2502,7 +2502,7 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 get_block_t get_block)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
struct inode *inode = file_inode(vma->vm_file);
unsigned long end;
loff_t size;
@@ -2510,7 +2510,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
lock_page(page);
size = i_size_read(inode);
-   if ((page->mapping != inode->i_mapping) ||
+   if ((page_mapping(page) != inode->i_mapping) ||
(page_offset(page) > size)) {
/* We overload EFAULT to mean page got truncated */
ret = -EFAULT;
@@ -2518,10 +2518,10 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
}
 
/* page is wholly or partially inside EOF */
-   if (((page->index + 1) << PAGE_SHIFT) > size)
-   end = size & ~PAGE_MASK;
+   if (((page->index + hpage_nr_pages(page)) << PAGE_SHIFT) > size)
+   end = size & ~hpage_mask(page);
else
-   end = PAGE_SIZE;
+   end = hpage_size(page);
 
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
-- 
2.9.3



Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:46, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
>> Sure, it's at:
>> 
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
>> 
>> and that link takes you to the specific section.
>> 
>> The treatment of FLUSH and FUA is meant to mirror exactly the
>> linux block layer (or rather how the linux block layer was a few
>> years ago). I even asked on LKML to verify a few points.
> 
> Linux never expected ordering on the wire.  Before 2010 we had barriers
> in the kernel that provided ordering to the caller, but we never
> required it from the protocol / hardware.

Sure. And I think the doc section reflects exactly Linux's post-2010
expectations (note the link to the kernel documentation). IE
servers are not required to order reads/writes (save that all
writes that the server completes prior to replying to an NBD_CMD_FLUSH
must be persisted prior to the reply to that NBD_CMD_FLUSH, which I
believe to be exactly the same behaviour as the kernel dealing with an
empty bio with REQ_FLUSH set).

Perhaps the section should be called "No ordering of messages
and writes"!

My point was that *in practice* disordering is not well tested
as *in practice* many server implementations do in fact process
each command in order, though that's changed recently.
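
For concreteness, a toy sketch of the NBD_CMD_FLUSH obligation described
above, for a hypothetical single-threaded server backed by a plain file
descriptor; the reply helpers are invented for illustration and do not
come from any real server:

	#include <errno.h>
	#include <stdint.h>
	#include <unistd.h>

	/* Reply to NBD_CMD_FLUSH only once every write already acknowledged
	 * to the client has reached stable storage. */
	static int handle_flush(int export_fd, uint64_t cookie)
	{
		if (fdatasync(export_fd) < 0)
			return send_error_reply(cookie, errno); /* invented helper */
		return send_ok_reply(cookie);               /* invented helper */
	}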

-- 
Alex Bligh






[PATCHv3 12/41] thp: handle write-protection faults for file THP

2016-09-15 Thread Kirill A. Shutemov
For filesystems that want to be write-notified (have mkwrite), we will
encounter write-protection faults for huge PMDs in shared mappings.

The easiest way to handle them is to clear the PMD and let it refault as
writable.

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 83be99d9d8a1..aad8d5c6311f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3451,8 +3451,17 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
fe->flags);
 
+   if (fe->vma->vm_flags & VM_SHARED) {
+   /* Clear PMD */
+   zap_page_range_single(fe->vma, fe->address,
+   HPAGE_PMD_SIZE, NULL);
+   VM_BUG_ON(!pmd_none(*fe->pmd));
+
+   /* Refault to establish writable PMD */
+   return 0;
+   }
+
/* COW handled on pte level: split pmd */
-   VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
split_huge_pmd(fe->vma, fe->pmd, fe->address);
 
return VM_FAULT_FALLBACK;
-- 
2.9.3



[PATCHv3 34/41] ext4: handle huge pages in ext4_da_write_end()

2016-09-15 Thread Kirill A. Shutemov
Call ext4_da_should_update_i_disksize() for the head page, with the
offset relative to the head page.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31560b52eff2..deacd3499ec7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3006,7 +3006,6 @@ static int ext4_da_write_end(struct file *file,
int ret = 0, ret2;
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
-   unsigned long start, end;
int write_mode = (int)(unsigned long)fsdata;
 
if (write_mode == FALL_BACK_TO_NONDELALLOC)
@@ -3014,8 +3013,6 @@ static int ext4_da_write_end(struct file *file,
  len, copied, page, fsdata);
 
trace_ext4_da_write_end(inode, pos, len, copied);
-   start = pos & (PAGE_SIZE - 1);
-   end = start + copied - 1;
 
/*
 * generic_write_end() will run mark_inode_dirty() if i_size
@@ -3024,8 +3021,10 @@ static int ext4_da_write_end(struct file *file,
 */
new_i_size = pos + copied;
if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
+   struct page *head = compound_head(page);
+   unsigned long end = (pos & ~hpage_mask(head)) + copied - 1;
if (ext4_has_inline_data(inode) ||
-   ext4_da_should_update_i_disksize(page, end)) {
+   ext4_da_should_update_i_disksize(head, end)) {
ext4_update_i_disksize(inode, new_i_size);
/* We need to mark inode dirty even if
 * new_i_size is less that inode->i_size
-- 
2.9.3



[PATCHv3 41/41] ext4, vfs: add huge= mount option

2016-09-15 Thread Kirill A. Shutemov
The same four values as in the tmpfs case.

The encryption code is not yet ready to handle huge pages, so we disable
huge page support if the inode has EXT4_INODE_ENCRYPT.
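
For example, assuming the option parsing added below is in place, the
behaviour would be selected at mount time with something like
"mount -t ext4 -o huge=within_size /dev/sdb1 /mnt" (device and mount
point invented for illustration); regular, unencrypted inodes on that
filesystem then get the corresponding S_HUGE_* flag via
ext4_set_inode_flags().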

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/ext4.h  |  5 +
 fs/ext4/inode.c | 26 +-
 fs/ext4/super.c | 26 ++
 3 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ea31931386ec..feece2d1f646 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1123,6 +1123,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK  0x40 /* Enable support for dio read nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM 0x80 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x100 /* Journal Async Commit */
+#define EXT4_MOUNT_HUGE_MODE   0x600 /* Huge support mode: */
+#define EXT4_MOUNT_HUGE_NEVER  0x000
+#define EXT4_MOUNT_HUGE_ALWAYS 0x200
+#define EXT4_MOUNT_HUGE_WITHIN_SIZE 0x400
+#define EXT4_MOUNT_HUGE_ADVISE 0x600
 #define EXT4_MOUNT_DELALLOC 0x800 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT  0x1000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY  0x2000 /* Block validity checking */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ff995083856b..fd243e9f7c29 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4369,7 +4369,7 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 void ext4_set_inode_flags(struct inode *inode)
 {
unsigned int flags = EXT4_I(inode)->i_flags;
-   unsigned int new_fl = 0;
+   unsigned int mask, new_fl = 0;
 
if (flags & EXT4_SYNC_FL)
new_fl |= S_SYNC;
@@ -4381,10 +4381,26 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
-   new_fl |= S_DAX;
-   inode_set_flags(inode, new_fl,
-   S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
+   if (S_ISREG(inode->i_mode) && !ext4_encrypted_inode(inode)) {
+   if (test_opt(inode->i_sb, DAX))
+   new_fl |= S_DAX;
+   switch (test_opt(inode->i_sb, HUGE_MODE)) {
+   case EXT4_MOUNT_HUGE_NEVER:
+   break;
+   case EXT4_MOUNT_HUGE_ALWAYS:
+   new_fl |= S_HUGE_ALWAYS;
+   break;
+   case EXT4_MOUNT_HUGE_WITHIN_SIZE:
+   new_fl |= S_HUGE_WITHIN_SIZE;
+   break;
+   case EXT4_MOUNT_HUGE_ADVISE:
+   new_fl |= S_HUGE_ADVISE;
+   break;
+   }
+   }
+   mask = S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+   S_DIRSYNC | S_DAX | S_HUGE_MODE;
+   inode_set_flags(inode, new_fl, mask);
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3ec8708989ca..392c30925052 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1123,6 +1123,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
ext4_clear_inode_state(inode,
EXT4_STATE_MAY_INLINE_DATA);
+   ext4_set_inode_flags(inode);
}
return res;
}
@@ -1137,6 +1138,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
len, 0);
if (!res) {
ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
+   ext4_set_inode_flags(inode);
res = ext4_mark_inode_dirty(handle, inode);
if (res)
EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
@@ -1275,6 +1277,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum,
+   Opt_huge_never, Opt_huge_always, Opt_huge_within_size, Opt_huge_advise,
 };
 
 static const match_table_t tokens = {
@@ -1354,6 +1357,10 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
+   {Opt_huge_never, "huge=never"},
+   {Opt_huge_always, "huge=always"},
+   {Opt_huge_within_size, "huge=within_size"},
+   {Opt_huge_advise, "huge=advise"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_removed, "check=none"},/* mount option from ext2/3 */
{Opt_removed, "nocheck"},   /* mount option from ext2/3 */
@@ -1472,6

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 04:52:17AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
> > Essentially NBD supports FLUSH/FUA like this:
> > 
> > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> > 
> > IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).
> > 
> > Link to protocol (per last email) here:
> > 
> > https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> Flush as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires to flush all already completed writes
> to non-volatile media.

That is precisely what FLUSH in nbd does, too.

> It does not impose any ordering unlike the nbd spec.

If you read the spec differently, then that's a bug in the spec. Can you
clarify which part of it caused that confusion? We should fix it, then.

> FUA as defined by the Linux block layer (and supported that way in SCSI,
> ATA, NVMe) only requires the write operation the FUA bit is set on to be
> on non-volatile media before completing the write operation.  It does
> not impose any ordering, which seems to match the nbd spec.  Unlike the
> NBD spec Linux does not allow FUA to be set on anything but WRITE
> commands.  Some other storage protocols allow a FUA bit on READ
> commands or other commands that write data to the device, though.

Yes. There was some discussion on that part, and we decided that setting
the flag doesn't hurt, but the spec also clarifies that using it on READ
does nothing, semantically.

The problem is that there are clients in the wild which do set it on
READ, so it's just a matter of "be liberal in what you accept".

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


[PATCHv3 39/41] ext4: make fallocate() operations work with huge pages

2016-09-15 Thread Kirill A. Shutemov
__ext4_block_zero_page_range() is adjusted to calculate the starting
iblock correctly for huge pages.

ext4_{collapse,insert}_range() requires page cache invalidation. We need
the invalidation to be aligned to the huge page boundary if huge pages
are possible in the page cache.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/extents.c | 10 --
 fs/ext4/inode.c   |  3 +--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d7ccb7f51dfc..d46aeda70fb0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5525,7 +5525,10 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
 */
-   ioffset = round_down(offset, PAGE_SIZE);
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+   ioffset = round_down(offset, HPAGE_PMD_SIZE);
+   else
+   ioffset = round_down(offset, PAGE_SIZE);
/*
 * Write tail of the last page before removed range since it will get
 * removed from the page cache below.
@@ -5674,7 +5677,10 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 * Need to round down to align start offset to page size boundary
 * for page size > block size.
 */
-   ioffset = round_down(offset, PAGE_SIZE);
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+   ioffset = round_down(offset, HPAGE_PMD_SIZE);
+   else
+   ioffset = round_down(offset, PAGE_SIZE);
/* Write out all dirty pages */
ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
LLONG_MAX);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2e34e340e65..645a984a15ef 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3711,7 +3711,6 @@ void ext4_set_aops(struct inode *inode)
 static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
 {
-   ext4_fsblk_t index = from >> PAGE_SHIFT;
unsigned offset;
unsigned blocksize, pos;
ext4_lblk_t iblock;
@@ -3730,7 +3729,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 
blocksize = inode->i_sb->s_blocksize;
 
-   iblock = index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
+   iblock = page->index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
 
if (!page_has_buffers(page))
create_empty_buffers(page, blocksize, 0);
-- 
2.9.3



[PATCHv3 20/41] mm: make write_cache_pages() work on huge pages

2016-09-15 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Let's adjust the iteration accordingly.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/page-writeback.c | 17 -
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..44e55d1c8e41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1054,6 +1054,7 @@ struct address_space *page_file_mapping(struct page *page)
  */
 static inline pgoff_t page_index(struct page *page)
 {
+   page = compound_head(page);
if (unlikely(PageSwapCache(page)))
return page_private(page);
return page->index;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a84f11a672f0..4d6e9aec2d1f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -518,6 +518,7 @@ static inline void wait_on_page_locked(struct page *page)
  */
 static inline void wait_on_page_writeback(struct page *page)
 {
+   page = compound_head(page);
if (PageWriteback(page))
wait_on_page_bit(page, PG_writeback);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f4cd7d8005c9..6390c9488e29 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2242,7 +2242,7 @@ retry:
 * mapping. However, page->index will not change
 * because we have a reference on the page.
 */
-   if (page->index > end) {
+   if (page_to_pgoff(page) > end) {
/*
 * can't be range_cyclic (1st pass) because
 * end == -1 in that case.
@@ -2251,7 +2251,12 @@ retry:
break;
}
 
-   done_index = page->index;
+   done_index = page_to_pgoff(page);
+   if (PageTransCompound(page)) {
+   index = round_up(index + 1, HPAGE_PMD_NR);
+   i += HPAGE_PMD_NR -
+   done_index % HPAGE_PMD_NR - 1;
+   }
 
lock_page(page);
 
@@ -2263,7 +2268,7 @@ retry:
 * even if there is now a new, dirty page at the same
 * pagecache address.
 */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
 continue_unlock:
unlock_page(page);
continue;
@@ -2301,7 +2306,8 @@ continue_unlock:
 * not be suitable for data integrity
 * writeout).
 */
-   done_index = page->index + 1;
+   done_index = compound_head(page)->index
+   + hpage_nr_pages(page);
done = 1;
break;
}
@@ -2313,7 +2319,8 @@ continue_unlock:
 * keep going until we have written all the pages
 * we tagged for writeback prior to entering this loop.
 */
-   if (--wbc->nr_to_write <= 0 &&
+   wbc->nr_to_write -= hpage_nr_pages(page);
+   if (wbc->nr_to_write <= 0 &&
wbc->sync_mode == WB_SYNC_NONE) {
done = 1;
break;
-- 
2.9.3



[PATCHv3 38/41] ext4: fix SEEK_DATA/SEEK_HOLE for huge pages

2016-09-15 Thread Kirill A. Shutemov
ext4_find_unwritten_pgoff() needs a few tweaks to work with huge pages.
Mostly trivial page_mapping()/page_to_pgoff() conversions, plus an
adjustment to how we find the relevant block.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 261ac3734c58..2c3d6bb0edfe 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -473,7 +473,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 * range, it will be a hole.
 */
if (lastoff < endoff && whence == SEEK_HOLE &&
-   page->index > end) {
+   page_to_pgoff(page) > end) {
found = 1;
*offset = lastoff;
goto out;
@@ -481,7 +481,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 
lock_page(page);
 
-   if (unlikely(page->mapping != inode->i_mapping)) {
+   if (unlikely(page_mapping(page) != inode->i_mapping)) {
unlock_page(page);
continue;
}
@@ -492,8 +492,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
}
 
if (page_has_buffers(page)) {
+   int diff;
lastoff = page_offset(page);
bh = head = page_buffers(page);
+   diff = (page - compound_head(page)) << inode->i_blkbits;
+   while (diff--)
+   bh = bh->b_this_page;
do {
if (buffer_uptodate(bh) ||
buffer_unwritten(bh)) {
@@ -514,8 +518,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
} while (bh != head);
}
 
-   lastoff = page_offset(page) + PAGE_SIZE;
+   lastoff = page_offset(page) + hpage_size(page);
unlock_page(page);
+   if (PageTransCompound(page)) {
+   i++;
+   break;
+   }
}
 
/*
@@ -528,7 +536,9 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
break;
}
 
-   index = pvec.pages[i - 1]->index + 1;
+   index = page_to_pgoff(pvec.pages[i - 1]) + 1;
+   if (PageTransCompound(pvec.pages[i - 1]))
+   index = round_up(index, HPAGE_PMD_NR);
pagevec_release(&pvec);
} while (index <= end);
 
-- 
2.9.3



[PATCHv3 31/41] ext4: handle huge pages in ext4_page_mkwrite()

2016-09-15 Thread Kirill A. Shutemov
Trivial: remove assumption on page size.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c9c05840f6cc..9ed715ed00e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5644,7 +5644,7 @@ static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
 
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
loff_t size;
unsigned long len;
int ret;
@@ -5680,10 +5680,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
goto out;
}
 
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
/*
 * Return if we have all the buffers mapped. This avoids the need to do
 * journal_start/journal_stop which can block and take a long time
@@ -5714,7 +5714,8 @@ retry_alloc:
ret = block_page_mkwrite(vma, vmf, get_block);
if (!ret && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
- PAGE_SIZE, NULL, do_journal_get_write_access)) {
+ hpage_size(page), NULL,
+ do_journal_get_write_access)) {
unlock_page(page);
ret = VM_FAULT_SIGBUS;
ext4_journal_stop(handle);
-- 
2.9.3



[PATCHv3 40/41] mm, fs, ext4: expand use of page_mapping() and page_to_pgoff()

2016-09-15 Thread Kirill A. Shutemov
With huge pages in the page cache we see tail pages in more code paths.
This patch replaces direct access to struct page fields with macros
which can handle tail pages properly.

Signed-off-by: Kirill A. Shutemov 
---
 fs/buffer.c |  2 +-
 fs/ext4/inode.c |  4 ++--
 mm/filemap.c| 26 ++
 mm/memory.c |  4 ++--
 mm/page-writeback.c |  2 +-
 mm/truncate.c   |  5 +++--
 6 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 20898b051044..56323862dad3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -630,7 +630,7 @@ static void __set_page_dirty(struct page *page, struct address_space *mapping,
unsigned long flags;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
-   if (page->mapping) {/* Race with truncate? */
+   if (page_mapping(page)) {   /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 645a984a15ef..ff995083856b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1223,7 +1223,7 @@ retry_journal:
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
@@ -2961,7 +2961,7 @@ retry_journal:
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
diff --git a/mm/filemap.c b/mm/filemap.c
index 61d375d7f93e..c69b1204744a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -369,7 +369,7 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
struct page *page = pvec.pages[i];
 
/* until radix tree lookup accepts end_index */
-   if (page->index > end)
+   if (page_to_pgoff(page) > end)
continue;
 
page = compound_head(page);
@@ -1307,12 +1307,12 @@ repeat:
}
 
/* Has the page been truncated? */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto repeat;
}
-   VM_BUG_ON_PAGE(page->index != offset, page);
+   VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
}
 
if (page && (fgp_flags & FGP_ACCESSED))
@@ -1606,7 +1606,8 @@ repeat:
 * otherwise we can get both false positives and false
 * negatives, which is just confusing to the caller.
 */
-   if (page->mapping == NULL || page_to_pgoff(page) != index) {
+   if (page_mapping(page) == NULL ||
+   page_to_pgoff(page) != index) {
put_page(page);
break;
}
@@ -1907,7 +1908,7 @@ find_page:
if (!trylock_page(page))
goto page_not_up_to_date;
/* Did it get truncated before we got the lock? */
-   if (!page->mapping)
+   if (!page_mapping(page))
goto page_not_up_to_date_locked;
if (!mapping->a_ops->is_partially_uptodate(page,
offset, iter->count))
@@ -1987,7 +1988,7 @@ page_not_up_to_date:
 
 page_not_up_to_date_locked:
/* Did it get truncated before we got the lock? */
-   if (!page->mapping) {
+   if (!page_mapping(page)) {
unlock_page(page);
put_page(page);
continue;
@@ -2023,7 +2024,7 @@ readpage:
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
-   if (page->mapping == NULL) {
+   if (page_mapping(page) == NULL) {
/*
 * invalidate_mapping_pages got it
 */
@@ -2324,12 +2325,12 @@ retry_find:
}
 
/* Did it get truncated? */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto retry_find;
}
-   VM_BUG_ON_PAGE(page->index != of

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 04:38:07AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
> > A while back, we spent quite some time defining the semantics of the
> > various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> > write barriers. At the time, we decided that it would be unreasonable
> > to expect servers to make these write barriers effective across
> > different connections.
> 
> Do you have a nbd protocol specification?

https://github.com/yoe/nbd/blob/master/doc/proto.md

> treating a flush or fua as any sort of barrier is incredibly stupid.

Maybe I'm not using the correct terminology here. The point is that
after a FLUSH, the server asserts that all write commands *for which a
reply has already been sent to the client* will also have reached
permanent storage. Nothing is asserted about commands for which the
reply has not yet been sent, regardless of whether they were sent to the
server before or after the FLUSH; they may reach permanent storage as a
result of the FLUSH, or they may not, we don't say.

With FUA, we only assert that the FUA-flagged command reaches permanent
storage before its reply is sent, nothing else.
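
To make that concrete, here is a minimal sketch of how a userspace
server backed by a plain file might honour a FUA-flagged write, using
nothing but standard POSIX calls (the helper itself is made up for
illustration and is not part of any real server):

#include <sys/types.h>
#include <unistd.h>

/* Illustration only: FUA-flagged write handling in a userspace NBD
 * server backed by a plain file. */
static int handle_fua_write(int fd, const void *buf, size_t len, off_t off)
{
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return -1;              /* short write treated as error */
        if (fdatasync(fd) < 0)          /* data durable before we reply */
                return -1;
        return 0;                       /* caller may now send the reply */
}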

If that's not a write barrier, then I was using the wrong terminology
(and offer my apologies for the confusion).

The point still remains that "X was sent before Y" is difficult to
determine on the client side if X was sent over a different TCP channel
than Y, because a packet might be dropped (requiring a retransmit) for
X, and similar things. If blk-mq can deal with that, we're good and
nothing further needs to be done. If not, this should be evaluated by
someone more familiar with the internals of the kernel block subsystem
than me.

> Is it really documented that way, and if yes, why?

The documentation does use the term write barrier, but only right before
the same explanation as above. It does so because I assumed that that is
what a write barrier is, and that this would make things easier to
understand. If you tell me that that term is wrong as used there, I can
easily remove it (it's not critical to the rest of the documentation).

Regards,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 01:55:14PM +0200, Wouter Verhelst wrote:
> Maybe I'm not using the correct terminology here. The point is that
> after a FLUSH, the server asserts that all write commands *for which a
> reply has already been sent to the client* will also have reached
> permanent storage. Nothing is asserted about commands for which the
> reply has not yet been sent, regardless of whether they were sent to the
> server before or after the FLUSH; they may reach permanent storage as a
> result of the FLUSH, or they may not, we don't say.
> 
> With FUA, we only assert that the FUA-flagged command reaches permanent
> storage before its reply is sent, nothing else.

Yes, that's the correct semantics.

> If that's not a write barrier, then I was using the wrong terminology
> (and offer my apologies for the confusion).

It's not a write barrier - a write barrier was a command that ensured that

 a) all previous writes were completed to the host/client
 b) all previous writes were on non-volatile storage

and

 c) the actual write with the barrier bit was on non-volatile storage

> The point still remains that "X was sent before Y" is difficult to
> determine on the client side if X was sent over a different TCP channel
> than Y, because a packet might be dropped (requiring a retransmit) for
> X, and similar things. If blk-mq can deal with that, we're good and
> nothing further needs to be done. If not, this should be evaluated by
> someone more familiar with the internals of the kernel block subsystem
> than me.

The important bit in all the existing protocols, and which Linux relies
on, is that any write the Linux block layer got a completion for earlier
needs to be flushed out to non-volatile storage when a FLUSH command is
sent.  Anything that is still in flight does not matter.  Which for
NBD means anything that you have already replied to needs to be flushed.

Or to say it more practically - in the nbd server you simply need to
call fdatasync on the backing device or file whenever you get a FLUSH
request, and it will do the right thing.
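
For a file-backed export that boils down to something like the sketch
below (plain POSIX calls; the helper name is made up):

#include <unistd.h>

/* Illustration only: NBD_CMD_FLUSH handling in a userspace server whose
 * export is backed by a file or block device open as 'fd'.  Every write
 * already replied to went through this fd, so fdatasync() pushes all of
 * them to non-volatile storage before the flush reply goes out. */
static int handle_flush(int fd)
{
        return fdatasync(fd);
}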
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 14/41] filemap: allocate huge page in page_cache_read(), if allowed

2016-09-15 Thread Kirill A. Shutemov
This patch adds basic functionality to put a huge page into the page cache.

At the moment we only put huge pages into the radix-tree if the range covered
by the huge page is empty.

We ignore shadow entries for now, just remove them from the tree before
inserting a huge page.

Later we can add logic to accumulate information from shadow entries to
return to the caller (average eviction time?).

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/fs.h  |   5 ++
 include/linux/pagemap.h |  21 ++-
 mm/filemap.c| 148 +++-
 3 files changed, 157 insertions(+), 17 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 901e25d495cc..122024ccc739 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1829,6 +1829,11 @@ struct super_operations {
 #else
 #define S_DAX  0   /* Make all the DAX code disappear */
 #endif
+#define S_HUGE_MODE0xc000
+#define S_HUGE_NEVER   0x
+#define S_HUGE_ALWAYS  0x4000
+#define S_HUGE_WITHIN_SIZE 0x8000
+#define S_HUGE_ADVISE  0xc000
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260b33de..a84f11a672f0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -191,14 +191,20 @@ static inline int page_cache_add_speculative(struct page 
*page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc_order(gfp_t gfp,
+   unsigned int order)
 {
-   return alloc_pages(gfp, 0);
+   return alloc_pages(gfp, order);
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+   return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
return __page_cache_alloc(mapping_gfp_mask(x));
@@ -215,6 +221,15 @@ static inline gfp_t readahead_gfp_mask(struct 
address_space *x)
  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
 }
 
+extern bool __page_cache_allow_huge(struct address_space *x, pgoff_t offset);
+static inline bool page_cache_allow_huge(struct address_space *x,
+   pgoff_t offset)
+{
+   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+   return false;
+   return __page_cache_allow_huge(x, offset);
+}
+
 typedef int filler_t(void *, struct page *);
 
 pgoff_t page_cache_next_hole(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f7f45f47d68..50afe17230e7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -637,14 +637,14 @@ static int __add_to_page_cache_locked(struct page *page,
  pgoff_t offset, gfp_t gfp_mask,
  void **shadowp)
 {
-   int huge = PageHuge(page);
+   int hugetlb = PageHuge(page);
struct mem_cgroup *memcg;
int error;
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-   if (!huge) {
+   if (!hugetlb) {
error = mem_cgroup_try_charge(page, current->mm,
  gfp_mask, &memcg, false);
if (error)
@@ -653,7 +653,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error) {
-   if (!huge)
+   if (!hugetlb)
mem_cgroup_cancel_charge(page, memcg, false);
return error;
}
@@ -663,16 +663,55 @@ static int __add_to_page_cache_locked(struct page *page,
page->index = offset;
 
spin_lock_irq(&mapping->tree_lock);
-   error = page_cache_tree_insert(mapping, page, shadowp);
+   if (PageTransHuge(page)) {
+   struct radix_tree_iter iter;
+   void **slot;
+   void *p;
+
+   error = 0;
+
+   /* Wipe shadow entires */
+   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 
offset) {
+   if (iter.index >= offset + HPAGE_PMD_NR)
+   break;
+
+   p = radix_tree_deref_slot_protected(slot,
+   &mapping->tree_lock);
+   if (!p)
+   continue;
+
+   if (!radix_tree_exception(p)) {
+   error = -EEXIST;
+   break;
+   }
+
+   mapping->nrexceptional--;
+   rcu_assign_pointer(*slot, NULL);
+   }
+
+   if (!error)
+   error = __radix_tree_insert

[PATCHv3 36/41] ext4: handle writeback with huge pages

2016-09-15 Thread Kirill A. Shutemov
Modify mpage_map_and_submit_buffers() and mpage_release_unused_pages()
to deal with huge pages.

Mostly the result of trial and error. A critical review would be appreciated.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 61 -
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6a8da1a8409c..f2e34e340e65 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1652,18 +1652,30 @@ static void mpage_release_unused_pages(struct 
mpage_da_data *mpd,
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
-   struct page *page = pvec.pages[i];
+   struct page *page = compound_head(pvec.pages[i]);
+
if (page->index > end)
break;
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
if (invalidate) {
-   block_invalidatepage(page, 0, PAGE_SIZE);
+   unsigned long offset, len;
+
+   offset = (index % hpage_nr_pages(page));
+   len = min_t(unsigned long, end - page->index,
+   hpage_nr_pages(page));
+
+   block_invalidatepage(page, offset << PAGE_SHIFT,
+   len << PAGE_SHIFT);
ClearPageUptodate(page);
}
unlock_page(page);
+   if (PageTransHuge(page))
+   break;
}
-   index = pvec.pages[nr_pages - 1]->index + 1;
+   index = page_to_pgoff(pvec.pages[nr_pages - 1]) + 1;
+   if (PageTransCompound(pvec.pages[nr_pages - 1]))
+   index = round_up(index, HPAGE_PMD_NR);
pagevec_release(&pvec);
}
 }
@@ -2097,16 +2109,16 @@ static int mpage_submit_page(struct mpage_da_data *mpd, 
struct page *page)
loff_t size = i_size_read(mpd->inode);
int err;
 
-   BUG_ON(page->index != mpd->first_page);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   page = compound_head(page);
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
clear_page_dirty_for_io(page);
err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc, false);
if (!err)
-   mpd->wbc->nr_to_write--;
-   mpd->first_page++;
+   mpd->wbc->nr_to_write -= hpage_nr_pages(page);
+   mpd->first_page = round_up(mpd->first_page + 1, hpage_nr_pages(page));
 
return err;
 }
@@ -2254,12 +2266,16 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
+   unsigned long diff;
 
-   if (page->index > end)
+   if (page_to_pgoff(page) > end)
break;
/* Up to 'end' pages must be contiguous */
-   BUG_ON(page->index != start);
+   BUG_ON(page_to_pgoff(page) != start);
+   diff = (page - compound_head(page)) << bpp_bits;
bh = head = page_buffers(page);
+   while (diff--)
+   bh = bh->b_this_page;
do {
if (lblk < mpd->map.m_lblk)
continue;
@@ -2296,7 +2312,10 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
 * supports blocksize < pagesize as we will try to
 * convert potentially unmapped parts of inode.
 */
-   mpd->io_submit.io_end->size += PAGE_SIZE;
+   if (PageTransCompound(page))
+   mpd->io_submit.io_end->size += HPAGE_PMD_SIZE;
+   else
+   mpd->io_submit.io_end->size += PAGE_SIZE;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
if (err < 0) {
@@ -2304,6 +2323,10 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
return err;
}
start++;
+   if (PageTransCompound(page)) {
+   start = round_up(start, 

[PATCHv3 29/41] ext4: make ext4_mpage_readpages() hugepage-aware

2016-09-15 Thread Kirill A. Shutemov
This patch modifies ext4_mpage_readpages() to deal with huge pages.

We read out 2M at once, so we have to allocate (HPAGE_PMD_NR *
blocks_per_page) sector_t entries for that. I'm not entirely happy with
kmalloc in this codepath, but I don't see any other option.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/readpage.c | 38 --
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index a81b829d56de..6d7cbddceeb2 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -104,12 +104,12 @@ int ext4_mpage_readpages(struct address_space *mapping,
 
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
-   const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t block_in_file;
sector_t last_block;
sector_t last_block_in_file;
-   sector_t blocks[MAX_BUF_PER_PAGE];
+   sector_t blocks_on_stack[MAX_BUF_PER_PAGE];
+   sector_t *blocks = blocks_on_stack;
unsigned page_block;
struct block_device *bdev = inode->i_sb->s_bdev;
int length;
@@ -122,8 +122,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
map.m_flags = 0;
 
for (; nr_pages; nr_pages--) {
-   int fully_mapped = 1;
-   unsigned first_hole = blocks_per_page;
+   int fully_mapped = 1, nr = nr_pages;
+   unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+   unsigned first_hole;
 
prefetchw(&page->flags);
if (pages) {
@@ -138,10 +139,31 @@ int ext4_mpage_readpages(struct address_space *mapping,
goto confused;
 
block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits);
-   last_block = block_in_file + nr_pages * blocks_per_page;
+
+   if (PageTransHuge(page)) {
+   BUILD_BUG_ON(BIO_MAX_PAGES < HPAGE_PMD_NR);
+   nr = HPAGE_PMD_NR * blocks_per_page;
+   /* XXX: need a better solution ? */
+   blocks = kmalloc(sizeof(sector_t) * nr, GFP_NOFS);
+   if (!blocks) {
+   if (pages) {
+   delete_from_page_cache(page);
+   goto next_page;
+   }
+   return -ENOMEM;
+   }
+
+   blocks_per_page *= HPAGE_PMD_NR;
+   last_block = block_in_file + blocks_per_page;
+   } else {
+   blocks = blocks_on_stack;
+   last_block = block_in_file + nr * blocks_per_page;
+   }
+
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> 
blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
+   first_hole = blocks_per_page;
page_block = 0;
 
/*
@@ -213,6 +235,8 @@ int ext4_mpage_readpages(struct address_space *mapping,
}
}
if (first_hole != blocks_per_page) {
+   if (PageTransHuge(page))
+   goto confused;
zero_user_segment(page, first_hole << blkbits,
  PAGE_SIZE);
if (first_hole == 0) {
@@ -248,7 +272,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
goto set_error_page;
}
bio = bio_alloc(GFP_KERNEL,
-   min_t(int, nr_pages, BIO_MAX_PAGES));
+   min_t(int, nr, BIO_MAX_PAGES));
if (!bio) {
if (ctx)
fscrypt_release_ctx(ctx);
@@ -289,5 +313,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
BUG_ON(pages && !list_empty(pages));
if (bio)
submit_bio(bio);
+   if (blocks != blocks_on_stack)
+   kfree(blocks);
return 0;
 }
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 11/41] thp: try to free page's buffers before attempt split

2016-09-15 Thread Kirill A. Shutemov
We want the page to be isolated from the rest of the system before splitting
it. We rely on the page count being 2 for file pages to make sure nobody
else uses the page: one pin for the caller, one for the radix-tree.

Filesystems with backing storage can have the page count increased if the
page has buffers.

Let's try to free them before attempting the split, and remove one guarding
VM_BUG_ON_PAGE().

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/buffer_head.h |  1 +
 mm/huge_memory.c| 19 ++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index ebbacd14d450..006a8a42acfb 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -395,6 +395,7 @@ extern int __set_page_dirty_buffers(struct page *page);
 #else /* CONFIG_BLOCK */
 
 static inline void buffer_init(void) {}
+static inline int page_has_buffers(struct page *page) { return 0; }
 static inline int try_to_free_buffers(struct page *page) { return 1; }
 static inline int inode_has_buffers(struct inode *inode) { return 0; }
 static inline void invalidate_inode_buffers(struct inode *inode) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 020a23d6e7f8..44bf0ba3d10f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2012,7 +2013,6 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
-   VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
if (PageAnon(head)) {
@@ -2041,6 +2041,23 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
goto out;
}
 
+   /* Try to free buffers before attempt split */
+   if (!PageSwapBacked(head) && PagePrivate(page)) {
+   /*
+* We cannot trigger writeback from here due possible
+* recursion if triggered from vmscan, only wait.
+*
+* Caller can trigger writeback it on its own, if safe.
+*/
+   wait_on_page_writeback(head);
+
+   if (page_has_buffers(head) &&
+   !try_to_free_buffers(head)) {
+   ret = -EBUSY;
+   goto out;
+   }
+   }
+
/* Addidional pin from radix tree */
extra_pins = 1;
anon_vma = NULL;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 24/41] fs: make block_write_{begin,end}() be able to handle huge pages

2016-09-15 Thread Kirill A. Shutemov
It's more or less straightforward.

Most changes are around getting the offset/len within the page right and
zeroing out the desired part of the page.

Signed-off-by: Kirill A. Shutemov 
---
 fs/buffer.c | 53 +++--
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2739f5dae690..7f50e5a63670 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1870,21 +1870,21 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
do {
block_end = block_start + bh->b_size;
 
-   if (buffer_new(bh)) {
-   if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (buffer_new(bh) && block_end > from && block_start < to) {
+   if (!PageUptodate(page)) {
+   unsigned start, size;
 
-   start = max(from, block_start);
-   size = min(to, block_end) - start;
+   start = max(from, block_start);
+   size = min(to, block_end) - start;
 
-   zero_user(page, start, size);
-   set_buffer_uptodate(bh);
-   }
-
-   clear_buffer_new(bh);
-   mark_buffer_dirty(bh);
+   zero_user(page + block_start / PAGE_SIZE,
+   start % PAGE_SIZE,
+   size);
+   set_buffer_uptodate(bh);
}
+
+   clear_buffer_new(bh);
+   mark_buffer_dirty(bh);
}
 
block_start = block_end;
@@ -1950,18 +1950,20 @@ iomap_to_bh(struct inode *inode, sector_t block, struct 
buffer_head *bh,
 int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
head = create_page_buffers(page, inode, 0);
@@ -2001,10 +2003,15 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page,
-   to, block_end,
-   block_start, from);
+   if (block_end > to || block_start < from) {
+   BUG_ON(to - from  > PAGE_SIZE);
+   zero_user_segments(page +
+   block_start / PAGE_SIZE,
+   to % PAGE_SIZE,
+   (block_start % PAGE_SIZE) + 
blocksize,
+   block_start % PAGE_SIZE,
+   from % PAGE_SIZE);
+   }
continue;
}
}
@@ -2048,6 +2055,7 @@ static int __block_commit_write(struct inode *inode, 
struct page *page,
unsigned blocksize;
struct buffer_head *bh, *head;
 
+   VM_BUG_ON_PAGE(PageTail(page), page);
bh = head = page_buffers(page);
blocksize = bh->b_size;
 
@@ -2114,7 +2122,8 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
struct inode *inode = mapping->host;
unsigned start;
 
-   start = pos & (PAGE_SIZE - 1);
+   page = compound_head(page);
+   start = pos & ~hpage_mask(page);
 
if (unlikely(copied < len)) {
/*
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCHv3 33/41] ext4: make ext4_block_write_begin() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
It simply matches changes to __block_write_begin_int().

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a07c055d8e78..31560b52eff2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1079,9 +1079,8 @@ int do_journal_get_write_access(handle_t *handle,
 static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
  get_block_t *get_block)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
@@ -1090,9 +1089,12 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;
bool decrypt = false;
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
if (!page_has_buffers(page))
@@ -1127,9 +1129,15 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page, to, block_end,
-  block_start, from);
+   if (block_end > to || block_start < from) {
+   BUG_ON(to - from  > PAGE_SIZE);
+   zero_user_segments(page +
+   block_start / PAGE_SIZE,
+   to % PAGE_SIZE,
+   (block_start % 
PAGE_SIZE) + blocksize,
+   block_start % PAGE_SIZE,
+   from % PAGE_SIZE);
+   }
continue;
}
}
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 32/41] ext4: handle huge pages in __ext4_block_zero_page_range()

2016-09-15 Thread Kirill A. Shutemov
As the function handles zeroing a range only within one block, the
required changes are trivial: just remove the assumption on page size.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9ed715ed00e7..a07c055d8e78 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3679,7 +3679,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
 {
ext4_fsblk_t index = from >> PAGE_SHIFT;
-   unsigned offset = from & (PAGE_SIZE-1);
+   unsigned offset;
unsigned blocksize, pos;
ext4_lblk_t iblock;
struct inode *inode = mapping->host;
@@ -3692,6 +3692,9 @@ static int __ext4_block_zero_page_range(handle_t *handle,
if (!page)
return -ENOMEM;
 
+   page = compound_head(page);
+   offset = from & ~hpage_mask(page);
+
blocksize = inode->i_sb->s_blocksize;
 
iblock = index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
@@ -3746,7 +3749,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
if (err)
goto unlock;
}
-   zero_user(page, offset, length);
+   zero_user(page + offset / PAGE_SIZE, offset % PAGE_SIZE, length);
BUFFER_TRACE(bh, "zeroed end of block");
 
if (ext4_should_journal_data(inode)) {
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 18/41] HACK: readahead: alloc huge pages, if allowed

2016-09-15 Thread Kirill A. Shutemov
Most page cache allocation happens via readahead (sync or async), so if
we want to have a significant number of huge pages in the page cache we
need to find a way to allocate them from readahead.

Unfortunately, huge pages don't fit into the current readahead design:
the 128 max readahead window, assumptions on page size, and
PageReadahead() to track hit/miss.

I haven't found a way to get it right yet.

This patch just allocates a huge page if allowed, but doesn't really
provide any readahead if a huge page is allocated. We read out 2M at a
time and I would expect spikes in latency without readahead.

Therefore HACK.

Having said that, I don't think it should prevent huge page support from
being applied. The future will show whether lacking readahead is a big
deal with huge pages in the page cache.

Any suggestions are welcome.

Signed-off-by: Kirill A. Shutemov 
---
 mm/readahead.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c8a955b1297e..f46a9080f6a9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,6 +174,21 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
if (page_offset > end_index)
break;
 
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
+   (!page_idx || !(page_offset % HPAGE_PMD_NR)) &&
+   page_cache_allow_huge(mapping, page_offset)) {
+   page = __page_cache_alloc_order(gfp_mask | __GFP_COMP,
+   HPAGE_PMD_ORDER);
+   if (page) {
+   prep_transhuge_page(page);
+   page->index = round_down(page_offset,
+   HPAGE_PMD_NR);
+   list_add(&page->lru, &page_pool);
+   ret++;
+   goto start_io;
+   }
+   }
+
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
@@ -189,7 +204,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
SetPageReadahead(page);
ret++;
}
-
+start_io:
/*
 * Now start the IO.  We ignore I/O errors - if the page is not
 * uptodate then the caller will launch readpage again, and
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 37/41] ext4: make EXT4_IOC_MOVE_EXT work with huge pages

2016-09-15 Thread Kirill A. Shutemov
Adjust how we find the relevant block within the page and how we clear
the required part of the page.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/move_extent.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index a920c5d29fac..f3efdd0e0eaf 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -210,7 +210,9 @@ mext_page_mkuptodate(struct page *page, unsigned from, 
unsigned to)
return err;
}
if (!buffer_mapped(bh)) {
-   zero_user(page, block_start, blocksize);
+   zero_user(page + block_start / PAGE_SIZE,
+   block_start % PAGE_SIZE,
+   blocksize);
set_buffer_uptodate(bh);
continue;
}
@@ -267,10 +269,11 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
unsigned int tmp_data_size, data_size, replaced_size;
int i, err2, jblocks, retries = 0;
int replaced_count = 0;
-   int from = data_offset_in_page << orig_inode->i_blkbits;
+   int from;
int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
struct super_block *sb = orig_inode->i_sb;
struct buffer_head *bh = NULL;
+   int diff;
 
/*
 * It needs twice the amount of ordinary journal buffers because
@@ -355,6 +358,9 @@ again:
goto unlock_pages;
}
 data_copy:
+   diff = (pagep[0] - compound_head(pagep[0])) * blocks_per_page;
+   from = (data_offset_in_page + diff) << orig_inode->i_blkbits;
+   pagep[0] = compound_head(pagep[0]);
*err = mext_page_mkuptodate(pagep[0], from, from + replaced_size);
if (*err)
goto unlock_pages;
@@ -384,7 +390,7 @@ data_copy:
if (!page_has_buffers(pagep[0]))
create_empty_buffers(pagep[0], 1 << orig_inode->i_blkbits, 0);
bh = page_buffers(pagep[0]);
-   for (i = 0; i < data_offset_in_page; i++)
+   for (i = 0; i < data_offset_in_page + diff; i++)
bh = bh->b_this_page;
for (i = 0; i < block_len_in_page; i++) {
*err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-09-15 Thread Kirill A. Shutemov
Most of the work happens on the head page. Only when we need to copy
data to userspace do we find the relevant subpage.

We are still limited by PAGE_SIZE per iteration. Lifting this limitation
would require some more work.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 50afe17230e7..b77bcf6843ee 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1860,6 +1860,7 @@ find_page:
if (unlikely(page == NULL))
goto no_cached_page;
}
+   page = compound_head(page);
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -1936,7 +1937,8 @@ page_ok:
 * now we can copy it to user space...
 */
 
-   ret = copy_page_to_iter(page, offset, nr, iter);
+   ret = copy_page_to_iter(page + index - page->index, offset,
+   nr, iter);
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
@@ -2356,6 +2358,7 @@ page_not_uptodate:
 * because there really aren't any performance issues here
 * and we need to check for errors.
 */
+   page = compound_head(page);
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
if (!error) {
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:52, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
>> Essentially NBD does supports FLUSH/FUA like this:
>> 
>> https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
>> 
>> IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).
>> 
>> Link to protocol (per last email) here:
>> 
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> Flush as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires to flush all already completed writes
> to non-volatile media.  It does not impose any ordering unlike the
> nbd spec.

As maintainer of the NBD spec, I'm confused as to why you think it
imposes any ordering - if you think this, clearly I need to clean up
the wording.

Here's what it says:

> The server MAY process commands out of order, and MAY reply out of order,
> except that:
> 
>   • All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
> that the server completes (i.e. replies to) prior to processing to a
> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to 
> that
> NBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set 
> within
> the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent by the
> client to the server.

(and another bit re FUA that isn't relevant here).

Here's the Linux Kernel documentation:

> The REQ_PREFLUSH flag can be OR ed into the r/w flags of a bio submitted from
> the filesystem and will make sure the volatile cache of the storage device
> has been flushed before the actual I/O operation is started.  This explicitly
> guarantees that previously completed write requests are on non-volatile
> storage before the flagged bio starts. In addition the REQ_PREFLUSH flag can 
> be
> set on an otherwise empty bio structure, which causes only an explicit cache
> flush without any dependent I/O.  It is recommend to use
> the blkdev_issue_flush() helper for a pure cache flush.


I believe that NBD treats NBD_CMD_FLUSH the same as a REQ_PREFLUSH and empty
bio.
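
For reference, the in-kernel counterpart of that "empty bio" flush is
roughly the helper below; this is a sketch from memory of the 4.8-era
block API, so treat the exact signature as an assumption rather than
gospel:

#include <linux/blkdev.h>

/* A standalone cache flush: an empty write bio carrying only the
 * pre-flush flag.  blkdev_issue_flush() builds such a bio and waits
 * for it to complete. */
static int flush_whole_device(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}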

If you don't read those two as compatible, I'd like to understand why not
(i.e. what additional constraints one is applying that the other is not)
as they are meant to be the same (save that NBD only has FLUSH as a command,
i.e. the 'empty bio' version). I am happy to improve the docs to make it
clearer.

(sidenote: I am interested in the change from REQ_FLUSH to REQ_PREFLUSH,
but in an empty bio it's not really relevant I think).

> FUA as defined by the Linux block layer (and supported that way in SCSI,
> ATA, NVMe) only requires the write operation the FUA bit is set on to be
> on non-volatile media before completing the write operation.  It does
> not impose any ordering, which seems to match the nbd spec.  Unlike the
> NBD spec Linux does not allow FUA to be set on anything by WRITE
> commands.  Some other storage protocols allow a FUA bit on READ
> commands or other commands that write data to the device, though.

I think you mean "anything *but* WRITE commands". In NBD setting
FUA on a command that does not write will do nothing, but FUA can
be set on NBD_CMD_TRIM and has the expected effect.

Interestingly the kernel docs are silent on which commands REQ_FUA
can be set on.

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 35/41] ext4: make ext4_da_page_release_reservation() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
For huge pages 'stop' must be within HPAGE_PMD_SIZE.
Let's use hpage_size() in the BUG_ON().

We also need to change how we calculate lblk for cluster deallocation.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index deacd3499ec7..6a8da1a8409c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1558,7 +1558,7 @@ static void ext4_da_page_release_reservation(struct page 
*page,
int num_clusters;
ext4_fsblk_t lblk;
 
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
@@ -1593,7 +1593,8 @@ static void ext4_da_page_release_reservation(struct page 
*page,
 * need to release the reserved space for that cluster. */
num_clusters = EXT4_NUM_B2C(sbi, to_release);
while (num_clusters > 0) {
-   lblk = (page->index << (PAGE_SHIFT - inode->i_blkbits)) +
+   lblk = ((page->index + offset / PAGE_SIZE) <<
+   (PAGE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
!ext4_find_delalloc_cluster(inode, lblk))
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 30/41] ext4: make ext4_writepage() work on huge pages

2016-09-15 Thread Kirill A. Shutemov
Change ext4_writepage() and underlying ext4_bio_write_page().

It basically removes the assumption on page size, inferring it from
struct page instead.

Signed-off-by: Kirill A. Shutemov 
---
 fs/ext4/inode.c   | 10 +-
 fs/ext4/page-io.c | 11 +--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6ea25a190f8..c9c05840f6cc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2020,10 +2020,10 @@ static int ext4_writepage(struct page *page,
 
trace_ext4_writepage(page);
size = i_size_read(inode);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
 
page_bufs = page_buffers(page);
/*
@@ -2047,7 +2047,7 @@ static int ext4_writepage(struct page *page,
   ext4_bh_delay_or_unwritten)) {
redirty_page_for_writepage(wbc, page);
if ((current->flags & PF_MEMALLOC) ||
-   (inode->i_sb->s_blocksize == PAGE_SIZE)) {
+   (inode->i_sb->s_blocksize == hpage_size(page))) {
/*
 * For memory cleaning there's no point in writing only
 * some buffers. So just bail out. Warn if we came here
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index a6132a730967..952957ee48b7 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -415,6 +415,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
+   BUG_ON(PageTail(page));
 
if (keep_towrite)
set_page_writeback_keepwrite(page);
@@ -431,8 +432,14 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 * the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   if (len < PAGE_SIZE)
-   zero_user_segment(page, len, PAGE_SIZE);
+   if (len < hpage_size(page)) {
+   page += len / PAGE_SIZE;
+   if (len % PAGE_SIZE)
+   zero_user_segment(page, len % PAGE_SIZE, PAGE_SIZE);
+   while (page + 1 == compound_head(page))
+   clear_highpage(++page);
+   page = compound_head(page);
+   }
/*
 * In the first loop we prepare and mark buffers to submit. We have to
 * mark all buffers in the page before submitting so that
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 19/41] block: define BIO_MAX_PAGES to HPAGE_PMD_NR if huge page cache enabled

2016-09-15 Thread Kirill A. Shutemov
We are going to do IO a huge page at a time, so we need BIO_MAX_PAGES to
be at least HPAGE_PMD_NR. For x86-64, that's 512 pages.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/bio.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 23ddf4b46a9b..ebf4f312a642 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -40,7 +40,11 @@
 #define BIO_BUG_ON
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+#define BIO_MAX_PAGES  (HPAGE_PMD_NR > 256 ? HPAGE_PMD_NR : 256)
+#else
 #define BIO_MAX_PAGES  256
+#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 28/41] mm, hugetlb: switch hugetlbfs to multi-order radix-tree entries

2016-09-15 Thread Kirill A. Shutemov
From: Naoya Horiguchi 

Currently, hugetlb pages are linked to the page cache on the basis of hugepage
offset (derived from vma_hugecache_offset()) for historical reasons, which
doesn't match the generic usage of the page cache and requires some routines
to convert page offset <=> hugepage offset in common paths. This patch
adjusts the code for the multi-order radix-tree to avoid that situation.

The main change is in the behavior of page->index for hugetlbfs. Before this
patch, it represented the hugepage offset, but with this patch it represents
the page offset, so index-related code has to be updated.
Note that hugetlb_fault_mutex_hash() and reservation region handling still
work with the hugepage offset.

Signed-off-by: Naoya Horiguchi 
[kirill.shute...@linux.intel.com: reject fixed]
Signed-off-by: Kirill A. Shutemov 
---
 fs/hugetlbfs/inode.c| 22 ++
 include/linux/pagemap.h | 10 +-
 mm/filemap.c| 30 ++
 mm/hugetlb.c| 19 ++-
 4 files changed, 35 insertions(+), 46 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4ea71eba40a5..fc918c0e33e9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,8 +388,8 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 {
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
-   const pgoff_t start = lstart >> huge_page_shift(h);
-   const pgoff_t end = lend >> huge_page_shift(h);
+   const pgoff_t start = lstart >> PAGE_SHIFT;
+   const pgoff_t end = lend >> PAGE_SHIFT;
struct vm_area_struct pseudo_vma;
struct pagevec pvec;
pgoff_t next;
@@ -447,8 +447,7 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 
i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
-   next * pages_per_huge_page(h),
-   (next + 1) * pages_per_huge_page(h));
+   next, next + 1);
i_mmap_unlock_write(mapping);
}
 
@@ -467,7 +466,8 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
freed++;
if (!truncate_op) {
if (unlikely(hugetlb_unreserve_pages(inode,
-   next, next + 1, 1)))
+   (next) << huge_page_order(h),
+   (next + 1) << 
huge_page_order(h), 1)))
hugetlb_fix_reserve_counts(inode,
rsv_on_error);
}
@@ -552,8 +552,6 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
struct hstate *h = hstate_inode(inode);
struct vm_area_struct pseudo_vma;
struct mm_struct *mm = current->mm;
-   loff_t hpage_size = huge_page_size(h);
-   unsigned long hpage_shift = huge_page_shift(h);
pgoff_t start, index, end;
int error;
u32 hash;
@@ -569,8 +567,8 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 * For this range, start is rounded down and end is rounded up
 * as well as being converted to page offsets.
 */
-   start = offset >> hpage_shift;
-   end = (offset + len + hpage_size - 1) >> hpage_shift;
+   start = offset >> PAGE_SHIFT;
+   end = (offset + len + huge_page_size(h) - 1) >> PAGE_SHIFT;
 
inode_lock(inode);
 
@@ -588,7 +586,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pseudo_vma.vm_file = file;
 
-   for (index = start; index < end; index++) {
+   for (index = start; index < end; index += pages_per_huge_page(h)) {
/*
 * This is supposed to be the vaddr where the page is being
 * faulted in, but we have no vaddr here.
@@ -609,10 +607,10 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
}
 
/* Set numa allocation policy based on index */
-   hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+   hugetlb_set_vma_policy(&pseudo_vma, inode, index >> 
huge_page_order(h));
 
/* addr is the offset within the file (zero based) */
-   addr = index * hpage_size;
+   addr = index << PAGE_SHIFT & ~huge_page_mask(h);
 
/* mutex taken here, fault path and hole punch */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
diff --git a/include/linux/pagemap.h b/include/linu

[PATCHv3 26/41] truncate: make truncate_inode_pages_range() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
As with shmem_undo_range(), truncate_inode_pages_range() removes huge
pages if they are fully within the range.

A partial truncate of a huge page zeroes out that part of the THP.

Unlike with shmem, this doesn't prevent us from having holes in the middle
of a huge page: we can still skip writeback of untouched buffers.

With memory-mapped IO we would lose holes in some cases when we have
THP in the page cache, since we cannot track accesses at the 4k level in
this case.

Signed-off-by: Kirill A. Shutemov 
---
 fs/buffer.c   |  2 +-
 mm/truncate.c | 95 ++-
 2 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e53808e790e2..20898b051044 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1534,7 +1534,7 @@ void block_invalidatepage(struct page *page, unsigned int 
offset,
/*
 * Check for overflow
 */
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
diff --git a/mm/truncate.c b/mm/truncate.c
index ce904e4b1708..9c339e6255f2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -90,7 +90,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
 {
void (*invalidatepage)(struct page *, unsigned int, unsigned int);
 
-   invalidatepage = page->mapping->a_ops->invalidatepage;
+   invalidatepage = page_mapping(page)->a_ops->invalidatepage;
 #ifdef CONFIG_BLOCK
if (!invalidatepage)
invalidatepage = block_invalidatepage;
@@ -116,7 +116,7 @@ truncate_complete_page(struct address_space *mapping, 
struct page *page)
return -EIO;
 
if (page_has_private(page))
-   do_invalidatepage(page, 0, PAGE_SIZE);
+   do_invalidatepage(page, 0, hpage_size(page));
 
/*
 * Some filesystems seem to re-dirty the page even after
@@ -288,6 +288,36 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
unlock_page(page);
continue;
}
+
+   if (PageTransTail(page)) {
+   /* Middle of THP: zero out the page */
+   clear_highpage(page);
+   if (page_has_private(page)) {
+   int off = page - compound_head(page);
+   do_invalidatepage(compound_head(page),
+   off * PAGE_SIZE,
+   PAGE_SIZE);
+   }
+   unlock_page(page);
+   continue;
+   } else if (PageTransHuge(page)) {
+   if (index == round_down(end, HPAGE_PMD_NR)) {
+   /*
+* Range ends in the middle of THP:
+* zero out the page
+*/
+   clear_highpage(page);
+   if (page_has_private(page)) {
+   do_invalidatepage(page, 0,
+   PAGE_SIZE);
+   }
+   unlock_page(page);
+   continue;
+   }
+   index += HPAGE_PMD_NR - 1;
+   i += HPAGE_PMD_NR - 1;
+   }
+
truncate_inode_page(mapping, page);
unlock_page(page);
}
@@ -309,9 +339,12 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
wait_on_page_writeback(page);
zero_user_segment(page, partial_start, top);
cleancache_invalidate_page(mapping, page);
-   if (page_has_private(page))
-   do_invalidatepage(page, partial_start,
- top - partial_start);
+   if (page_has_private(page)) {
+   int off = page - compound_head(page);
+   do_invalidatepage(compound_head(page),
+   off * PAGE_SIZE + partial_start,
+   top - partial_start);
+   }
unlock_page(page);
put_page(page);
}
@@ -322,9 +355,12 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
wait_on_page_writeback(page);
ze

[PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-09-15 Thread Kirill A. Shutemov
We write back a whole huge page at a time.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 05b42d3e5ed8..53da93156e60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -372,9 +372,14 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
if (page->index > end)
continue;
 
+   page = compound_head(page);
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR;
+   i += index - pvec.pages[i]->index - 1;
+   }
}
pagevec_release(&pvec);
cond_resched();
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Christoph,

> It's not a write barrier - a write barrier was a command that ensured that
> 
> a) all previous writes were completed to the host/client
> b) all previous writes were on non-volatile storage
> 
> and
> 
> c) the actual write with the barrier bit was on non-volatile storage

Ah! The bit you are complaining about is not the bit I pointed you to, but:

> NBD_CMD_FLUSH (3)
> 
> A flush request; a write barrier. 

I can see that's potentially confusing, as it isn't meant to mean 'an
old-style linux kernel block device write barrier'. I think in general
terms it probably is some form of barrier, but I see no problem in
deleting the words "a write barrier" from the spec text if only to make
it clearer. However, I think the description of the command itself:

> The server MUST NOT send a successful reply header for this request before 
> all write requests for which a reply has already been sent to the client have 
> reached permanent storage (using fsync() or similar).

and the ordering section I pointed you to before, were both correct, yes?


>> The point still remains that "X was sent before Y" is difficult to
>> determine on the client side if X was sent over a different TCP channel
>> than Y, because a packet might be dropped (requiring a retransmit) for
>> X, and similar things. If blk-mq can deal with that, we're good and
>> nothing further needs to be done. If not, this should be evaluated by
>> someone more familiar with the internals of the kernel block subsystem
>> than me.
> 
> The important bit in all the existing protocols, and which Linux relies
> on, is that any write the Linux block layer got a completion for earlier
> needs to be flushed out to non-volatile storage when a FLUSH command is
> sent.  Anything that is still in flight does not matter.  Which for
> NBD means anything that you have already replied to needs to be flushed.

... that's what it says (I hope).

> Or to say it more practically - in the nbd server you simply need to
> call fdatasync on the backing device or file whenever you get a FLUSH
> request, and it will do the right thing.

actually fdatasync() technically does more than is necessary, as it
will also flush commands that have been processed, but for which no
reply has yet been sent - that's no bad thing.

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 09/41] page-flags: relax page flag policy for few flags

2016-09-15 Thread Kirill A. Shutemov
These flags are in use for filesystems with backing storage: PG_error,
PG_writeback and PG_readahead.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/page-flags.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..a2bef9a41bcf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,7 +253,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
-PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
PF_NO_COMPOUND)
+PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
@@ -293,15 +293,15 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
-   TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
+   TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
 PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-   TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_NO_TAIL)
+   TESTCLEARFLAG(Readahead, reclaim, PF_NO_TAIL)
 
 #ifdef CONFIG_HIGHMEM
 /*
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 23/41] fs: make block_read_full_page() be able to read huge page

2016-09-15 Thread Kirill A. Shutemov
The approach is straightforward: for compound pages we read out the whole
huge page.

For a huge page we cannot have the array of buffer head pointers on the
stack -- it's 4096 pointers on x86-64 -- so 'arr' is allocated with
kmalloc() for huge pages.

Signed-off-by: Kirill A. Shutemov 
---
 fs/buffer.c | 22 +-
 include/linux/buffer_head.h |  9 +
 include/linux/page-flags.h  |  2 +-
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b6db6a..2739f5dae690 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -870,7 +870,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
 
 try_again:
head = NULL;
-   offset = PAGE_SIZE;
+   offset = hpage_size(page);
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
@@ -1466,7 +1466,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
bh->b_page = page;
-   BUG_ON(offset >= PAGE_SIZE);
+   BUG_ON(offset >= hpage_size(page));
if (PageHighMem(page))
/*
 * This catches illegal uses and preserves the offset:
@@ -2239,11 +2239,13 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
 {
struct inode *inode = page->mapping->host;
sector_t iblock, lblock;
-   struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+   struct buffer_head *arr_on_stack[MAX_BUF_PER_PAGE];
+   struct buffer_head *bh, *head, **arr = arr_on_stack;
unsigned int blocksize, bbits;
int nr, i;
int fully_mapped = 1;
 
+   VM_BUG_ON_PAGE(PageTail(page), page);
head = create_page_buffers(page, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
@@ -2254,6 +2256,11 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
nr = 0;
i = 0;
 
+   if (PageTransHuge(page)) {
+   arr = kmalloc(sizeof(struct buffer_head *) * HPAGE_PMD_NR *
+   MAX_BUF_PER_PAGE, GFP_NOFS);
+   }
+
do {
if (buffer_uptodate(bh))
continue;
@@ -2269,7 +2276,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
SetPageError(page);
}
if (!buffer_mapped(bh)) {
-   zero_user(page, i * blocksize, blocksize);
+   zero_user(page + (i * blocksize / PAGE_SIZE),
+   i * blocksize % PAGE_SIZE,
+   blocksize);
if (!err)
set_buffer_uptodate(bh);
continue;
@@ -2295,7 +2304,7 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
if (!PageError(page))
SetPageUptodate(page);
unlock_page(page);
-   return 0;
+   goto out;
}
 
/* Stage two: lock the buffers */
@@ -2317,6 +2326,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
else
submit_bh(REQ_OP_READ, 0, bh);
}
+out:
+   if (arr != arr_on_stack)
+   kfree(arr);
return 0;
 }
 EXPORT_SYMBOL(block_read_full_page);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 006a8a42acfb..194a85822d5f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -131,13 +131,14 @@ BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
-#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~hpage_mask(bh->b_page))
 
 /* If we *know* page->private refers to buffer_heads */
-#define page_buffers(page) \
+#define page_buffers(__page)   \
({  \
-   BUG_ON(!PagePrivate(page)); \
-   ((struct buffer_head *)page_private(page)); \
+   struct page *p = compound_head(__page); \
+   BUG_ON(!PagePrivate(p));\
+   ((struct buffer_head *)page_private(p));\
})
 #define page_has_buffers(page) PagePrivate(page)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a2bef9a41bcf..20b7684e9298 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -730,7 +730,7 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)
  */
 static inline int page_has_private(struct page *page)
 {
-   return !!(

[PATCHv3 21/41] thp: introduce hpage_size() and hpage_mask()

2016-09-15 Thread Kirill A. Shutemov
Introduce new helpers which return the size/mask of the page:
HPAGE_PMD_SIZE/HPAGE_PMD_MASK if the page is PageTransHuge() and
PAGE_SIZE/PAGE_MASK otherwise.

Signed-off-by: Kirill A. Shutemov 
---
 include/linux/huge_mm.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6f14de45b5ce..de2789b4402c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -138,6 +138,20 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
 }
 
+static inline int hpage_size(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_SIZE;
+   return PAGE_SIZE;
+}
+
+static inline unsigned long hpage_mask(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_MASK;
+   return PAGE_MASK;
+}
+
 extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
@@ -163,6 +177,8 @@ void put_huge_zero_page(void);
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_size(x) PAGE_SIZE
+#define hpage_mask(x) PAGE_MASK
 
 #define transparent_hugepage_enabled(__vma) 0
 
-- 
2.9.3
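
As a quick illustration of how the helpers above are meant to be used
(blocks_per_hpage() is a hypothetical example, not part of the patchset):

        /* Number of blocks covered by a page; works for both regular pages
         * and THP head pages, because hpage_size() returns HPAGE_PMD_SIZE
         * for PageTransHuge() pages and PAGE_SIZE otherwise. */
        static inline unsigned int blocks_per_hpage(struct page *page,
                                                    unsigned int blkbits)
        {
                return hpage_size(page) >> blkbits;
        }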



[PATCHv3 10/41] mm, rmap: account file thp pages

2016-09-15 Thread Kirill A. Shutemov
Let's add FileHugePages and FilePmdMapped fields to meminfo and smaps.
They indicate how many file THPs are allocated and mapped.

Signed-off-by: Kirill A. Shutemov 
---
 drivers/base/node.c|  6 ++
 fs/proc/meminfo.c  |  4 
 fs/proc/task_mmu.c |  5 -
 include/linux/mmzone.h |  2 ++
 mm/filemap.c   |  3 ++-
 mm/huge_memory.c   |  5 -
 mm/page_alloc.c|  5 +
 mm/rmap.c  | 12 
 mm/vmstat.c|  2 ++
 9 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..45be0ddb84ed 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,6 +116,8 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d AnonHugePages:  %8lu kB\n"
   "Node %d ShmemHugePages: %8lu kB\n"
   "Node %d ShmemPmdMapped: %8lu kB\n"
+  "Node %d FileHugePages: %8lu kB\n"
+  "Node %d FilePmdMapped: %8lu kB\n"
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -139,6 +141,10 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
   HPAGE_PMD_NR),
   nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_THPS) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
   HPAGE_PMD_NR));
 #else
   nid, K(sum_zone_node_page_state(nid, 
NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b9a8c813e5e6..fc8a487bc7ed 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -107,6 +107,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"AnonHugePages:  %8lu kB\n"
"ShmemHugePages: %8lu kB\n"
"ShmemPmdMapped: %8lu kB\n"
+   "FileHugePages:  %8lu kB\n"
+   "FilePmdMapped:  %8lu kB\n"
 #endif
 #ifdef CONFIG_CMA
"CmaTotal:   %8lu kB\n"
@@ -167,6 +169,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, K(global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
, K(global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
, K(global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+   , K(global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR)
+   , K(global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR)
 #endif
 #ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..9a1cc4a3407a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,6 +449,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long shmem_thp;
+   unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -584,7 +585,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
else if (is_zone_device_page(page))
/* pass */;
else
-   VM_BUG_ON_PAGE(1, page);
+   mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
@@ -779,6 +780,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   "Anonymous:  %8lu kB\n"
   "AnonHugePages:  %8lu kB\n"
   "ShmemPmdMapped: %8lu kB\n"
+  "FilePmdMapped:  %8lu kB\n"
   "Shared_Hugetlb: %8lu kB\n"
   "Private_Hugetlb: %7lu kB\n"
   "Swap:   %8lu kB\n"
@@ -797,6 +799,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   mss.anonymous >> 10,
   mss.anonymous_thp >> 10,
   mss.shmem_thp >> 10,
+  mss.file_thp >> 10,
   mss.shared_hugetlb >> 10,
   mss.private_hugetlb >> 10,
   mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99e5daf..20c5fce13697 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,8 @@ enum node_stat_item {
NR_SHMEM,   /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
+   NR_FILE_THPS,
+   NR_FILE_PMDMAPPED,
NR_ANON_THPS,
NR_UNSTABLE_NFS,/* NFS unstable pages */
NR_VMSCAN_WRITE,
diff --git a/mm/filemap.c b/mm/filemap.c
index ac3a39b1fe6d..6f7f45f47d68 100644
--- a/mm/filemap.c
+++ b/mm/file

[PATCHv3 02/41] radix tree test suite: Allow GFP_ATOMIC allocations to fail

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

In order to test the preload code, it is necessary to fail GFP_ATOMIC
allocations, which requires defining GFP_KERNEL and GFP_ATOMIC properly.
Remove the obsolete __GFP_WAIT and copy the definitions of the __GFP
flags which are used from the kernel include files.  We also need the
real definition of gfpflags_allow_blocking() to persuade the radix tree
to actually use its preallocated nodes.

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 tools/testing/radix-tree/linux.c  |  7 ++-
 tools/testing/radix-tree/linux/gfp.h  | 22 +++---
 tools/testing/radix-tree/linux/slab.h |  5 -
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index 154823737b20..3cfb04e98e2f 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -33,7 +33,12 @@ mempool_t *mempool_create(int min_nr, mempool_alloc_t 
*alloc_fn,
 
 void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
 {
-   void *ret = malloc(cachep->size);
+   void *ret;
+
+   if (flags & __GFP_NOWARN)
+   return NULL;
+
+   ret = malloc(cachep->size);
if (cachep->ctor)
cachep->ctor(ret);
uatomic_inc(&nr_allocated);
diff --git a/tools/testing/radix-tree/linux/gfp.h 
b/tools/testing/radix-tree/linux/gfp.h
index 5201b915f631..5b09b2ce6c33 100644
--- a/tools/testing/radix-tree/linux/gfp.h
+++ b/tools/testing/radix-tree/linux/gfp.h
@@ -3,8 +3,24 @@
 
 #define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
-#define __GFP_WAIT 1
-#define __GFP_ACCOUNT 0
-#define __GFP_NOWARN 0
+
+#define __GFP_HIGH 0x20u
+#define __GFP_IO   0x40u
+#define __GFP_FS   0x80u
+#define __GFP_NOWARN   0x200u
+#define __GFP_ATOMIC   0x8u
+#define __GFP_ACCOUNT  0x10u
+#define __GFP_DIRECT_RECLAIM   0x40u
+#define __GFP_KSWAPD_RECLAIM   0x200u
+
+#define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
+
+#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+
+static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
+{
+   return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
+}
 
 #endif
diff --git a/tools/testing/radix-tree/linux/slab.h 
b/tools/testing/radix-tree/linux/slab.h
index 6d5a34770fd4..452e2bf502e3 100644
--- a/tools/testing/radix-tree/linux/slab.h
+++ b/tools/testing/radix-tree/linux/slab.h
@@ -7,11 +7,6 @@
 #define SLAB_PANIC 2
 #define SLAB_RECLAIM_ACCOUNT0x0002UL/* Objects are 
reclaimable */
 
-static inline int gfpflags_allow_blocking(gfp_t mask)
-{
-   return 1;
-}
-
 struct kmem_cache {
int size;
void (*ctor)(void *);
-- 
2.9.3



[PATCHv3 07/41] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2016-09-15 Thread Kirill A. Shutemov
We would need to use multi-order radix-tree entries for ext4 and other
filesystems to have a coherent view of tags (dirty/towrite) in the tree.

This patch converts the huge tmpfs implementation to multi-order entries, so
we will be able to use the same code path for all filesystems.

Signed-off-by: Kirill A. Shutemov 
---
 mm/filemap.c | 320 +--
 mm/huge_memory.c |  47 +---
 mm/khugepaged.c  |  26 ++---
 mm/shmem.c   |  36 ++-
 4 files changed, 247 insertions(+), 182 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287dfc5372..ac3a39b1fe6d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -114,7 +114,7 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
   struct page *page, void *shadow)
 {
struct radix_tree_node *node;
-   int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+   int nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
@@ -132,36 +132,32 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
}
mapping->nrpages -= nr;
 
-   for (i = 0; i < nr; i++) {
-   node = radix_tree_replace_clear_tags(&mapping->page_tree,
-   page->index + i, shadow);
-   if (!node) {
-   VM_BUG_ON_PAGE(nr != 1, page);
-   return;
-   }
+   node = radix_tree_replace_clear_tags(&mapping->page_tree,
+   page->index, shadow);
+   if (!node)
+   return;
 
-   workingset_node_pages_dec(node);
-   if (shadow)
-   workingset_node_shadows_inc(node);
-   else
-   if (__radix_tree_delete_node(&mapping->page_tree, node))
-   continue;
+   workingset_node_pages_dec(node);
+   if (shadow)
+   workingset_node_shadows_inc(node);
+   else
+   if (__radix_tree_delete_node(&mapping->page_tree, node))
+   return;
 
-   /*
-* Track node that only contains shadow entries. DAX mappings
-* contain no shadow entries and may contain other exceptional
-* entries so skip those.
-*
-* Avoid acquiring the list_lru lock if already tracked.
-* The list_empty() test is safe as node->private_list is
-* protected by mapping->tree_lock.
-*/
-   if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
-   list_empty(&node->private_list)) {
-   node->private_data = mapping;
-   list_lru_add(&workingset_shadow_nodes,
-   &node->private_list);
-   }
+   /*
+* Track node that only contains shadow entries. DAX mappings
+* contain no shadow entries and may contain other exceptional
+* entries so skip those.
+*
+* Avoid acquiring the list_lru lock if already tracked.
+* The list_empty() test is safe as node->private_list is
+* protected by mapping->tree_lock.
+*/
+   if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
+   list_empty(&node->private_list)) {
+   node->private_data = mapping;
+   list_lru_add(&workingset_shadow_nodes,
+   &node->private_list);
}
 }
 
@@ -264,12 +260,7 @@ void delete_from_page_cache(struct page *page)
if (freepage)
freepage(page);
 
-   if (PageTransHuge(page) && !PageHuge(page)) {
-   page_ref_sub(page, HPAGE_PMD_NR);
-   VM_BUG_ON_PAGE(page_count(page) <= 0, page);
-   } else {
-   put_page(page);
-   }
+   put_page(page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
@@ -1073,7 +1064,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
void **pagep;
-   struct page *head, *page;
+   struct page *page;
 
rcu_read_lock();
 repeat:
@@ -1094,25 +1085,25 @@ repeat:
goto out;
}
 
-   head = compound_head(page);
-   if (!page_cache_get_speculative(head))
+   if (!page_cache_get_speculative(page))
goto repeat;
 
-   /* The page was split under us? */
-   if (compound_head(page) != head) {
-   put_page(head);
-   goto repeat;
-   }
-
/*
 * Has the page moved?
 * This is part of the lockless pagecache protocol. See
 * include/linux/pagemap.h for det

[PATCHv3 03/41] radix-tree: Add radix_tree_join

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

This new function allows for the replacement of many smaller entries in
the radix tree with one larger multiorder entry.  From the point of view
of an RCU walker, it may see a mixture of the smaller entries and the
large entry during the same walk, but it will never see NULL for an
index which was populated before the join.
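
For illustration, the intended use is a single call over the affected range
(a sketch, not code from this patch; the caller is assumed to hold the tree
lock):

        /* Sketch: replace the order-0 slots covering a compound page with a
         * single multi-order entry pointing at the page. */
        static int cache_install_huge(struct address_space *mapping,
                                      struct page *page)
        {
                return radix_tree_join(&mapping->page_tree, page->index,
                                       compound_order(page), page);
        }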

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 include/linux/radix-tree.h|   2 +
 lib/radix-tree.c  | 159 +++---
 tools/testing/radix-tree/multiorder.c |  32 +++
 3 files changed, 163 insertions(+), 30 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4c45105dece3..75ae4648d13d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -319,6 +319,8 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_join(struct radix_tree_root *, unsigned long index,
+   unsigned new_order, void *);
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1b7bf7314141..3157f223c268 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -314,18 +314,14 @@ static void radix_tree_node_rcu_free(struct rcu_head 
*head)
 {
struct radix_tree_node *node =
container_of(head, struct radix_tree_node, rcu_head);
-   int i;
 
/*
-* must only free zeroed nodes into the slab. radix_tree_shrink
-* can leave us with a non-NULL entry in the first slot, so clear
-* that here to make sure.
+* Must only free zeroed nodes into the slab.  We can be left with
+* non-NULL entries by radix_tree_free_nodes, so clear the entries
+* and tags here.
 */
-   for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
-   tag_clear(node, i, 0);
-
-   node->slots[0] = NULL;
-   node->count = 0;
+   memset(node->slots, 0, sizeof(node->slots));
+   memset(node->tags, 0, sizeof(node->tags));
 
kmem_cache_free(radix_tree_node_cachep, node);
 }
@@ -563,14 +559,14 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
shift = radix_tree_load_root(root, &child, &maxindex);
 
/* Make sure the tree is high enough.  */
+   if (order > 0 && max == ((1UL << order) - 1))
+   max++;
if (max > maxindex) {
int error = radix_tree_extend(root, max, shift);
if (error < 0)
return error;
shift = error;
child = root->rnode;
-   if (order == shift)
-   shift += RADIX_TREE_MAP_SHIFT;
}
 
while (shift > order) {
@@ -582,6 +578,7 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
return -ENOMEM;
child->shift = shift;
child->offset = offset;
+   child->count = 0;
child->parent = node;
rcu_assign_pointer(*slot, node_to_entry(child));
if (node)
@@ -595,31 +592,113 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
slot = &node->slots[offset];
}
 
+   if (nodep)
+   *nodep = node;
+   if (slotp)
+   *slotp = slot;
+   return 0;
+}
+
 #ifdef CONFIG_RADIX_TREE_MULTIORDER
-   /* Insert pointers to the canonical entry */
-   if (order > shift) {
-   unsigned i, n = 1 << (order - shift);
+/*
+ * Free any nodes below this node.  The tree is presumed to not need
+ * shrinking, and any user data in the tree is presumed to not need a
+ * destructor called on it.  If we need to add a destructor, we can
+ * add that functionality later.  Note that we may not clear tags or
+ * slots from the tree as an RCU walker may still have a pointer into
+ * this subtree.  We could replace the entries with RADIX_TREE_RETRY,
+ * but we'll still have to clear those in rcu_free.
+ */
+static void radix_tree_free_nodes(struct radix_tree_node *node)
+{
+   unsigned offset = 0;
+   struct radix_tree_node *child = entry_to_node(node);
+
+   for (;;) {
+   void *entry = child->slots[offset];
+   if (radix_tree_is_internal_node(entry) &&
+   !is_sibling_entry(child, entry)) {
+   child = entry_to_node(entry);
+   offset = 0;
+   continue;
+   }
+   offset++;
+   while (offset == RADIX_TREE_MAP_SIZE) {
+   struct radix_tree_node *old = child;
+   offset = child->offset + 1;
+   child = child->parent;
+   radix_tree_node_free(old);
+  

[PATCHv3 04/41] radix-tree: Add radix_tree_split

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox 

This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries).  These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused.  The caller should then call radix_tree_for_each_slot() and
radix_tree_replace_slot() in order to turn these retry entries into the
intended new entries.  Tags are replicated from the original multiorder
entry into each new entry.
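
Roughly, the calling pattern described above looks like this (a sketch only;
new_entry_for() is an illustrative helper, index is the first index covered
by the old entry, and the caller is assumed to hold the tree lock):

        static int split_to_order0(struct radix_tree_root *root,
                                   unsigned long index, unsigned long nr_slots)
        {
                struct radix_tree_iter iter;
                void **slot;
                int err;

                /* Leaves RADIX_TREE_RETRY in each new slot. */
                err = radix_tree_split(root, index, 0);
                if (err)
                        return err;

                radix_tree_for_each_slot(slot, root, &iter, index) {
                        if (iter.index >= index + nr_slots)
                                break;  /* stay within the old entry */
                        radix_tree_replace_slot(slot, new_entry_for(iter.index));
                }
                return 0;
        }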

Signed-off-by: Matthew Wilcox 
Signed-off-by: Kirill A. Shutemov 
---
 include/linux/radix-tree.h|   6 +-
 lib/radix-tree.c  | 109 --
 tools/testing/radix-tree/multiorder.c |  26 
 3 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 75ae4648d13d..459e8a152c8a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -280,8 +280,7 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
  struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
-struct radix_tree_node *radix_tree_replace_clear_tags(
-   struct radix_tree_root *root,
+struct radix_tree_node *radix_tree_replace_clear_tags(struct radix_tree_root *,
unsigned long index, void *entry);
 unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
void **results, unsigned long first_index,
@@ -319,8 +318,11 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_split(struct radix_tree_root *, unsigned long index,
+   unsigned new_order);
 int radix_tree_join(struct radix_tree_root *, unsigned long index,
unsigned new_order, void *);
+
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3157f223c268..ad3116cbe61b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -231,7 +231,10 @@ static void dump_node(struct radix_tree_node *node, 
unsigned long index)
void *entry = node->slots[i];
if (!entry)
continue;
-   if (is_sibling_entry(node, entry)) {
+   if (entry == RADIX_TREE_RETRY) {
+   pr_debug("radix retry offset %ld indices %ld-%ld\n",
+   i, first, last);
+   } else if (is_sibling_entry(node, entry)) {
pr_debug("radix sblng %p offset %ld val %p indices 
%ld-%ld\n",
entry, i,
*(void **)entry_to_node(entry),
@@ -641,7 +644,10 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
unsigned i, n, tag, offset, tags = 0;
 
if (node) {
-   n = 1 << (order - node->shift);
+   if (order > node->shift)
+   n = 1 << (order - node->shift);
+   else
+   n = 1;
offset = get_slot_offset(node, slot);
} else {
n = 1;
@@ -680,7 +686,8 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
tag_set(node, tag, offset);
}
if (radix_tree_is_internal_node(old) &&
-   !is_sibling_entry(node, old))
+   !is_sibling_entry(node, old) &&
+   (old != RADIX_TREE_RETRY))
radix_tree_free_nodes(old);
}
if (node)
@@ -843,6 +850,98 @@ int radix_tree_join(struct radix_tree_root *root, unsigned 
long index,
 
return error;
 }
+
+int radix_tree_split(struct radix_tree_root *root, unsigned long index,
+   unsigned order)
+{
+   struct radix_tree_node *parent, *node, *child;
+   void **slot;
+   unsigned int offset, end;
+   unsigned n, tag, tags = 0;
+
+   if (!__radix_tree_lookup(root, index, &parent, &slot))
+   return -ENOENT;
+   if (!parent)
+   return -ENOENT;
+
+   offset = get_slot_offset(parent, slot);
+
+   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+   if (tag_get(parent, tag, offset))
+   tags |= 1 << tag;
+
+   for (end = offset + 1; end < RADIX_TREE_MAP_SIZE; end++) {
+   if (!is_sibling_entry(parent, parent->slots[end]))
+   break;
+   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+   if (tags & (1 << tag))
+   tag_set(parent, tag, end);
+   /* tags must be set before RETR

[PATCHv3 00/41] ext4: support of huge pages

2016-09-15 Thread Kirill A. Shutemov
Here's a respin of my huge-pages-in-ext4 patchset on top of v4.8-rc6, with a
couple of fixes (see below).

Please review and consider applying.

I don't see any xfstests regressions with huge pages enabled. A patch with
new configurations for xfstests-bld is below.

The basics are the same as with tmpfs[1], which is in Linus' tree now, with
ext4 built on top of it. The main difference is that we need to handle
read-out from and write-back to backing storage.

The head page links the buffers for the whole huge page. Dirty/writeback
tracking happens at the per-hugepage level.

We read out the whole huge page at once. That required bumping BIO_MAX_PAGES
to not less than HPAGE_PMD_NR; I defined BIO_MAX_PAGES to HPAGE_PMD_NR if the
huge pagecache is enabled.

On split_huge_page() we need to free the buffers before splitting the page.
Page buffers take an additional pin on the page and can be a vector for
messing with the page during the split, which we want to avoid.
If try_to_free_buffers() fails, split_huge_page() returns -EBUSY.

Readahead doesn't play well with huge pages: the 128k max readahead window,
assumptions about page size, and PageReadahead() hit/miss tracking all get in
the way.  I've got it to allocate huge pages, but it doesn't provide any
readahead as such. I don't know how to do this right, and it's not clear at
this point whether we really need readahead with huge pages. I guess it's
good enough for now.

Shadow entries are ignored on allocation -- a recently evicted page is not
promoted to the active list. I'm not sure the current workingset logic is
adequate for huge pages. On eviction, we split the huge page and set up 4k
shadow entries as usual.

Unlike tmpfs, ext4 makes use of tags in the radix-tree. The approach I used
for tmpfs -- 512 radix-tree entries per huge page -- doesn't work well if we
want a coherent view of tags, so the first 8 patches of the patchset convert
tmpfs to use multi-order entries in the radix-tree. The same infrastructure
is used for ext4.

Encryption doesn't handle huge pages yet. To avoid regressions we just
disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.

Tested with 4k, 1k, encryption and bigalloc. All with and without
huge=always. I think it's reasonable coverage.

The patchset is also in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v3

[1] 
http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shute...@linux.intel.com

Changes since v2:
  - fix intermittent crash in generic/299;
  - typo (condition inversion) in do_generic_file_read(),
reported by Jitendra;

TODO:
  - readahead ?;
  - wire up madvise()/fadvise();
  - encryption with huge pages;
  - reclaim of file huge pages can be optimized -- split_huge_page() is not
required for pages with backing storage;

From f523dd3aad026f5a3f8cbabc0ec69958a0618f6b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" 
Date: Fri, 12 Aug 2016 19:44:30 +0300
Subject: [PATCH] Add few more configurations to test ext4 with huge pages

Four new configurations: huge_4k, huge_1k, huge_bigalloc, huge_encrypt.

Signed-off-by: Kirill A. Shutemov 
---
 .../test-appliance/files/root/fs/ext4/cfg/all.list   |  4 
 .../test-appliance/files/root/fs/ext4/cfg/huge_1k|  6 ++
 .../test-appliance/files/root/fs/ext4/cfg/huge_4k|  6 ++
 .../test-appliance/files/root/fs/ext4/cfg/huge_bigalloc  | 14 ++
 .../files/root/fs/ext4/cfg/huge_bigalloc.exclude |  7 +++
 .../test-appliance/files/root/fs/ext4/cfg/huge_encrypt   |  5 +
 .../files/root/fs/ext4/cfg/huge_encrypt.exclude  | 16 
 kvm-xfstests/util/parse_cli  |  1 +
 8 files changed, 59 insertions(+)
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude

diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
index 7ec37f4bafaa..14a8e72d2e6e 100644
--- a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
@@ -9,3 +9,7 @@ dioread_nolock
 data_journal
 bigalloc
 bigalloc_1k
+huge_4k
+huge_1k
+huge_bigalloc
+huge_encrypt
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
new file mode 100644
index ..209c76a8a6c1
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
@@ -0,0 +1,6 @@
+export FS=ext4
+export TEST_DEV=$SM_TST_DEV
+export TEST_DIR=$SM_TST_MNT
+export MKFS_OPTIONS="-q -b 1024"
+export EXT_MOUNT_OPTIONS="hu

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
> A while back, we spent quite some time defining the semantics of the
> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> write barriers. At the time, we decided that it would be unreasonable
> to expect servers to make these write barriers effective across
> different connections.

Do you have an NBD protocol specification?  Treating a flush or FUA
as any sort of barrier is incredibly stupid.  Is it really documented
that way, and if yes, why?


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 02:01:59PM +0200, Wouter Verhelst wrote:
> Yes. There was some discussion on that part, and we decided that setting
> the flag doesn't hurt, but the spec also clarifies that using it on READ
> does nothing, semantically.
>
> 
> The problem is that there are clients in the wild which do set it on
> READ, so it's just a matter of "be liberal in what you accept".

Note that FUA on READ in SCSI and NVMe does have a meaning - it
requires you to bypass any sort of cache on the target.  I think it's a
wrong definition because it mandates implementation details that aren't
observable by the initiator, but it's still the spec wording and NBD
diverges from it.  That's not necessarily a bad thing, but a caveat to
look out for.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 01:11:24PM +0100, Alex Bligh wrote:
> > NBD_CMD_FLUSH (3)
> > 
> > A flush request; a write barrier. 
> 
> I can see that's potentially confusing as it isn't meant to mean 'an
> old-style Linux kernel block device write barrier'. I think in general
> terms it probably is some form of barrier, but I see no problem in
> deleting the words "a write barrier" from the spec text if only to make
> it clearer.

Yes, please do that.  A "barrier" implies draining of the queue.

> However, I think the description of the command itself:
> 
> > The server MUST NOT send a successful reply header for this request before 
> > all write requests for which a reply has already been sent to the client 
> > have reached permanent storage (using fsync() or similar).
> 
> and the ordering section I pointed you to before, were both correct, yes?

Yes, this seems correct.

> actually fdatasync() technically does more than is necessary, as it
> will also flush commands that have been processed, but for which no
> reply has yet been sent - that's no bad thing.

Yes.  But without an actual barrier it's hard to be exact - and
fdatasync does the right thing by including false positives.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 05:01:25AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 01:55:14PM +0200, Wouter Verhelst wrote:
> > If that's not a write barrier, then I was using the wrong terminology
> > (and offer my apologies for the confusion).
> 
> It's not a write barrier - a write barrier was a command that ensured that
[...]

Okay, I'll update the spec to remove that term then.

> > The point still remains that "X was sent before Y" is difficult to
> > determine on the client side if X was sent over a different TCP channel
> > than Y, because a packet might be dropped (requiring a retransmit) for
> > X, and similar things. If blk-mq can deal with that, we're good and
> > nothing further needs to be done. If not, this should be evaluated by
> > someone more familiar with the internals of the kernel block subsystem
> > than me.
> 
> The important bit in all the existing protocols, and which Linux relies
> on, is that any write the Linux block layer got a completion for earlier
> needs to be flushed out to non-volatile storage when a FLUSH command is
> sent.  Anything that is still in flight does not matter.  Which for
> NBD means anything that you have already replied to needs to be flushed.
> 
> Or to say it more practically - in the NBD server you simply need to
> call fdatasync on the backing device or file whenever you get a FLUSH
> request, and it will do the right thing.

Right. So do I understand you correctly that blk-mq currently doesn't
look at multiple queues, and just assumes that if a FLUSH is sent over
any one of the queues, it applies to all queues?

If so, I'll update the spec to clarify that servers should ensure this
holds in the case of multiple connections, or not allow writes, or not
allow multiple connections.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
> Right. So do I understand you correctly that blk-mq currently doesn't
> look at multiple queues, and just assumes that if a FLUSH is sent over
> any one of the queues, it applies to all queues?

Yes.  The same is true at the protocol level for NVMe or SCSI transports
that can make use of multiple queues.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 05:20:08AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 02:01:59PM +0200, Wouter Verhelst wrote:
> > Yes. There was some discussion on that part, and we decided that setting
> > the flag doesn't hurt, but the spec also clarifies that using it on READ
> > does nothing, semantically.
> >
> > 
> > The problem is that there are clients in the wild which do set it on
> > READ, so it's just a matter of "be liberal in what you accept".
> 
> Note that FUA on READ in SCSI and NVMe does have a meaning - it
> requires you to bypass any sort of cache on the target.  I think it's an
> wrong defintion because it mandates implementation details that aren't
> observable by the initiator, but it's still the spec wording and nbd
> diverges from it.  That's not nessecarily a bad thing, but a caveat to
> look out for.

Yes. I think the kernel nbd driver should probably filter out FUA on
READ. It has no meaning in the case of nbd, and whatever expectations
the kernel may have cannot be provided for by nbd anyway.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [PATCHv3 29/41] ext4: make ext4_mpage_readpages() hugepage-aware

2016-09-15 Thread Andreas Dilger
On Sep 15, 2016, at 5:55 AM, Kirill A. Shutemov 
 wrote:
> 
> This patch modifies ext4_mpage_readpages() to deal with huge pages.
> 
> We read out 2M at once, so we have to alloc (HPAGE_PMD_NR *
> blocks_per_page) sector_t for that. I'm not entirely happy with kmalloc
> in this codepath, but don't see any other option.

If you're reading 2MB from disk (possibly from disjoint blocks with seeks
in between) I don't think that the kmalloc() is going to be the limiting
performance factor.  If you are concerned about the size of the kmalloc()
causing failures when pages are fragmented (it can be 16KB for 1KB blocks
with 4KB pages), then ext4_kvmalloc() could be used to fall back to vmalloc()
in case kmalloc() fails.  It shouldn't fail often for 16KB allocations,
but it could in theory.
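
A sketch of that fallback, adapted from the hunk quoted below (illustrative
only, not the actual patch):

        if (PageTransHuge(page)) {
                /* nr = HPAGE_PMD_NR * blocks_per_page, as in the hunk below;
                 * kmalloc with a vmalloc fallback, released with kvfree() */
                blocks = ext4_kvmalloc(sizeof(sector_t) * nr, GFP_NOFS);
                if (unlikely(!blocks))
                        return -ENOMEM;
        }
        /* ... fill blocks[] and submit bios as in the hunk below ... */
        if (blocks != blocks_on_stack)
                kvfree(blocks);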

I also notice that ext4_kvmalloc() should probably use unlikely() for
the failure case, so that the uncommon vmalloc() fallback is out-of-line
in this more important codepath.  The only other callers are during mount,
so a branch misprediction is not critical.

Cheers, Andreas

> 
> Signed-off-by: Kirill A. Shutemov 
> ---
> fs/ext4/readpage.c | 38 --
> 1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index a81b829d56de..6d7cbddceeb2 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -104,12 +104,12 @@ int ext4_mpage_readpages(struct address_space *mapping,
> 
>   struct inode *inode = mapping->host;
>   const unsigned blkbits = inode->i_blkbits;
> - const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
>   const unsigned blocksize = 1 << blkbits;
>   sector_t block_in_file;
>   sector_t last_block;
>   sector_t last_block_in_file;
> - sector_t blocks[MAX_BUF_PER_PAGE];
> + sector_t blocks_on_stack[MAX_BUF_PER_PAGE];
> + sector_t *blocks = blocks_on_stack;
>   unsigned page_block;
>   struct block_device *bdev = inode->i_sb->s_bdev;
>   int length;
> @@ -122,8 +122,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
>   map.m_flags = 0;
> 
>   for (; nr_pages; nr_pages--) {
> - int fully_mapped = 1;
> - unsigned first_hole = blocks_per_page;
> + int fully_mapped = 1, nr = nr_pages;
> + unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> + unsigned first_hole;
> 
>   prefetchw(&page->flags);
>   if (pages) {
> @@ -138,10 +139,31 @@ int ext4_mpage_readpages(struct address_space *mapping,
>   goto confused;
> 
>   block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits);
> - last_block = block_in_file + nr_pages * blocks_per_page;
> +
> + if (PageTransHuge(page)) {
> + BUILD_BUG_ON(BIO_MAX_PAGES < HPAGE_PMD_NR);
> + nr = HPAGE_PMD_NR * blocks_per_page;
> + /* XXX: need a better solution ? */
> + blocks = kmalloc(sizeof(sector_t) * nr, GFP_NOFS);
> + if (!blocks) {
> + if (pages) {
> + delete_from_page_cache(page);
> + goto next_page;
> + }
> + return -ENOMEM;
> + }
> +
> + blocks_per_page *= HPAGE_PMD_NR;
> + last_block = block_in_file + blocks_per_page;
> + } else {
> + blocks = blocks_on_stack;
> + last_block = block_in_file + nr * blocks_per_page;
> + }
> +
>   last_block_in_file = (i_size_read(inode) + blocksize - 1) >> 
> blkbits;
>   if (last_block > last_block_in_file)
>   last_block = last_block_in_file;
> + first_hole = blocks_per_page;
>   page_block = 0;
> 
>   /*
> @@ -213,6 +235,8 @@ int ext4_mpage_readpages(struct address_space *mapping,
>   }
>   }
>   if (first_hole != blocks_per_page) {
> + if (PageTransHuge(page))
> + goto confused;
>   zero_user_segment(page, first_hole << blkbits,
> PAGE_SIZE);
>   if (first_hole == 0) {
> @@ -248,7 +272,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
>   goto set_error_page;
>   }
>   bio = bio_alloc(GFP_KERNEL,
> - min_t(int, nr_pages, BIO_MAX_PAGES));
> + min_t(int, nr, BIO_MAX_PAGES));
>   if (!bio) {
>   if (ctx)
>   fscrypt_release_ctx(ctx);
> @@ -289,5 +313,7 @@ int ext4_mpage_readpages(struct address_space *mapping

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:18, Christoph Hellwig  wrote:
> 
> Yes, please do that.  A "barrier" implies draining of the queue.

Done

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 02:26:31PM +0200, Wouter Verhelst wrote:
> Yes. I think the kernel nbd driver should probably filter out FUA on
> READ. It has no meaning in the case of nbd, and whatever expectations
> the kernel may have cannot be provided for by nbd anyway.

The kernel never sets FUA on reads - I just pointed out that other
protocols have defined (although horrible) semantics for it.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:23, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
>> Right. So do I understand you correctly that blk-mq currently doesn't
>> look at multiple queues, and just assumes that if a FLUSH is sent over
>> any one of the queues, it applies to all queues?
> 
> Yes.  The same is true at the protocol level for NVMe or SCSI transports
> that can make use of multiple queues.

At an implementation level that is going to be a little difficult
for some NBD servers, e.g. ones that fork() a different process per
connection. There is in general no IPC to speak of between server
instances. Such servers would thus be unsafe with more than one
connection if FLUSH is in use.

I believe such servers include the reference server, where there is a
process per connection (albeit possibly with several threads).

Even single process servers (including mine - gonbdserver) would
require logic to pair up multiple connections to the same
device.

I suspect ALL nbd servers would require a change. So I think we should
think carefully before introducing this protocol change.

It might be easier to make the client code change work around this
issue (somehow).

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
> At an implementation level that is going to be a little difficult
> for some NBD servers, e.g. ones that fork() a different process per
> connection. There is in general no IPC to speak of between server
> instances. Such servers would thus be unsafe with more than one
> connection if FLUSH is in use.
> 
> I believe such servers include the reference server where there is
> process per connection (albeit possibly with several threads).
> 
> Even single process servers (including mine - gonbdserver) would
> require logic to pair up multiple connections to the same
> device.

Why?  If you only send the completion after your I/O syscall has returned,
you are fine if the fsync comes from a different process, no matter
whether you're using direct or buffered I/O underneath.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:36, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
>> At an implementation level that is going to be a little difficult
>> for some NBD servers, e.g. ones that fork() a different process per
>> connection. There is in general no IPC to speak of between server
>> instances. Such servers would thus be unsafe with more than one
>> connection if FLUSH is in use.
>> 
>> I believe such servers include the reference server, where there is a
>> process per connection (albeit possibly with several threads).
>> 
>> Even single process servers (including mine - gonbdserver) would
>> require logic to pair up multiple connections to the same
>> device.
> 
> Why?  If you only send the completion after your I/O syscall has returned,
> you are fine if the fsync comes from a different process, no matter
> whether you're using direct or buffered I/O underneath.

That's probably right in the case of file-based back ends that
are running on a Linux OS. But gonbdserver for instance supports
(e.g.) Ceph based backends, where each connection might be talking
to a completely separate ceph node, and there may be no cache
consistency between connections.

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> That's probably right in the case of file-based back ends that
> are running on a Linux OS. But gonbdserver for instance supports
> (e.g.) Ceph based backends, where each connection might be talking
> to a completely separate ceph node, and there may be no cache
> consistency between connections.

Yes, if you don't have a cache coherent backend you are generally
screwed with a multiqueue protocol.


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>> That's probably right in the case of file-based back ends that
>> are running on a Linux OS. But gonbdserver for instance supports
>> (e.g.) Ceph based backends, where each connection might be talking
>> to a completely separate ceph node, and there may be no cache
>> consistency between connections.
> 
> Yes, if you don't have a cache coherent backend you are generally
> screwed with a multiqueue protocol.

I wonder if the ability to support multiqueue should be visible
in the negotiation stage. That would allow the client to refuse
to select multiqueue where it isn't safe.

Wouter?

-- 
Alex Bligh






Re: [PATCH v2 2/6] dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests

2016-09-15 Thread Mike Snitzer
On Thu, Sep 15 2016 at  2:14am -0400,
Hannes Reinecke  wrote:

> On 09/14/2016 06:29 PM, Mike Snitzer wrote:
> > Otherwise blk-mq will immediately dispatch requests that are requeued
> > via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq.
> > 
> > Delayed requeue is implemented using blk_mq_delay_kick_requeue_list()
> > with a delay of 5 secs.  In the context of DM multipath (all paths down)
> > it doesn't make any sense to requeue more quickly.
> > 
> > Signed-off-by: Mike Snitzer 
> > ---
> >  drivers/md/dm-rq.c| 32 ++--
> >  include/linux/device-mapper.h |  1 +
> >  2 files changed, 19 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index 0d301d5..9eebc8d 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> [..]
> > @@ -671,7 +673,10 @@ static int map_request(struct dm_rq_target_io *tio, 
> > struct request *rq,
> > break;
> > case DM_MAPIO_REQUEUE:
> > /* The target wants to requeue the I/O */
> > -   dm_requeue_original_request(md, tio->orig);
> > +   break;
> > +   case DM_MAPIO_DELAY_REQUEUE:
> > +   /* The target wants to requeue the I/O after a delay */
> > +   dm_requeue_original_request(md, tio->orig, true);
> > break;
> > default:
> > if (r > 0) {
> Hmm? What happened here?
> Don't we need to requeue the request for DM_MAPIO_REQUEUE?

Yes, as always, the caller will perform the immediate requeue -- this is
a requirement for blk-mq but I made .request_fn do the same just for
consistency in the request-based DM code.

In the case of blk-mq it is done in terms of a BLK_MQ_RQ_QUEUE_BUSY
return from .queue_rq (which is the most immediate requeue there is
since blk-mq just puts the request back on its dispatch list for the
very next queue run).

(it is so quick that it causes excessive load when all paths are down,
hence this patchset, which only uses immediate requeue when it makes sense)
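
For reference, the delayed flavour boils down to something like this (a
sketch of the blk-mq leg only; dm_mq_requeue_with_delay() is an illustrative
name, and it assumes the blk_mq_delay_kick_requeue_list() interface
introduced earlier in this series):

        /* Put the request back and run the queue again after 'msecs'
         * (0 means kick the requeue list as soon as possible). */
        static void dm_mq_requeue_with_delay(struct request *rq,
                                             unsigned long msecs)
        {
                blk_mq_requeue_request(rq);
                blk_mq_delay_kick_requeue_list(rq->q, msecs);
        }

        /* e.g. all paths down: dm_mq_requeue_with_delay(rq, 5000); */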

Mike


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst
On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
> 
> > On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:
> > 
> > On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> >> That's probably right in the case of file-based back ends that
> >> are running on a Linux OS. But gonbdserver for instance supports
> >> (e.g.) Ceph based backends, where each connection might be talking
> >> to a completely separate ceph node, and there may be no cache
> >> consistency between connections.
> > 
> > Yes, if you don't have a cache coherent backend you are generally
> > screwed with a multiqueue protocol.
> 
> I wonder if the ability to support multiqueue should be visible
> in the negotiation stage. That would allow the client to refuse
> to select multiqueue where it isn't safe.

The server can always refuse to allow multiple connections.

I was thinking of changing the spec as follows:

diff --git a/doc/proto.md b/doc/proto.md
index 217f57e..cb099e2 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -308,6 +308,23 @@ specification, the
 [kernel 
documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
 may be useful.
 
+For performance reasons, clients MAY open multiple connections to the
+same server. To support such clients, servers SHOULD ensure that at
+least one of the following conditions hold:
+
+* Flush commands are processed for ALL connections. That is, when an
+  `NBD_CMD_WRITE` is processed on one connection, and then an
+  `NBD_CMD_FLUSH` is processed on another connection, the data of the
+  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
+  before the reply of the `NBD_CMD_FLUSH` is sent.
+* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
+  connection
+* Multiple connections are not allowed
+
+In addition, clients using multiple connections SHOULD NOT send
+`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
+the flush has not been replied to yet.
+
  Request message
 
 The request message, sent by the client, looks as follows:

The latter bit (on the client side) is because even if your backend has
no cache coherency issues, TCP does not guarantee ordering between
multiple connections. I don't know if the above is in line with what
blk-mq does, but consider the following scenario:

- A client sends two writes to the server, followed (immediately) by a
  flush, where at least the second write and the flush are not sent over
  the same connection.
- The first write is a small one, and it is handled almost immediately.
- The second write takes a little longer, so the flush is handled
  earlier than the second write
- The network packet containing the flush reply gets lost for whatever
  reason, so the client doesn't get it, and we fall into TCP
  retransmits.
- The second write finishes, and its reply header does not get lost
- After the second write reply reaches the client, the TCP retransmits
  for the flush reply are handled.

In the above scenario, the flush reply arrives on the client side after
a write reply which it did not cover; so the client will (incorrectly)
assume that the write has reached permanent storage when in fact it may
not have done so yet.

If the kernel does not care about the ordering of the two writes versus
the flush, then there is no problem. I don't know how blk-mq works in
that context, but if the above is a likely scenario, we may have to
reconsider adding blk-mq to nbd.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Eric Blake
On 09/15/2016 06:09 AM, Alex Bligh wrote:
> 
> I also wonder whether any servers that can do caching per
> connection will always share a consistent cache between 
> connections. The one I'm worried about in particular here
> is qemu-nbd - Eric Blake CC'd.
> 

I doubt that qemu-nbd would ever want to support the situation with more
than one client connection writing to the same image at the same time;
the implications of sorting out data consistency between multiple
writers are rather complex and not worth coding into qemu.  So I think
qemu would probably prefer to just prohibit the multiple writer
situation.  And while multiple readers with no writer should be fine,
I'm not even sure that multiple readers plus one writer can always be
made to appear sane: if there is no coordination between the different
connections, on an image where the writer changes AA to BA, then flushes,
then changes to BB, it is still feasible that a reader could see AB
(the pre-flush state of the first sector combined with the post-flush
change to the second sector), even though the writer never flushed that
particular content to disk.

But Paolo Bonzini (cc'd) may have more insight on qemu's NBD server and
what it supports (or forbids) in the way of multiple clients to a single
server.

> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is permitted by the spec
> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
> process commands out of order anyway, but I suspect this has
> to date been little tested.

qemu-nbd is definitely capable of serving reads and writes out-of-order
to a single connection client; but that's different than the case with
multiple connections.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Josef Bacik

On 09/15/2016 09:17 AM, Wouter Verhelst wrote:

On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:



On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:

On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:

That's probably right in the case of file-based back ends that
are running on a Linux OS. But gonbdserver for instance supports
(e.g.) Ceph based backends, where each connection might be talking
to a completely separate ceph node, and there may be no cache
consistency between connections.


Yes, if you don't have a cache coherent backend you are generally
screwed with a multiqueue protocol.


I wonder if the ability to support multiqueue should be visible
in the negotiation stage. That would allow the client to refuse
to select multiqueue where it isn't safe.


The server can always refuse to allow multiple connections.

I was thinking of changing the spec as follows:

diff --git a/doc/proto.md b/doc/proto.md
index 217f57e..cb099e2 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -308,6 +308,23 @@ specification, the
 [kernel 
documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
 may be useful.

+For performance reasons, clients MAY open multiple connections to the
+same server. To support such clients, servers SHOULD ensure that at
+least one of the following conditions hold:
+
+* Flush commands are processed for ALL connections. That is, when an
+  `NBD_CMD_WRITE` is processed on one connection, and then an
+  `NBD_CMD_FLUSH` is processed on another connection, the data of the
+  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
+  before the reply of the `NBD_CMD_FLUSH` is sent.
+* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
+  connection
+* Multiple connections are not allowed
+
+In addition, clients using multiple connections SHOULD NOT send
+`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
+the flush has not been replied to yet.
+
  Request message

 The request message, sent by the client, looks as follows:

The latter bit (on the client side) is because even if your backend has
no cache coherency issues, TCP does not guarantee ordering between
multiple connections. I don't know if the above is in line with what
blk-mq does, but consider the following scenario:

- A client sends two writes to the server, followed (immediately) by a
  flush, where at least the second write and the flush are not sent over
  the same connection.
- The first write is a small one, and it is handled almost immediately.
- The second write takes a little longer, so the flush is handled
  earlier than the second write
- The network packet containing the flush reply gets lost for whatever
  reason, so the client doesn't get it, and we fall into TCP
  retransmits.
- The second write finishes, and its reply header does not get lost
- After the second write reply reaches the client, the TCP retransmits
  for the flush reply are handled.

In the above scenario, the flush reply arrives on the client side after
a write reply which it did not cover; so the client will (incorrectly)
assume that the write has reached permanent storage when in fact it may
not have done so yet.


This isn't an NBD problem, this is an application problem.  The application must 
wait for all writes it cares about _before_ issuing a flush.  This is the same 
for normal storage as it is for NBD.  It is not NBD's responsibility to 
maintain coherency between multiple requests across connections, but simply to 
act on and respond to requests.


I think changing the specification to indicate that this is the case for 
multiple connections is a good thing, to keep NBD servers from doing weird 
things like sending different connections to the same export to different 
backing stores without some sort of synchronization.  It should definitely be 
explicitly stated somewhere that NBD does not provide any ordering guarantees 
and that is up to the application.  Thanks,


Josef


Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Paolo Bonzini


On 15/09/2016 15:34, Eric Blake wrote:
> On 09/15/2016 06:09 AM, Alex Bligh wrote:
>>
>> I also wonder whether any servers that can do caching per
>> connection will always share a consistent cache between 
>> connections. The one I'm worried about in particular here
>> is qemu-nbd - Eric Blake CC'd.
>>
> 
> I doubt that qemu-nbd would ever want to support the situation with more
> than one client connection writing to the same image at the same time;
> the implications of sorting out data consistency between multiple
> writers is rather complex and not worth coding into qemu.  So I think
> qemu would probably prefer to just prohibit the multiple writer
> situation.  And while multiple readers with no writer should be fine,
> I'm not even sure if multiple readers plus one writer can always be made
> to appear sane (if there is no coordination between the different
> connections, on an image where the writer changes AA to BA then flushes
> then changes to BB, it is still feasible that a reader could see AB
> (pre-flush state of the first sector, post-flush changes to the second
> sector, even though the writer never flushed that particular content to
> disk).
> 
> But Paolo Bonzini (cc'd) may have more insight on qemu's NBD server and
> what it supports (or forbids) in the way of multiple clients to a single
> server.

I don't think QEMU forbids multiple clients to the single server, and
guarantees consistency as long as there is no overlap between writes and
reads.  These are the same guarantees you have for multiple commands on
a single connection.

In other words, from the POV of QEMU there's no difference whether
multiple commands come from one or more connections.

Paolo

>> A more general point is that with multiple queues requests
>> may be processed in a different order even by those servers that
>> currently process the requests in strict order, or in something
>> similar to strict order. The server is permitted by the spec
>> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
>> process commands out of order anyway, but I suspect this has
>> to date been little tested.
> 
> qemu-nbd is definitely capable of serving reads and writes out-of-order
> to a single connection client; but that's different than the case with
> multiple connections.
> 





Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Keith Busch
On Wed, Sep 14, 2016 at 04:18:46PM +0200, Christoph Hellwig wrote:
> This series is the remainder of the earlier "automatic interrupt affinity for
> MSI/MSI-X capable devices" series, and make uses of the new irq-level
> interrupt / queue mapping code in blk-mq, as well as allowing the driver
> to pass in such a mask obtained from the (PCI) interrupt code.  To fully
> support this feature in drivers the final third in the PCI layer will
> be needed as well.

Thanks, this looks good and tests successfully on my hardware.

For the series:

Reviewed-by: Keith Busch 


Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Christoph Hellwig
Thanks for all the testing and the review Keith, as well as the
fixes earlier.

Jens, what do you think of the series?

Thomas has added the first 5 patches to

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=irq/for-block

so it would be great if we could pull that into a block branch and
get the rest into the block tree sooner or later.


Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Jens Axboe

On 09/15/2016 08:32 AM, Christoph Hellwig wrote:

Thanks for all the testing and the review Keith, as well as the
fixes earlier.

Jens, what do you think of the series?

Thomas has added the first 5 patches to

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=irq/for-block

so it would be great if we could pull that into a block branch and
get the rest into the block tree sooner or later.


I was going to ask about splitting it, but that looks fine, I can pull
that in.

The series looks fine to me. My only real concern is giving drivers the
flexibility to define mappings, I don't want that to evolve into drivers
(again) doing stupid things wrt mappings. As long as we keep it strictly
as a tunnel for passing mappings defined by the (previous blk-mq) core
code, then that's fine.

--
Jens Axboe



Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Christoph Hellwig
On Thu, Sep 15, 2016 at 08:34:42AM -0600, Jens Axboe wrote:
> I was going to ask about splitting it, but that looks fine, I can pull
> that in.
>
> The series looks fine to me. My only real concern is giving drivers the
> flexibility to define mappings, I don't want that to evolve into drivers
> (again) doing stupid things wrt mappings. As long as we keep it strictly
> as a tunnel for passing mappings defined by the (previous blk-mq) core
> code, then that's fine.

So my earlier versions just passed in the affinity mask and left
all the mapping in the core.  This doesn't really work anymore with
the sibling aware code so I had to add a method.  That being said there
are some drivers that might want slightly different mappings.

For example skd (if converted to blk-mq) has MSI-X vectors for its up
to four queues, but it also has MSI-X vectors for misc bookkeeping
before those, so we'd need a version of our PCI mapping that adds an
offset to the queue number when assigning the MSI-X vectors.

That being said I structured the map_queues interface so it can't do
anything crazy - it can just build up the cpu to queue mapping array
so there isn't exactly a whole lot of crazy things a driver could do.
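
To make the skd example concrete, here is a rough, untested sketch (not
part of this series) of what such a driver ->map_queues callback might
look like, assuming the mq_map layout from these patches, a
pci_irq_get_affinity() style helper from the companion PCI work, and
made-up skd names for the device structure and the number of bookkeeping
vectors:

static int skd_mq_map_queues(struct blk_mq_tag_set *set)
{
	struct skd_device *skdev = set->driver_data;	/* hypothetical */
	const int vec_offset = SKD_N_MISC_MSIX;	/* made up: bookkeeping vectors come first */
	unsigned int queue;
	int cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		/* affinity of the MSI-X vector that backs this hw queue */
		const struct cpumask *mask =
			pci_irq_get_affinity(skdev->pdev, queue + vec_offset);

		if (!mask)
			return -EINVAL;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}
	return 0;
}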



Re: blk-mq: allow passing in an external queue mapping V3

2016-09-15 Thread Jens Axboe

On 09/15/2016 08:42 AM, Christoph Hellwig wrote:

On Thu, Sep 15, 2016 at 08:34:42AM -0600, Jens Axboe wrote:

I was going to ask about splitting it, but that looks fine, I can pull
that in.

The series looks fine to me. My only real concern is giving drivers the
flexibility to define mappings, I don't want that to evolve into drivers
(again) doing stupid things wrt mappings. As long as we keep it strictly
as a tunnel for passing mappings defined by the (previous blk-mq) core
code, then that's fine.


So my earlier versions just passed in the affinity mask and left
all the mapping in the core.  This doesn't really work anymore with
the sibling aware code so I had to add a method.  That being said there
are some drivers that might want slightly different mappings.

For example skd (if converted to blk-mq) has MSI-X vectors for its up
to four queues, but it also has MSI-X vectors for misc bookkeeping
before those, so we'd need a version of our PCI mapping that adds an
offset to the queue number when assigning the MSI-X vectors.

That being said I structured the map_queues interface so it can't do
anything crazy - it can just build up the cpu to queue mapping array
so there isn't exactly a whole lot of crazy things a driver could do.


By crazy, I mean even things like thinking it knows better and defining
mappings differently just because it can. But I'm not too worried about
it, it's just something to watch for that we didn't have to care about
before, as the mappings were completely mandated by the core code.

--
Jens Axboe



Re: [PATCH 13/13] blk-mq: get rid of the cpumask in struct blk_mq_tags

2016-09-15 Thread Christoph Hellwig
> +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> + const struct cpumask *affinity_mask)
>  {
> + int queue = -1, cpu = 0;
> +
> + set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> + GFP_KERNEL, set->numa_node);
> + if (!set->mq_map)
> + return -ENOMEM;
> +
> + if (!affinity_mask)
> + return 0;   /* map all cpus to queue 0 */
> +
> + /* If cpus are offline, map them to first hctx */
> + for_each_online_cpu(cpu) {
> + if (cpumask_test_cpu(cpu, affinity_mask))
> + queue++;
> + if (queue >= 0)
> + set->mq_map[cpu] = queue;
> + }
> +
> + return 0;
>  }

I just noticed that the patch adds this unused function due to a rebase
error.  Jens, do you you just want to fix this up while applying or
should I resend?


[PATCH][V2] nbd: add multi-connection support

2016-09-15 Thread Josef Bacik
NBD can become contended on its single connection.  We have to serialize all
writes and we can only process one read response at a time.  Fix this by
allowing userspace to provide multiple connections to a single nbd device.  This,
coupled with blk-mq, drastically increases performance in multi-process cases.
Thanks,

Signed-off-by: Josef Bacik 
---
V1->V2:
-Dropped the index from nbd_cmd and just used the hctx->queue_num as HCH
 suggested
-Added the pid attribute back to the /sys/block/nbd*/ directory for the recv
 pid.
-Reworked the disconnect to simply send the command on all connections instead
 of sending a special command through the block layer.
-Fixed some of the disconnect handling to be less verbose when we specifically
 request a disconnect.

 drivers/block/nbd.c | 344 +++-
 1 file changed, 206 insertions(+), 138 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4c6dd1a..28e9b93 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -41,26 +41,36 @@
 
 #include 
 
+struct nbd_sock {
+   struct socket *sock;
+   struct mutex tx_lock;
+};
+
 #define NBD_TIMEDOUT   0
 #define NBD_DISCONNECT_REQUESTED   1
+#define NBD_DISCONNECTED   2
+#define NBD_RUNNING3
 
 struct nbd_device {
u32 flags;
unsigned long runtime_flags;
-   struct socket * sock;   /* If == NULL, device is not ready, yet */
+   struct nbd_sock **socks;
int magic;
 
struct blk_mq_tag_set tag_set;
 
-   struct mutex tx_lock;
+   struct mutex config_lock;
struct gendisk *disk;
+   int num_connections;
+   atomic_t recv_threads;
+   wait_queue_head_t recv_wq;
int blksize;
loff_t bytesize;
 
/* protects initialization and shutdown of the socket */
spinlock_t sock_lock;
struct task_struct *task_recv;
-   struct task_struct *task_send;
+   struct task_struct *task_setup;
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
struct dentry *dbg_dir;
@@ -69,7 +79,6 @@ struct nbd_device {
 
 struct nbd_cmd {
struct nbd_device *nbd;
-   struct list_head list;
 };
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
@@ -159,22 +168,20 @@ static void nbd_end_request(struct nbd_cmd *cmd)
  */
 static void sock_shutdown(struct nbd_device *nbd)
 {
-   struct socket *sock;
-
-   spin_lock(&nbd->sock_lock);
+   int i;
 
-   if (!nbd->sock) {
-   spin_unlock_irq(&nbd->sock_lock);
+   if (nbd->num_connections == 0)
+   return;
+   if (test_and_set_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
return;
-   }
-
-   sock = nbd->sock;
-   dev_warn(disk_to_dev(nbd->disk), "shutting down socket\n");
-   nbd->sock = NULL;
-   spin_unlock(&nbd->sock_lock);
 
-   kernel_sock_shutdown(sock, SHUT_RDWR);
-   sockfd_put(sock);
+   for (i = 0; i < nbd->num_connections; i++) {
+   struct nbd_sock *nsock = nbd->socks[i];
+   mutex_lock(&nsock->tx_lock);
+   kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+   mutex_unlock(&nsock->tx_lock);
+   }
+   dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
 }
 
 static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
@@ -182,35 +189,31 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct 
request *req,
 {
struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
struct nbd_device *nbd = cmd->nbd;
-   struct socket *sock = NULL;
-
-   spin_lock(&nbd->sock_lock);
 
+   dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down 
connection\n");
set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
-
-   if (nbd->sock) {
-   sock = nbd->sock;
-   get_file(sock->file);
-   }
-
-   spin_unlock(&nbd->sock_lock);
-   if (sock) {
-   kernel_sock_shutdown(sock, SHUT_RDWR);
-   sockfd_put(sock);
-   }
-
req->errors++;
-   dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down 
connection\n");
+
+   /*
+* If our disconnect packet times out then we're already holding the
+* config_lock and could deadlock here, so just set an error and return,
+* we'll handle shutting everything down later.
+*/
+   if (req->cmd_type == REQ_TYPE_DRV_PRIV)
+   return BLK_EH_HANDLED;
+   mutex_lock(&nbd->config_lock);
+   sock_shutdown(nbd);
+   mutex_unlock(&nbd->config_lock);
return BLK_EH_HANDLED;
 }
 
 /*
  *  Send or receive packet.
  */
-static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
-   int msg_flags)
+static int sock_xmit(struct nbd_device *nbd, int index, int send, void *buf,
+int size, int msg_flags)
 {
-   struct socket *sock = nbd->sock;
+   struct socket *sock = nbd->socks[index]->sock;
int resu

Re: [PATCH 13/13] blk-mq: get rid of the cpumask in struct blk_mq_tags

2016-09-15 Thread Jens Axboe

On 09/15/2016 08:44 AM, Christoph Hellwig wrote:

+static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
+   const struct cpumask *affinity_mask)
 {
+   int queue = -1, cpu = 0;
+
+   set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
+   GFP_KERNEL, set->numa_node);
+   if (!set->mq_map)
+   return -ENOMEM;
+
+   if (!affinity_mask)
+   return 0;   /* map all cpus to queue 0 */
+
+   /* If cpus are offline, map them to first hctx */
+   for_each_online_cpu(cpu) {
+   if (cpumask_test_cpu(cpu, affinity_mask))
+   queue++;
+   if (queue >= 0)
+   set->mq_map[cpu] = queue;
+   }
+
+   return 0;
 }


I just noticed that the patch adds this unused function due to a rebase
error.  Jens, do you you just want to fix this up while applying or
should I resend?


Killed it off manually, I already applied and pushed it out.

--
Jens Axboe



[GIT PULL] Block fixes for 4.8-rc

2016-09-15 Thread Jens Axboe

Hi Linus,

A set of fixes for the current series in the realm of block. Like the
previous pull request, the meat of it is fixes for the nvme fabrics/target
code. Outside of that, just one fix from Gabriel for not doing a queue
suspend if we didn't get the admin queue set up in the first place.

Note that this branch contains a patch that will cause a merge conflict
with your commit:

commit bd0b841fee49de421f615cc173ccff063303672f
Author: Linus Torvalds 
Date:   Sun Sep 11 14:41:49 2016 -0700

nvme: make NVME_RDMA depend on BLOCK

since you ended up fixing that one up yourself. Apart from that, it will
merge cleanly.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Arnd Bergmann (1):
  nvme-rdma: add back dependency on CONFIG_BLOCK

Colin Ian King (1):
  nvme-rdma: fix null pointer dereference on req->mr

Gabriel Krisman Bertazi (1):
  nvme: Don't suspend admin queue that wasn't created

Jens Axboe (1):
  Merge branch 'nvmf-4.8-rc' of 
git://git.infradead.org/nvme-fabrics into for-linus


Sagi Grimberg (1):
  nvme-rdma: add DELETING queue flag

Steve Wise (5):
  iw_cxgb4: call dev_put() on l2t allocation failure
  iw_cxgb4: block module unload until all ep resources are released
  nvme_rdma: keep a ref on the ctrl during delete/flush
  nvme-rdma: destroy nvme queue rdma resources on connect failure
  nvme-rdma: use ib_client API to detect device removal

Wenbo Wang (1):
  nvme/quirk: Add a delay before checking device ready for memblaze 
device


 drivers/infiniband/hw/cxgb4/cm.c   |   6 +-
 drivers/infiniband/hw/cxgb4/device.c   |   5 ++
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   1 +
 drivers/nvme/host/Kconfig  |   1 +
 drivers/nvme/host/pci.c|   9 ++-
 drivers/nvme/host/rdma.c   | 144 
-

 6 files changed, 91 insertions(+), 75 deletions(-)

--
Jens Axboe



Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Josef,

> On 15 Sep 2016, at 14:57, Josef Bacik  wrote:
> 
> This isn't an NBD problem, this is an application problem.  The application 
> must wait for all writes it cares about _before_ issuing a flush.  This is 
> the same as for normal storage as it is for NBD.  It is not NBD's 
> responsibility to maintain coherency between multiple requests across 
> connections, just simply to act on and respond to requests.
> 
> I think changing the specification to indicate that this is the case for 
> multiple connections is a good thing, to keep NBD servers from doing weird 
> things like sending different connections to the same export to different 
> backing stores without some sort of synchronization.  It should definitely be 
> explicitly stated somewhere that NBD does not provide any ordering guarantees 
> and that is up to the application.  Thanks,

I don't think that's correct.

The block stack issues a flush to mean (precisely) "do not reply to this until 
all preceding writes THAT HAVE BEEN REPLIED TO have been persisted to 
non-volatile storage".

The danger is with multiple connections (where apparently only one flush is 
sent - let's say down connection 1) that not all the writes that have been 
replied to on connection 2 have been persisted to non-volatile storage. Only 
the ones on connection 1 have been persisted (this is assuming the nbd server 
doesn't 'link' in some way the connections).

There's nothing the 'application' (here meaning the kernel or higher level) can 
do to mitigate this. Sure it can wait for all the replies, but this doesn't 
guarantee the writes have been persisted to non-volatile storage, precisely 
because writes may return prior to this.

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Paolo,

> On 15 Sep 2016, at 15:07, Paolo Bonzini  wrote:
> 
> I don't think QEMU forbids multiple clients to the single server, and
> guarantees consistency as long as there is no overlap between writes and
> reads.  These are the same guarantees you have for multiple commands on
> a single connection.
> 
> In other words, from the POV of QEMU there's no difference whether
> multiple commands come from one or more connections.

This isn't really about ordering, it's about cache coherency
and persisting things to disk.

What you say is correct as far as it goes in terms of ordering. However,
consider the following scenario with reads and writes of the same block
on two channels:

 Channel1                            Channel2

 R  Block read, and cached in user
    space in channel 1's cache.
    Reply sent.

                                     W  New value written; channel 2's
                                        cache updated, channel 1's
                                        cache not.

 R  Value returned from channel 1's
    cache.


In the above scenario, there is a problem if the server(s) handling the
two channels each use a read cache which is not coherent between the
two channels. An example would be a read-through cache on a server that
did fork() and shared no state between connections.

Similarly, if there is a write on channel 1 that has completed, and
the flush goes to channel 2, it may not (if state is not shared) guarantee
that the write on channel 1 (which has completed) is persisted to non-volatile
media. Obviously if the 'state' is OS block cache/buffers/whatever, it
will, but if it's (e.g.) a user-space per process write-through cache,
it won't.

I don't know whether qemu-nbd is likely to suffer from either of these.

--
Alex Bligh








Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Eric,

> I doubt that qemu-nbd would ever want to support the situation with more
> than one client connection writing to the same image at the same time;
> the implications of sorting out data consistency between multiple
> writers is rather complex and not worth coding into qemu.  So I think
> qemu would probably prefer to just prohibit the multiple writer
> situation.

Yeah, I was thinking about a 'no multiple connection' flag.

>  And while multiple readers with no writer should be fine,
> I'm not even sure if multiple readers plus one writer can always be made
> to appear sane (if there is no coordination between the different
> connections, on an image where the writer changes AA to BA then flushes
> then changes to BB, it is still feasible that a reader could see AB
> (pre-flush state of the first sector, post-flush changes to the second
> sector, even though the writer never flushed that particular content to
> disk).

Agree

--
Alex Bligh








[PATCH] lightnvm: propagate device_add() error code

2016-09-15 Thread Arnd Bergmann
device_add() may fail, and all callers are supposed to check the
return value, but one new user in lightnvm doesn't:

drivers/lightnvm/sysfs.c: In function 'nvm_sysfs_register_dev':
drivers/lightnvm/sysfs.c:184:2: error: ignoring return value of 'device_add', 
declared with attribute warn_unused_result [-Werror=unused-result]

This changes the caller to propagate any error codes, which avoids
the warning.

Signed-off-by: Arnd Bergmann 
Fixes: 38c9e260b9f9 ("lightnvm: expose device geometry through sysfs")
---
 drivers/lightnvm/lightnvm.h | 2 +-
 drivers/lightnvm/sysfs.c| 9 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/lightnvm/lightnvm.h b/drivers/lightnvm/lightnvm.h
index 93f1aacc9f02..305c181509a6 100644
--- a/drivers/lightnvm/lightnvm.h
+++ b/drivers/lightnvm/lightnvm.h
@@ -24,7 +24,7 @@
 #include 
 
 /* core -> sysfs.c */
-int nvm_sysfs_register_dev(struct nvm_dev *);
+int __must_check nvm_sysfs_register_dev(struct nvm_dev *);
 void nvm_sysfs_unregister_dev(struct nvm_dev *);
 int nvm_sysfs_register(void);
 void nvm_sysfs_unregister(void);
diff --git a/drivers/lightnvm/sysfs.c b/drivers/lightnvm/sysfs.c
index 72ad089c0269..0338c27ab95a 100644
--- a/drivers/lightnvm/sysfs.c
+++ b/drivers/lightnvm/sysfs.c
@@ -174,6 +174,8 @@ static struct device_type nvm_type = {
 
 int nvm_sysfs_register_dev(struct nvm_dev *dev)
 {
+   int ret;
+
if (!dev->parent_dev)
return 0;
 
@@ -181,11 +183,12 @@ int nvm_sysfs_register_dev(struct nvm_dev *dev)
dev_set_name(&dev->dev, "%s", dev->name);
dev->dev.type = &nvm_type;
device_initialize(&dev->dev);
-   device_add(&dev->dev);
+   ret = device_add(&dev->dev);
 
-   blk_mq_register_dev(&dev->dev, dev->q);
+   if (!ret)
+   blk_mq_register_dev(&dev->dev, dev->q);
 
-   return 0;
+   return ret;
 }
 
 void nvm_sysfs_unregister_dev(struct nvm_dev *dev)
-- 
2.9.0



Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Wouter,

> The server can always refuse to allow multiple connections.

Sure, but it would be neater to warn the client of that
at negotiation stage (it would only be one flag, e.g.
'multiple connections unsafe'). That way the connection
won't fail with a cryptic EBUSY or whatever, but will
just negotiate a single connection.
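
Purely as an illustration (the flag name and bit value below are made up,
not a proposal for an actual bit assignment), such a transmission flag
would sit alongside the existing NBD_FLAG_* constants roughly like this:

#define NBD_FLAG_HAS_FLAGS		(1 << 0)  /* existing: flags field is meaningful */
#define NBD_FLAG_SEND_FLUSH		(1 << 2)  /* existing: server supports NBD_CMD_FLUSH */
#define NBD_FLAG_MULTI_CONN_UNSAFE	(1 << 15) /* hypothetical: do not open more than one connection */

	/* hypothetical client-side use during negotiation */
	if (transmission_flags & NBD_FLAG_MULTI_CONN_UNSAFE)
		num_connections = 1;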

> I was thinking of changing the spec as follows:
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 217f57e..cb099e2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -308,6 +308,23 @@ specification, the
> [kernel 
> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> may be useful.
> 
> +For performance reasons, clients MAY open multiple connections to the
> +same server. To support such clients, servers SHOULD ensure that at
> +least one of the following conditions hold:
> +
> +* Flush commands are processed for ALL connections. That is, when an
> +  `NBD_CMD_WRITE` is processed on one connection, and then an
> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> +  before the reply of the `NBD_CMD_FLUSH` is sent.
> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> +  connection
> +* Multiple connections are not allowed
> +
> +In addition, clients using multiple connections SHOULD NOT send
> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> +the flush has not been replied to yet.
> +

I don't think that should be a mandatory behaviour. For one, it would
be reasonably easy on gonbdserver but a PITA on the reference server.
You'd need to put in IPC between each of the forked processes OR rely
on fdatasync() only - I'm not sure that would necessarily work
safely with (e.g.) the 'treefiles' / COW options.

I think better would be to say that the server MUST either

* Not support NBD_CMD_FLUSH at all
* Support NBD_CMD_FLUSH across channels (as you set out above), or
* Indicate that it does not support multiple channels.

>  Request message
> 
> The request message, sent by the client, looks as follows:
> 
> The latter bit (on the client side) is because even if your backend has
> no cache coherency issues, TCP does not guarantee ordering between
> multiple connections. I don't know if the above is in line with what
> blk-mq does, but consider the following scenario:
> 
> - A client sends two writes to the server, followed (immediately) by a
>  flush, where at least the second write and the flush are not sent over
>  the same connection.
> - The first write is a small one, and it is handled almost immediately.
> - The second write takes a little longer, so the flush is handled
>  earlier than the second write
> - The network packet containing the flush reply gets lost for whatever
>  reason, so the client doesn't get it, and we fall into TCP
>  retransmits.
> - The second write finishes, and its reply header does not get lost
> - After the second write reply reaches the client, the TCP retransmits
>  for the flush reply are handled.
> 
> In the above scenario, the flush reply arrives on the client side after
> a write reply which it did not cover; so the client will (incorrectly)
> assume that the write has reached permanent storage when in fact it may
> not have done so yet.
> 
> If the kernel does not care about the ordering of the two writes versus
> the flush, then there is no problem. I don't know how blk-mq works in
> that context, but if the above is a likely scenario, we may have to
> reconsider adding blk-mq to nbd.

Actually I think this is a problem anyway. A simpler failure case is
one where (by chance) one channel gets the writes, and one channel
gets the flushes. The flush reply is delayed beyond the replies to
unconnected writes (on the other channel) and hence the kernel thinks
replied-to writes have been persisted when they have not been.

The only way to fix that (as far as I can see) without changing flush
semantics is for the block layer to issue flush requests on each
channel when flushing on one channel. This, incidentally, would
resolve every other issue!
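
To illustrate, here is a minimal, untested sketch of what fanning a single
flush out to every connection could look like if done in the nbd driver
instead, reusing the nbd_device/nbd_sock layout from Josef's patch earlier
in this thread; nbd_send_flush() is a made-up helper standing in for
"build and transmit an NBD_CMD_FLUSH on connection i", with reply tracking
elided:

static int nbd_flush_all_connections(struct nbd_device *nbd)
{
	int i, ret = 0;

	for (i = 0; i < nbd->num_connections; i++) {
		struct nbd_sock *nsock = nbd->socks[i];

		mutex_lock(&nsock->tx_lock);
		ret = nbd_send_flush(nbd, i);	/* hypothetical helper */
		mutex_unlock(&nsock->tx_lock);
		if (ret)
			break;
	}
	return ret;
}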

-- 
Alex Bligh






[PATCH V2 03/11] block-throttle: configure bps/iops limit for cgroup in high limit

2016-09-15 Thread Shaohua Li
Each queue will have a state machine. Initially the queue is in the
LIMIT_HIGH state, which means all cgroups will be throttled according to
their high limit. After all cgroups with a high limit cross that limit,
the queue state gets upgraded to the LIMIT_MAX state.
cgroups without a high limit will use their max limit as their high limit.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 02323fb..6bae1b4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -208,12 +208,29 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
-   return tg->bps[rw][tg->td->limit_index];
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   uint64_t ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return -1;
+   ret = tg->bps[rw][tg->td->limit_index];
+   if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+   return tg->bps[rw][LIMIT_MAX];
+
+   return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
-   return tg->iops[rw][tg->td->limit_index];
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   unsigned int ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return -1;
+   ret = tg->iops[rw][tg->td->limit_index];
+   if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+   return tg->iops[rw][LIMIT_MAX];
+   return ret;
 }
 
 /**
-- 
2.8.0.rc2



[PATCH V2 07/11] blk-throttle: make throtl_slice tunable

2016-09-15 Thread Shaohua Li
throtl_slice is important for blk-throttling. A lot of things depend on
it, for example throughput measurement. It has a 100ms default value,
which is not appropriate for all disks. For example, for an SSD we might
use a smaller value to make the throughput smoother. This patch makes it
tunable.

Signed-off-by: Shaohua Li 
---
 block/blk-sysfs.c| 11 
 block/blk-throttle.c | 72 
 block/blk.h  |  3 +++
 3 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f87a7e7..610f08d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -526,6 +526,14 @@ static struct queue_sysfs_entry queue_dax_entry = {
.show = queue_dax_show,
 };
 
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_slice_entry = {
+   .attr = {.name = "throttling_slice", .mode = S_IRUGO | S_IWUSR },
+   .show = blk_throtl_slice_show,
+   .store = blk_throtl_slice_store,
+};
+#endif
+
 static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -553,6 +561,9 @@ static struct attribute *default_attrs[] = {
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_dax_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+   &throtl_slice_entry.attr,
+#endif
NULL,
 };
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2bd8333..bc94086 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
 static int throtl_quantum = 32;
 
 /* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -158,6 +159,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];
 
+   unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -589,7 +592,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
  unsigned long expires)
 {
-   unsigned long max_expire = jiffies + 8 * throtl_slice;
+   unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
if (time_after(expires, max_expire))
expires = max_expire;
mod_timer(&sq->pending_timer, expires);
@@ -649,7 +652,7 @@ static inline void 
throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;
 
-   tg->slice_end[rw] = jiffies + throtl_slice;
+   tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -661,7 +664,7 @@ static inline void throtl_start_new_slice(struct throtl_grp 
*tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
-   tg->slice_end[rw] = jiffies + throtl_slice;
+   tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
   "[%c] new slice start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -671,13 +674,13 @@ static inline void throtl_start_new_slice(struct 
throtl_grp *tg, bool rw)
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
unsigned long jiffy_end)
 {
-   tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+   tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 }
 
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
   unsigned long jiffy_end)
 {
-   tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+   tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
throtl_log(&tg->service_queue,
   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -717,19 +720,19 @@ static inline void throtl_trim_slice(struct throtl_grp 
*tg, bool rw)
 * is bad because it does not allow new slice to start.
 */
 
-   throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+   throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
time_elapsed = jiffies - tg->slice_start[rw];
 
-   nr_slices = time_elapsed / throtl_slice;
+   nr_slices = time_elapsed / tg->td->throtl_slice;
 
if (!nr_slices)
return;
-   tmp 

[PATCH V2 06/11] blk-throttle: make sure expire time isn't too big

2016-09-15 Thread Shaohua Li
A cgroup could be throttled to a limit, but when all cgroups cross the high
limit, the queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttling while other cgroups don't dispatch IO any more. In this case,
nobody can trigger the current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wake up the cgroup if other cgroups are
idle, or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Capping the cgroup's sleep time doesn't change the cgroup's
bps/iops, but it could make the cgroup wake up more frequently, which isn't
a big issue because throtl_slice * 8 is already quite long.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 76988f0..2bd8333 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -589,6 +589,9 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
  unsigned long expires)
 {
+   unsigned long max_expire = jiffies + 8 * throtl_slice;
+   if (time_after(expires, max_expire))
+   expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
   expires - jiffies, jiffies);
-- 
2.8.0.rc2



[PATCH V2 05/11] block-throttle: add downgrade logic

2016-09-15 Thread Shaohua Li
When the queue state machine is in the LIMIT_MAX state but a cgroup stays
below its high limit for some time, the queue should be downgraded to the
lower state, since that cgroup's high limit isn't being met.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 187 +++
 1 file changed, 187 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 80b50cc..76988f0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -136,6 +136,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];
 
+   unsigned long last_high_overflow_time[2];
+
+   uint64_t last_bytes_disp[2];
+   unsigned int last_io_disp[2];
+
+   unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -155,6 +162,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+   unsigned long high_upgrade_time;
+   unsigned long high_downgrade_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -894,6 +904,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct 
bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+   tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+   tg->last_io_disp[rw]++;
 
/*
 * REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1508,6 +1520,64 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
 };
 
+static unsigned long __tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+   unsigned long rtime = -1, wtime = -1;
+   if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+   tg->iops[READ][LIMIT_HIGH] != -1 ||
+   tg->bps[READ][LIMIT_MAX] != -1 ||
+   tg->iops[READ][LIMIT_MAX] != -1)
+   rtime = tg->last_high_overflow_time[READ];
+   if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+   tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+   tg->bps[WRITE][LIMIT_MAX] != -1 ||
+   tg->iops[WRITE][LIMIT_MAX] != -1)
+   wtime = tg->last_high_overflow_time[WRITE];
+   return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+   struct throtl_service_queue *parent_sq;
+   struct throtl_grp *parent = tg;
+   unsigned long ret = __tg_last_high_overflow_time(tg);
+
+   while (true) {
+   parent_sq = parent->service_queue.parent_sq;
+   parent = sq_to_tg(parent_sq);
+   if (!parent)
+   break;
+   if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
+ parent->bps[READ][LIMIT_HIGH] >=
+  tg->bps[READ][LIMIT_HIGH]) ||
+(parent->bps[READ][LIMIT_HIGH] == -1 &&
+ parent->bps[READ][LIMIT_MAX] >=
+  tg->bps[READ][LIMIT_HIGH])) &&
+   ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
+ parent->bps[WRITE][LIMIT_HIGH] >=
+  tg->bps[WRITE][LIMIT_HIGH]) ||
+(parent->bps[WRITE][LIMIT_HIGH] == -1 &&
+ parent->bps[WRITE][LIMIT_MAX] >=
+  tg->bps[WRITE][LIMIT_HIGH])) &&
+   ((parent->iops[READ][LIMIT_HIGH] != -1 &&
+ parent->iops[READ][LIMIT_HIGH] >=
+  tg->iops[READ][LIMIT_HIGH]) ||
+(parent->iops[READ][LIMIT_HIGH] == -1 &&
+ parent->iops[READ][LIMIT_MAX] >=
+  tg->iops[READ][LIMIT_HIGH])) &&
+   ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
+ parent->iops[WRITE][LIMIT_HIGH] >=
+  tg->iops[WRITE][LIMIT_HIGH]) ||
+(parent->iops[WRITE][LIMIT_HIGH] == -1 &&
+ parent->iops[WRITE][LIMIT_MAX] >=
+  tg->iops[WRITE][LIMIT_HIGH])))
+   break;
+   if (time_after(__tg_last_high_overflow_time(parent), ret))
+   ret = __tg_last_high_overflow_time(parent);
+   }
+   return ret;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1555,6 +1625,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_HIGH)
return false;
 
+   if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+   return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1578,6 +1651,7 @@ static void throtl_upgrade_state(

[PATCH V2 02/11] block-throttle: add .high interface

2016-09-15 Thread Shaohua Li
Add a high limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 139 +++
 1 file changed, 107 insertions(+), 32 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 71ecee7..02323fb 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,8 +84,9 @@ enum tg_state_flags {
 #define rb_entry_tg(node)  rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
-   LIMIT_MAX = 0,
-   LIMIT_CNT = 1,
+   LIMIT_HIGH = 0,
+   LIMIT_MAX = 1,
+   LIMIT_CNT = 2,
 };
 
 struct throtl_grp {
@@ -352,7 +353,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, 
int node)
 
RB_CLEAR_NODE(&tg->rb_node);
for (rw = READ; rw <= WRITE; rw++) {
-   for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
+   for (index = LIMIT_HIGH; index < LIMIT_CNT; index++) {
tg->bps[rw][index] = -1;
tg->iops[rw][index] = -1;
}
@@ -414,6 +415,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+   bool high_valid = false;
+
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+
+   if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+   tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+   tg->iops[READ][LIMIT_HIGH] != -1 ||
+   tg->iops[WRITE][LIMIT_HIGH] != -1)
+   high_valid = true;
+   }
+   rcu_read_unlock();
+
+   if (high_valid)
+   td->limit_valid[LIMIT_HIGH] = true;
+   else
+   td->limit_valid[LIMIT_HIGH] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+   struct throtl_grp *tg = pd_to_tg(pd);
+
+   tg->bps[READ][LIMIT_HIGH] = -1;
+   tg->bps[WRITE][LIMIT_HIGH] = -1;
+   tg->iops[READ][LIMIT_HIGH] = -1;
+   tg->iops[WRITE][LIMIT_HIGH] = -1;
+
+   blk_throtl_update_valid_limit(tg->td);
+
+   if (tg->td->limit_index == LIMIT_HIGH &&
+   !tg->td->limit_valid[LIMIT_HIGH])
+   tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1281,7 +1322,7 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 int off)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1290,36 +1331,32 @@ static u64 tg_prfill_max(struct seq_file *sf, struct 
blkg_policy_data *pd,
 
if (!dname)
return 0;
-   if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
-   tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
+   if (tg->bps[READ][off] == -1 && tg->bps[WRITE][off] == -1 &&
+   tg->iops[READ][off] == -1 && tg->iops[WRITE][off] == -1)
return 0;
 
-   if (tg->bps[READ][LIMIT_MAX] != -1)
-   snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-   tg->bps[READ][LIMIT_MAX]);
-   if (tg->bps[WRITE][LIMIT_MAX] != -1)
-   snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-   tg->bps[WRITE][LIMIT_MAX]);
-   if (tg->iops[READ][LIMIT_MAX] != -1)
-   snprintf(bufs[2], sizeof(bufs[2]), "%u",
-   tg->iops[READ][LIMIT_MAX]);
-   if (tg->iops[WRITE][LIMIT_MAX] != -1)
-   snprintf(bufs[3], sizeof(bufs[3]), "%u",
-   tg->iops[WRITE][LIMIT_MAX]);
+   if (tg->bps[READ][off] != -1)
+   snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ][off]);
+   if (tg->bps[WRITE][off] != -1)
+   snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE][off]);
+   if (tg->iops[READ][off] != -1)
+   snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ][off]);
+   if (tg->iops[WRITE][off] != -1)
+   snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE][off]);
 
seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
return 0;
 }
 
-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
 {
-   blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+   blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
  &blkcg_policy_throtl, seq_cft(sf)->private, false);
return 0;
 

[PATCH V2 08/11] blk-throttle: detect completed idle cgroup

2016-09-15 Thread Shaohua Li
A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to the higher level and all cgroups will be
throttled to their lower limit, so we waste bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
the bandwidth.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bc94086..e4713b1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -144,6 +144,8 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
+   unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -438,11 +440,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+   struct throtl_grp *tg = pd_to_tg(pd);
/*
 * We don't want new groups to escape the limits of its ancestors.
 * Update has_rules[] after a new group is brought online.
 */
-   tg_update_has_rules(pd_to_tg(pd));
+   tg_update_has_rules(tg);
+   tg->last_dispatch_time[READ] = jiffies;
+   tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1604,6 +1609,12 @@ static bool throtl_upgrade_check_one(struct throtl_grp 
*tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+   if (time_after_eq(jiffies,
+tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+   time_after_eq(jiffies,
+tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+   return true;
return false;
 }
 
@@ -1684,6 +1695,9 @@ static bool throtl_downgrade_check_one(struct throtl_grp 
*tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;
 
+   if (time_after_eq(now, tg->last_dispatch_time[READ] + td->throtl_slice) 
&&
+   time_after_eq(now, tg->last_dispatch_time[WRITE] + 
td->throtl_slice))
+   return false;
/*
 * If cgroup is below high limit, consider downgrade and throttle other
 * cgroups
@@ -1802,6 +1816,7 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
 
 again:
while (true) {
+   tg->last_dispatch_time[rw] = jiffies;
if (tg->last_high_overflow_time[rw] == 0)
tg->last_high_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
-- 
2.8.0.rc2



[PATCH V2 09/11] block-throttle: make bandwidth change smooth

2016-09-15 Thread Shaohua Li
When all cgroups reach their high limit, cgroups can dispatch more IO. This
could let some cgroups dispatch more IO while others can't, and some
cgroups could even dispatch less IO than their high limit. For example, cg1
has a high limit of 10MB/s, cg2 a high limit of 80MB/s, and assume the
disk's maximum bandwidth is 120MB/s for the workload. Their bps could look
something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their high limit, so they can dispatch more IO
later. Then cg1 dispatches more IO and cg2 has no room to dispatch enough
IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches
less IO than its high limit of 80MB/s, we downgrade the queue from
LIMIT_MAX to LIMIT_HIGH, and all cgroups are throttled to their high limit
again (T3). cg2 will have bandwidth below its high limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes the cgroup bandwidth change smooth. After the disk upgrades
from LIMIT_HIGH to LIMIT_MAX, we don't allow cgroups to use all bandwidth
up to their max limit immediately. Their bandwidth limit will be increased
gradually to avoid the above situation. So the above example becomes
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 10/80

In this way cgroup bandwidth will be above the high limit the majority of
the time. This still doesn't fully utilize disk bandwidth, but that's
something we pay for sharing.

Note this doesn't completely avoid a cgroup running under its high limit.
The best way to guarantee a cgroup doesn't run under its limit is to set a
max limit. For example, if we set cg1's max limit to 40, cg2 will never
run under its high limit.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 44 
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e4713b1..d956831 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,8 @@ struct throtl_data
 
unsigned long high_upgrade_time;
unsigned long high_downgrade_time;
+
+   unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -224,12 +226,28 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
uint64_t ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
-   ret = tg->bps[rw][tg->td->limit_index];
-   if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+   td = tg->td;
+   ret = tg->bps[rw][td->limit_index];
+   if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_HIGH] != -1) {
+   uint64_t increase;
+
+   if (td->scale < 4096 && time_after_eq(jiffies,
+   td->high_upgrade_time + td->scale * td->throtl_slice)) {
+   uint64_t time = jiffies - td->high_upgrade_time;
+
+   do_div(time, td->throtl_slice);
+   td->scale = time;
+   }
+   increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * td->scale;
+   ret = min(tg->bps[rw][LIMIT_MAX],
+   tg->bps[rw][LIMIT_HIGH] + increase);
+   }
+   if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->bps[rw][LIMIT_MAX];
 
return ret;
@@ -238,12 +256,29 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int 
rw)
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
unsigned int ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return -1;
-   ret = tg->iops[rw][tg->td->limit_index];
-   if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+   td = tg->td;
+   ret = tg->iops[rw][td->limit_index];
+   if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_HIGH] != -1) {
+   uint64_t increase;
+
+   if (td->scale < 4096 && time_after_eq(jiffies,
+   td->high_upgrade_time + td->scale * td->throtl_slice)) {
+   uint64_t time = jiffies - td->high_upgrade_time;
+
+   do_div(time, td->throtl_slice);
+   td->scale = time;
+   }
+
+   increase = (tg->iops[rw][LIMIT_HIGH] >> 1) * td->scale;
+   ret = min(tg->iops[rw][LIMIT_MAX],
+   tg->iops[rw][LIMIT_HIGH] + (unsigned int)increase);
+   }
+   if (ret == -1 && td->limit_index == LIMIT_HIGH)
return tg->iops[rw][LIMIT_MAX];
return ret;
 }
@@ -1669,6 +1704,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
td->limit_index = LIMIT_MAX;
td->high_upgrade_time =

[PATCH V2 01/11] block-throttle: prepare support multiple limits

2016-09-15 Thread Shaohua Li
We are going to support high/max limits; each cgroup will have two limits
after that. This patch prepares for the multiple-limits change.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 109 ---
 1 file changed, 68 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f1aba26..71ecee7 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)  rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+   LIMIT_MAX = 0,
+   LIMIT_CNT = 1,
+};
+
 struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];
 
/* bytes per second rate limits */
-   uint64_t bps[2];
+   uint64_t bps[2][LIMIT_CNT];
 
/* IOPS limits */
-   unsigned int iops[2];
+   unsigned int iops[2][LIMIT_CNT];
 
/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data
 
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+   unsigned int limit_index;
+   bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+   return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+   return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
struct throtl_grp *tg;
-   int rw;
+   int rw, index;
 
tg = kzalloc_node(sizeof(*tg), gfp, node);
if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}
 
RB_CLEAR_NODE(&tg->rb_node);
-   tg->bps[READ] = -1;
-   tg->bps[WRITE] = -1;
-   tg->iops[READ] = -1;
-   tg->iops[WRITE] = -1;
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
+   tg->bps[rw][index] = -1;
+   tg->iops[rw][index] = -1;
+   }
+   }
 
return &tg->pd;
 }
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+   struct throtl_data *td = tg->td;
int rw;
 
for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-   (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+   (td->limit_valid[td->limit_index] &&
+(tg_bps_limit(tg, rw) != -1 ||
+ tg_iops_limit(tg, rw) != -1));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
if (!nr_slices)
return;
-   tmp = tg->bps[rw] * throtl_slice * nr_slices;
+   tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;
 
-   io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+   io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices)/HZ;
 
if (!bytes_trim && !io_trim)
return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 * have been trimmed.
 */
 
-   tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+   tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
 
if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}
 
/* Calc approx time to dispatch */
-   jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+   jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg_iops_limit(tg, rw) + 1;
 
if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-   tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+   tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;
 
@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limi

[PATCH V2 00/11] block-throttle: add .high limit

2016-09-15 Thread Shaohua Li
Hi,

The background is that we don't have an I/O scheduler for blk-mq yet, so we
can't prioritize processes/cgroups. This patch set tries to add basic
arbitration between cgroups via blk-throttle. It adds a new io.high limit to
blk-throttle, for cgroup2 only.

io.max is hard-limit throttling: cgroups with a max limit never dispatch
more IO than their max limit. io.high, on the other hand, is best-effort
throttling: cgroups with a high limit can run above it at appropriate times.
Specifically, if all cgroups reach their high limit, all cgroups may run
above it. If any cgroup runs under its high limit, all other cgroups are
held to their high limit.

An example usage: we have a high-priority cgroup with a large high limit and
a low-priority cgroup with a small high limit. If the high-priority cgroup
isn't running, the low-priority one can run above its high limit, so we don't
waste bandwidth. When the high-priority cgroup runs and is below its high
limit, the low-priority cgroup is held under its high limit, which protects
the high-priority cgroup's access to resources. If both cgroups reach their
high limit, both can run above it (eg, fully utilize disk bandwidth). None of
this can be done with the io.max limit alone.
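
To make the rule concrete, here is a minimal user-space C sketch of the
decision (illustration only; the struct, its fields and pick_state() are
made-up names, not the kernel implementation):

enum queue_limit_state { STATE_HIGH, STATE_MAX };

struct cg_sample {
	unsigned long long high_limit;	/* io.high in bytes/s, -1ULL if unset */
	unsigned long long current_bps;	/* observed dispatch rate */
};

/* Decide which state the whole queue should run in. */
static enum queue_limit_state pick_state(const struct cg_sample *cgs, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* cgroups without io.high never hold the queue in HIGH state */
		if (cgs[i].high_limit == (unsigned long long)-1)
			continue;
		/* somebody still runs below its high limit: keep throttling */
		if (cgs[i].current_bps < cgs[i].high_limit)
			return STATE_HIGH;
	}
	/* everybody reached (or has no) high limit: all may exceed io.high */
	return STATE_MAX;
}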

The implementation is simple. The disk queue has two states, LIMIT_HIGH and
LIMIT_MAX. In each state we throttle cgroups according to that state's
limit: the io.high limit in the LIMIT_HIGH state and the io.max limit in the
LIMIT_MAX state. The disk state is upgraded/downgraded between LIMIT_HIGH
and LIMIT_MAX according to the rule above. The initial disk state is
LIMIT_MAX, and if no cgroup sets io.high, the disk state remains LIMIT_MAX.
Users with only io.max set will see no change in behavior with these
patches.
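
In code terms, each cgroup stores one value per state and the queue-wide
limit index selects which one applies, with a fall-back to io.max for
cgroups that don't set io.high (see tg_bps_limit()/tg_iops_limit() in the
patches). A rough sketch, using made-up names so it isn't confused with the
real enums:

enum { IDX_HIGH, IDX_MAX, IDX_CNT };

struct group_limits {
	unsigned long long bps[IDX_CNT];	/* -1ULL means "not configured" */
};

static unsigned long long effective_bps(const struct group_limits *g,
					int limit_index)
{
	unsigned long long ret = g->bps[limit_index];

	/* a cgroup without io.high falls back to io.max in the HIGH state */
	if (ret == (unsigned long long)-1 && limit_index == IDX_HIGH)
		return g->bps[IDX_MAX];
	return ret;
}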

The first 8 patches implement the basic framework: they add the interface
and handle the upgrade and downgrade logic. Patch 8 detects the special case
of a cgroup that is completely idle; in this case we ignore the cgroup's
limit. Patches 9-11 add more heuristics.

The basic framework has 2 major issues.
1. Fluctuation. When the state is upgraded from LIMIT_HIGH to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in unexpected ways.
For example, one cgroup's bandwidth may drop below its io.high limit very
soon after an upgrade. Patch 9 has more details about this issue.
2. Idle cgroups. A cgroup with an io.high limit doesn't always dispatch
enough IO. Under the upgrade rule above, the disk then remains in the
LIMIT_HIGH state and all other cgroups can't dispatch IO above their high
limit, which wastes disk bandwidth. Patch 10 has more details about this
issue.

For issue 1, we make cgroup bandwidth increase smoothly after an upgrade,
which reduces the chance that a cgroup's bandwidth rapidly drops back under
its high limit. The smoothness means we may waste some bandwidth during the
transition, but that is the price of sharing.
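
As an illustration of the ramp (the half-of-high step and the scale cap of
4096 come from the tg_bps_limit() hunk quoted earlier in this thread; the
helper name and the direct slices_since_upgrade parameter are
simplifications for illustration):

/*
 * Effective limit while ramping from io.high towards io.max after an
 * upgrade.  Example: high = 10 MB/s, max = 100 MB/s; three slices after
 * the upgrade the limit is min(100, 10 + (10/2) * 3) = 25 MB/s.
 */
static unsigned long long ramped_limit(unsigned long long high,
				       unsigned long long max,
				       unsigned long slices_since_upgrade)
{
	unsigned long long scale = slices_since_upgrade;
	unsigned long long increase;

	if (scale > 4096)
		scale = 4096;
	increase = (high >> 1) * scale;
	if (high + increase > max)
		return max;
	return high + increase;
}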

Issue 2 is very hard to solve. To be honest, I don't have a good solution
yet. Patch 10 uses the 'think time check' idea borrowed from CFQ to detect
idle cgroups. It's not perfect; eg, it doesn't work well for high-IO-depth
workloads. But it's the best I have tried so far and it works well in
practice. This definitely needs more tuning.
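
For reference, a rough sketch of the think-time idea: a cgroup whose average
gap between submitted bios grows large is treated as idle, and its high
limit is ignored when deciding upgrades. The exponential averaging and the
threshold below are assumptions for illustration, not the exact bookkeeping
in patch 10.

struct idle_track {
	unsigned long last_bio_time;	/* timestamp of the last submission */
	unsigned long avg_think_time;	/* exponentially weighted average gap */
};

static void track_bio(struct idle_track *it, unsigned long now)
{
	unsigned long gap = now - it->last_bio_time;

	/* 7/8 old + 1/8 new: a cheap EWMA of the inter-bio gap */
	it->avg_think_time = (it->avg_think_time * 7 + gap) / 8;
	it->last_bio_time = now;
}

static int cgroup_seems_idle(const struct idle_track *it,
			     unsigned long idle_threshold)
{
	return it->avg_think_time > idle_threshold;
}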

Please review, test, and consider merging.

Thanks,
Shaohua

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have
  for prioritizing cgroups.
- Remove the 'trial' logic, which creates too much fluctuation.
- Add new idle cgroup detection.
- Other bug fixes and improvements.

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2

--
Shaohua Li (11):
  block-throttle: prepare support multiple limits
  block-throttle: add .high interface
  block-throttle: configure bps/iops limit for cgroup in high limit
  block-throttle: add upgrade logic for LIMIT_HIGH state
  block-throttle: add downgrade logic
  blk-throttle: make sure expire time isn't too big
  blk-throttle: make throtl_slice tunable
  blk-throttle: detect completed idle cgroup
  block-throttle: make bandwidth change smooth
  block-throttle: add a simple idle detection
  blk-throttle: ignore idle cgroup limit

 block/bio.c   |   2 +
 block/blk-sysfs.c |  18 ++
 block/blk-throttle.c  | 704 +-
 block/blk.h   |   9 +
 include/linux/blk_types.h |   1 +
 5 files changed, 669 insertions(+), 65 deletions(-)

-- 
2.8.0.rc2

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

