Hi Shaohua,
[auto build test ERROR on block/for-next]
[also build test ERROR on v4.8-rc6 next-20160915]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system]
[Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for
convenience) to rec
On 15/09/2016 17:23, Alex Bligh wrote:
> Paolo,
>
>> On 15 Sep 2016, at 15:07, Paolo Bonzini wrote:
>>
>> I don't think QEMU forbids multiple clients to the single server, and
>> guarantees consistency as long as there is no overlap between writes and
>> reads. These are
[Resent without html]
Dear colleagues,
The attached patch keeps a count of block device I/O errors -- any
error event that generates a klog message in blk_update_request -- and
reports the count as a 12th field in /sys/block/<dev>/stat. That
allows, e.g., monitoring systems to detect and count block
On 09/15/2016 11:27 AM, Wouter Verhelst wrote:
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>>
>>> The server can always refuse to allow multiple connections.
>>
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one
Wouter,
> On 15 Sep 2016, at 17:27, Wouter Verhelst wrote:
>
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>>
>>> The server can always refuse to allow multiple connections.
>>
>> Sure, but it would be neater to warn the client of that at negotiation
On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
> Wouter,
>
> > The server can always refuse to allow multiple connections.
>
> Sure, but it would be neater to warn the client of that at negotiation
> stage (it would only be one flag, e.g. 'multiple connections
> unsafe').
I
Add high limit for cgroup and corresponding cgroup interface.
Signed-off-by: Shaohua Li
---
block/blk-throttle.c | 139 +++
1 file changed, 107 insertions(+), 32 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
A cgroup gets assigned a high limit, but the cgroup might never dispatch
enough IO to cross the high limit. In such a case, the queue state machine
will remain in the LIMIT_HIGH state and all other cgroups will be throttled
according to the high limit. This is unfair to the other cgroups. We should
treat the
When the queue is in the LIMIT_HIGH state and all cgroups with a high limit
cross the bps/iops limitation, we will upgrade the queue's state to
LIMIT_MAX
For a cgroup hierarchy, there are two cases. The children have a lower high
limit than the parent, in which case the parent's high limit is meaningless.
If the children's bps/iops cross the high
The last patch introduced a way to detect an idle cgroup. We use it to make
the upgrade/downgrade decision.
Signed-off-by: Shaohua Li
---
block/blk-throttle.c | 30 ++
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/block/blk-throttle.c
A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to a higher level and all cgroups will be
throttled to their lower limit, so we waste bandwidth. Detecting an idle
cgroup is
When cgroups all reach the high limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO while others don't, and some
cgroups could even dispatch less IO than their high limit. For example, cg1
has a high limit of 10MB/s, cg2 a limit of 80MB/s, and assume the disk's
maximum bandwidth is 120MB/s for the
Hi,
The background is that we don't have an I/O scheduler for blk-mq yet, so we
can't prioritize processes/cgroups. This patch set tries to add basic
arbitration between cgroups with blk-throttle. It adds a new limit, io.high,
to blk-throttle. It's only for cgroup2.
io.max is a hard throttling limit.
throtl_slice is important for blk-throttling. A lot of things depend on
it, for example throughput measurement. It has a default value of 100ms,
which is not appropriate for all disks. For example, for SSDs we might
use a smaller value to make the throughput smoother. This patch makes it
tunable.
A cgroup could be throttled to a limit, but when all cgroups cross the high
limit, the queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttling while the other cgroups don't dispatch IO any more. In this case,
nobody can trigger
Hi Linus,
A set of fixes for the current series in the realm of block. Like the
previous pull request, the meat of it are fixes for the nvme fabrics/target
code. Outside of that, just one fix from Gabriel for not doing a queue
suspend if we didn't get the admin queue setup in the first place.
> +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> + const struct cpumask *affinity_mask)
> {
> + int queue = -1, cpu = 0;
> +
> + set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> + GFP_KERNEL, set->numa_node);
> + if
On Thu, Sep 15, 2016 at 08:34:42AM -0600, Jens Axboe wrote:
> I was going to ask about splitting it, but that looks fine, I can pull
> that in.
>
> The series looks fine to me. My only real concern is giving drivers the
> flexibility to define mappings, I don't want that to evolve into drivers
>
Thanks for all the testing and the review Keith, as well as the
fixes earlier.
Jens, what do you think of the series?
Thomas has added the first 5 patches to
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=irq/for-block
so it would be great if we could pull that into a block
On Wed, Sep 14, 2016 at 04:18:46PM +0200, Christoph Hellwig wrote:
> This series is the remainder of the earlier "automatic interrupt affinity for
> MSI/MSI-X capable devices" series, and make uses of the new irq-level
> interrupt / queue mapping code in blk-mq, as well as allowing the driver
> to
On 09/15/2016 09:17 AM, Wouter Verhelst wrote:
On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
On 15 Sep 2016, at 13:41, Christoph Hellwig wrote:
On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
That's probably right in the case of file-based back
On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
>
> > On 15 Sep 2016, at 13:41, Christoph Hellwig wrote:
> >
> > On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> >> That's probably right in the case of file-based back ends that
> >> are running on a
On Thu, Sep 15 2016 at 2:14am -0400,
Hannes Reinecke wrote:
> On 09/14/2016 06:29 PM, Mike Snitzer wrote:
> > Otherwise blk-mq will immediately dispatch requests that are requeued
> > via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq.
> >
> > Delayed requeue is
> On 15 Sep 2016, at 13:41, Christoph Hellwig wrote:
>
> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>> That's probably right in the case of file-based back ends that
>> are running on a Linux OS. But gonbdserver for instance supports
>> (e.g.) Ceph based
On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> That's probably right in the case of file-based back ends that
> are running on a Linux OS. But gonbdserver for instance supports
> (e.g.) Ceph based backends, where each connection might be talking
> to a completely separate ceph node,
> On 15 Sep 2016, at 13:36, Christoph Hellwig wrote:
>
> On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
>> At an implementation level that is going to be a little difficult
>> for some NBD servers, e.g. ones that fork() a different process per
>> connection.
> On 15 Sep 2016, at 13:23, Christoph Hellwig wrote:
>
> On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
>> Right. So do I understand you correctly that blk-mq currently doesn't
>> look at multiple queues, and just assumes that if a FLUSH is sent over
>> any
> On 15 Sep 2016, at 13:18, Christoph Hellwig wrote:
>
> Yes, please do that. A "barrier" implies draining of the queue.
Done
--
Alex Bligh
On Sep 15, 2016, at 5:55 AM, Kirill A. Shutemov
wrote:
>
> This patch modifies ext4_mpage_readpages() to deal with huge pages.
>
> We read out 2M at once, so we have to alloc (HPAGE_PMD_NR *
> blocks_per_page) sector_t for that. I'm not entirely happy with
On Thu, Sep 15, 2016 at 05:20:08AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 02:01:59PM +0200, Wouter Verhelst wrote:
> > Yes. There was some discussion on that part, and we decided that setting
> > the flag doesn't hurt, but the spec also clarifies that using it on READ
> > does
From: Matthew Wilcox
This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries). These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused. The caller should then call
From: Matthew Wilcox
This new function allows for the replacement of many smaller entries in
the radix tree with one larger multiorder entry. From the point of view
of an RCU walker, they may see a mixture of the smaller entries and the
large entry during the same walk,
We would need to use multi-order radix-tree entries for ext4 and other
filesystems to have a coherent view of tags (dirty/towrite) in the tree.
This patch converts the huge tmpfs implementation to multi-order entries, so
we will be able to use the same code path for all filesystems.
Signed-off-by:
These flags are in use for filesystems with backing storage: PG_error,
PG_writeback and PG_readahead.
Signed-off-by: Kirill A. Shutemov
---
include/linux/page-flags.h | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git
Let's add FileHugePages and FilePmdMapped fields into meminfo and smaps.
They indicate how many times we allocate and map file THPs.
Signed-off-by: Kirill A. Shutemov
---
drivers/base/node.c| 6 ++
fs/proc/meminfo.c | 4
fs/proc/task_mmu.c
We write back a whole huge page at a time.
Signed-off-by: Kirill A. Shutemov
---
mm/filemap.c | 5 +
1 file changed, 5 insertions(+)
diff --git a/mm/filemap.c b/mm/filemap.c
index 05b42d3e5ed8..53da93156e60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -372,9
As with shmem_undo_range(), truncate_inode_pages_range() removes huge
pages if they are fully within the range.
A partial truncate of a huge page zeroes out that part of the THP.
Unlike with shmem, this doesn't prevent us from having holes in the middle
of a huge page: we can still skip writeback of untouched buffers.
With
With
For huge pages 'stop' must be within HPAGE_PMD_SIZE.
Let's use hpage_size() in the BUG_ON().
We also need to change how we calculate lblk for cluster deallocation.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/inode.c | 5 +++--
1 file changed, 3 insertions(+),
As the function handles zeroing a range only within one block, the
required changes are trivial: just remove the assumption on page size.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/inode.c | 7 +--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git
It simply matches changes to __block_write_begin_int().
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/inode.c | 24
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index
Modify mpage_map_and_submit_buffers() and mpage_release_unused_pages()
to deal with huge pages.
Mostly the result of trial and error. A critical review would be appreciated.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/inode.c | 61
This patch modifies ext4_mpage_readpages() to deal with huge pages.
We read out 2M at once, so we have to alloc (HPAGE_PMD_NR *
blocks_per_page) sector_t for that. I'm not entirely happy with kmalloc
in this codepath, but don't see any other option.
Signed-off-by: Kirill A. Shutemov
We want the page to be isolated from the rest of the system before
splitting it. We rely on the page count being 2 for file pages to make
sure nobody uses the page: one pin for the caller, one for the radix-tree.
Filesystems with backing storage can have the page count increased if the
page has buffers.
Let's try to free
It's more or less straightforward.
Most changes are around getting the offset/len within the page right and
zeroing out the desired part of the page.
Signed-off-by: Kirill A. Shutemov
---
fs/buffer.c | 53 +++--
1 file
Most of the work happens on the head page. Only when we need to copy data
to userspace do we find the relevant subpage.
We are still limited to PAGE_SIZE per iteration. Lifting this limitation
would require some more work.
Signed-off-by: Kirill A. Shutemov
---
mm/filemap.c
On Thu, Sep 15, 2016 at 04:38:07AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
> > A while back, we spent quite some time defining the semantics of the
> > various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> > write
On Thu, Sep 15, 2016 at 01:55:14PM +0200, Wouter Verhelst wrote:
> Maybe I'm not using the correct terminology here. The point is that
> after a FLUSH, the server asserts that all write commands *for which a
> reply has already been sent to the client* will also have reached
> permanent storage.
This patch adds basic functionality to put a huge page into the page cache.
At the moment we only put huge pages into the radix-tree if the range covered
by the huge page is empty.
We ignore shadow entries for now; we just remove them from the tree before
inserting the huge page.
Later we can add logic to
The same four values as in the tmpfs case.
The encryption code is not yet ready to handle huge pages, so we disable
huge page support if the inode has EXT4_INODE_ENCRYPT.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/ext4.h | 5 +
fs/ext4/inode.c | 26
On Thu, Sep 15, 2016 at 04:52:17AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
> > Essentially NBD does supports FLUSH/FUA like this:
> >
> > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> >
> > IE supports the same
ext4_find_unwritten_pgoff() needs a few tweaks to work with huge pages.
Mostly trivial page_mapping()/page_to_pgoff() changes and an adjustment to
how we find the relevant block.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/file.c | 18 ++
1 file changed, 14
Adjust the check on whether part of the page is beyond the file size, and
apply compound_head() and page_mapping() where appropriate.
Signed-off-by: Kirill A. Shutemov
---
fs/buffer.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c
For filesystems that want to be write-notified (have mkwrite), we will
encounter write-protection faults for huge PMDs in shared mappings.
The easiest way to handle them is to clear the PMD and let it refault as
writable.
Signed-off-by: Kirill A. Shutemov
---
Call ext4_da_should_update_i_disksize() for the head page with the offset
relative to the head page.
Signed-off-by: Kirill A. Shutemov
---
fs/ext4/inode.c | 7 +++
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index
Slab pages can be compound, but we shouldn't treat them as THPs for the
purposes of the hpage_* helpers; otherwise it would lead to confusing
results. For instance, ext4 uses slab pages for journal pages and we
shouldn't confuse them with THPs. The easiest way is to exclude them in the
hpage_* helpers.
The write path allocates pages using pagecache_get_page(). We should be
able to allocate huge pages there, if it's allowed. As usual, fall back to
small pages on failure.
Signed-off-by: Kirill A. Shutemov
---
mm/filemap.c | 18 --
1 file changed, 16
From: Matthew Wilcox
Calculate how many nodes we need to allocate to split an old_order entry
into multiple entries, each of size new_order. The test suite checks that
we allocated exactly the right number of nodes; neither too many (checked
by rtp->nr == 0), nor too few
This reverts commit 356e1c23292a4f63cfdf1daf0e0ddada51f32de8.
After conversion of huge tmpfs to multi-order entries, we don't need
this anymore.
Signed-off-by: Kirill A. Shutemov
---
include/linux/radix-tree.h | 1 -
lib/radix-tree.c | 74
From: Matthew Wilcox
radix_tree_replace_clear_tags() can be called with NULL as the replacement
value; in this case we need to delete sibling entries which point to
the slot.
Signed-off-by: Matthew Wilcox
Signed-off-by: Kirill A. Shutemov
For huge pages we need to unmap the whole range covered by the huge page.
Signed-off-by: Kirill A. Shutemov
---
mm/truncate.c | 27 +++
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index
On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
> Sure, it's at:
>
> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
>
> and that link takes you to the specific section.
>
> The treatment of FLUSH and FUA is meant to mirror exactly the
> linux
> On 15 Sep 2016, at 12:40, Christoph Hellwig wrote:
>
> On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
>> Yes, and that is why I was asking about this. If the write barriers
>> are expected to be shared across connections, we have a problem. If,
>>
Christoph,
> On 15 Sep 2016, at 12:38, Christoph Hellwig wrote:
>
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
>> A while back, we spent quite some time defining the semantics of the
>> various commands in the face of the NBD_CMD_FLUSH and
On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
> Yes, and that is why I was asking about this. If the write barriers
> are expected to be shared across connections, we have a problem. If,
> however, they are not, then it doesn't matter that the commands may be
> processed out of
On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is
On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> Wouter, Josef, (& Eric)
>
> > On 15 Sep 2016, at 11:49, Wouter Verhelst wrote:
> >
> > Hi,
> >
> > On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> >> I see some practical problems with this:
> >
Hi,
On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> I see some practical problems with this:
[...]
One more that I didn't think about earlier:
A while back, we spent quite some time defining the semantics of the
various commands in the face of the NBD_CMD_FLUSH and
Let's try reporting this again to new email addresses...
Btw, belated thanks for creating a linux-block mailing list Jens. :)
regards,
dan carpenter
On Thu, Aug 04, 2016 at 05:02:06PM +0300, Dan Carpenter wrote:
> Hello Matthew Wilcox,
>
> The patch 47a191fd38eb: "fs/block_dev.c: add
On 09/14/2016 06:29 PM, Mike Snitzer wrote:
> Make it possible for a request-based target to kick the DM device's
> blk-mq request_queue's requeue_list.
>
> Signed-off-by: Mike Snitzer
> ---
> drivers/md/dm-rq.c | 17 +
> drivers/md/dm-rq.h | 2 ++
> 2 files
On 09/14/2016 06:29 PM, Mike Snitzer wrote:
> Otherwise blk-mq will immediately dispatch requests that are requeued
> via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq.
>
> Delayed requeue is implemented using blk_mq_delay_kick_requeue_list()
> with a delay of 5 secs. In the context of