Re: Big I/O requests are split into small ones due to unaligned ext4 partition boundary?

2016-12-15 Thread Ming Lei
On Thu, Dec 15, 2016 at 9:53 PM, Dexuan Cui  wrote:
>> From: Ming Lei [mailto:tom.leim...@gmail.com]
>> Sent: Thursday, December 15, 2016 20:43
>>
>> On Thu, Dec 15, 2016 at 7:47 PM, Dexuan Cui  wrote:
>> > Hi, when I run "mkfs.ext4 /dev/sdc2" in a Linux virtual machine on Hyper-V,
>> > where a disk IOPS=500 limit is applied by me [0], the command takes much
>> > more time if the ext4 partition boundary is not properly aligned:
>> >
>> > Example 1 [1]: it takes ~7 minutes with average wMB/s = 0.3   (slow)
>> > Example 2 [2]: it takes ~3.5 minutes with average wMB/s = 0.6 (slow)
>> > Example 3 [3]: it takes ~0.5 minute with average wMB/s = 4 (expected)
>> >
>> > strace shows the mkfs.ext4 program calls seek()/write() a lot and most of
>> > the writes use 32KB buffers (this should be big enough), and the program
>> > only invokes fsync() once, after it issues all the writes -- the fsync()
>> > takes >99% of the time.
>> >
>> > By logging SCSI commands, the SCSI Write(10) command is used here for the
>> > userspace 32KB write:
>> > in example 1, *each* command writes 1 or 2 sectors only (1 sector = 512 bytes);
>> > in example 2, *each* command writes 2 or 4 sectors only;
>> > in example 3, each command writes 1024 sectors.
>> >
>> > It looks like the kernel block I/O layer can somehow split big user-space
>> > buffers into really small write requests (1, 2, and 4 sectors)?
>> > This looks really strange to me.
>> >
>> > Note: in my test, this strange issue happens with the 4.4 and mainline 4.9
>> > kernels, but the stable 3.18.45 kernel doesn't have the issue, i.e. all 3 of
>> > the above test examples can finish in ~0.5 minute.
>> >
>> > Any comment?
>>
>> I remember that we discussed this kind of issue, please see the discussion[1]
>> and check if the patch[2] can fix your issue.
>>
>> [1] http://marc.info/?t=14580552552&r=1&w=2
>> [2] http://marc.info/?l=linux-kernel&m=145934325429152&w=2
>>
>> Ming
>
> Thank you very much, Ming! The patch fixes my issue!
> It looks like your patch was not merged upstream somehow.
> Would you please submit the patch again?

Yeah, will do, and thanks for your test!



Thanks,
Ming Lei
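
As an aside, a quick userspace way to check whether a partition is misaligned
(the situation Dexuan describes) is to ask the kernel for the device's block
sizes and alignment offset. This sketch is not from the thread; /dev/sdc2 is
only an example device, and a non-zero alignment_offset is the unaligned case:

/* Illustrative sketch only (not part of the thread): query the kernel's
 * view of the partition's alignment. A non-zero BLKALIGNOFF value means
 * the partition start is not aligned to the physical block size. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sdc2"; /* example device */
	int fd = open(dev, O_RDONLY);
	int logical = 0, alignoff = 0;
	unsigned int physical = 0;

	if (fd < 0) {
		perror(dev);
		return 1;
	}
	ioctl(fd, BLKSSZGET, &logical);    /* logical sector size */
	ioctl(fd, BLKPBSZGET, &physical);  /* physical block size */
	ioctl(fd, BLKALIGNOFF, &alignoff); /* alignment offset    */
	printf("%s: logical=%d physical=%u alignment_offset=%d\n",
	       dev, logical, physical, alignoff);
	close(fd);
	return 0;
}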


Re: [PATCH v2] block: avoid incorrect bdi_unregister call

2016-12-15 Thread Masayoshi Mizuma
Hi Jens,

Could you add this patch for 4.10?

- Masayoshi Mizuma

On Fri, 2 Dec 2016 14:08:34 +0900 Masayoshi Mizuma wrote:
> bdi_unregister() should only be called after bdi_register() has been
> called, so we should check whether the WB_registered flag is set.
> 
> As an example of the situation, an error path in a device driver may
> call blk_cleanup_queue() before the driver calls bdi_register().
> 
> Signed-off-by: Masayoshi Mizuma 
> ---
>   mm/backing-dev.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 8fde443..f8b07d4 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -853,6 +853,9 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
>   
>   void bdi_unregister(struct backing_dev_info *bdi)
>   {
> + if (!test_bit(WB_registered, &bdi->wb.state))
> + return;
> +
>   /* make sure nobody finds us on the bdi_list anymore */
>   bdi_remove_from_list(bdi);
>   wb_shutdown(&bdi->wb);
> 
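
For context, the error path being guarded against has roughly this shape; the
driver and function names below are made up, and this is only an illustration
of the ordering problem (queue torn down before the bdi was ever registered):

/* Hypothetical driver probe error path, for illustration only. The queue is
 * cleaned up before add_disk()/bdi_register() ever ran, so the eventual
 * bdi_unregister() call should simply return; the WB_registered check in the
 * patch above provides exactly that. */
#include <linux/blkdev.h>
#include <linux/platform_device.h>

static int mydrv_hw_init(struct platform_device *pdev)
{
	return -EIO;	/* pretend hardware init failed */
}

static int mydrv_probe(struct platform_device *pdev)
{
	struct request_queue *q;
	int err;

	q = blk_alloc_queue(GFP_KERNEL);
	if (!q)
		return -ENOMEM;

	err = mydrv_hw_init(pdev);
	if (err) {
		/* error path: bdi_register() was never called for this queue */
		blk_cleanup_queue(q);
		return err;
	}

	/* the normal path would go on to set up the gendisk and add_disk() */
	return 0;
}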


[PATCH] block: Remove unused member (busy) from struct blk_queue_tag

2016-12-15 Thread Ritesh Harjani
Signed-off-by: Ritesh Harjani 
---
 include/linux/blkdev.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 286b2a2..8369564 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -288,7 +288,6 @@ enum blk_queue_state {
 struct blk_queue_tag {
struct request **tag_index; /* map of busy tags */
unsigned long *tag_map; /* bit map of free/busy tags */
-   int busy;   /* current depth */
int max_depth;  /* what we will send to device */
int real_max_depth; /* what the array can hold */
atomic_t refcnt;/* map can be shared */
-- 
1.8.2.1



[LSF/MM ATTEND] Un-addressable device memory and block/fs implications

2016-12-15 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specifically, how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that the CPU cannot
> access.
>
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this is
> worth discussing during the summit to see how people feel about such a plan
> and whether they have better ideas.
>
>
> I would also like to join discussions on:
>   - Peer-to-Peer DMAs between PCIe devices
>   - CDM coherent device memory
>   - PMEM
>   - overall mm discussions

I would like to attend this discussion. I can talk about coherent device
memory and how having HMM handle it will make it easy to have one
interface for device drivers. For the coherent device case we definitely
need page cache migration support.

-aneesh



Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-15 Thread Aneesh Kumar K.V
Jan Kara  writes:

> On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> <snip: device page handling>
>
>> > So won't it be easier to leave the pagecache page where it is and *copy* it
>> > to the device? Can the device notify us *before* it is going to modify a
>> > page, not just after it has modified it? Possibly if we just give it the
>> > page read-only and it will have to ask CPU to get write permission? If yes,
>> > then I belive this could work and even fs support should be doable.
>> 
>> Well, yes and no. The device obeys the same rules as the CPU, so if a
>> file-backed page is mapped read-only in the process, it must first do a
>> write fault, which will call into the fs (page_mkwrite() of vm_ops). But
>> once a page has write permission, there is no way to be notified by the
>> hardware on every write. First, the hardware does not have the capability.
>> Second, we are talking about thousands (10,000 is the upper range in
>> today's devices) of concurrent threads, each of which can possibly write
>> to the page under consideration.
>
> Sure, I meant whether the device is able to do the equivalent of a
> ->page_mkwrite notification, which apparently it is. OK.
>
>> We really want the device page to behave just like a regular page. Most fs
>> code paths never map file content; it only happens during read/write, and I
>> believe this can be handled either by migrating back or by using a bounce
>> page. I want to provide the choice between the two solutions, as one will
>> be better for some workloads and the other for different workloads.
>
> I agree with keeping a page used by the device behaving as similarly as
> possible to any other page. I'm just exploring different possibilities for
> how to make that happen. E.g. the scheme I was aiming at is:
>
> When you want page A to be used by the device, you set up page A' in the
> device but make sure any access to it will fault.
>
> When the device wants to access A', it notifies the CPU, which write-protects
> all mappings of A, copies A to A', and maps A' read-only for the device.


A and A' will have different pfns here and hence different struct pages.
So what will be in the address_space->page_tree? If we place A' in the
page cache, then we are essentially bringing in a lot of the locking
complexity Dave talked about in previous mails.

>
> When the device wants to write to A', it notifies the CPU, which will clear
> all mappings of A and mark A as not-uptodate & dirty. When the CPU then
> wants to access the data in A again - we need to catch ->readpage,
> ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> the device, copy data to A, mark A as uptodate & dirty, and off we go.
>
> When we want to write to the page on the CPU - we get either a wp fault if
> it was via mmap, or we have to catch that in places using kmap() - we just
> remove access to A' from the device.
>
> This scheme makes the device mapping functionality transparent to the
> filesystem (you actually don't need to hook directly into ->readpage etc.
> handlers, you can just have wrappers around them for this functionality)
> and fairly straightforward... It is so transparent that even direct IO works
> with this since the page cache invalidation pass we do before actually doing
> the direct IO will make sure to pull all the pages from the device and write
> them to disk if needed. What do you think?
>

-aneesh
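
To make the proposed flow easier to follow, here is a rough sketch of the
transitions Jan describes above, with purely hypothetical names; this is an
illustration, not proposed kernel code:

/* Rough sketch of the scheme quoted above; names are invented for
 * illustration. A is the page cache page, A' its device-side copy. */
enum mirror_state {
	MIRROR_NONE,		/* only A exists, device has no copy        */
	MIRROR_DEVICE_RO,	/* A' exists, device maps it read-only      */
	MIRROR_DEVICE_DIRTY,	/* device wrote A'; A is !uptodate && dirty */
};

enum mirror_event {
	EV_DEVICE_ACCESS,	/* device touches A' for the first time     */
	EV_DEVICE_WRITE,	/* device write-faults on A'                */
	EV_CPU_ACCESS,		/* CPU hits ->readpage(s)/->writepage(s)/wp */
};

static enum mirror_state mirror_step(enum mirror_state s, enum mirror_event e)
{
	switch (e) {
	case EV_DEVICE_ACCESS:
		/* write-protect all CPU mappings of A, copy A -> A',
		 * map A' read-only for the device */
		return MIRROR_DEVICE_RO;
	case EV_DEVICE_WRITE:
		/* clear CPU mappings of A, mark A not-uptodate & dirty */
		return MIRROR_DEVICE_DIRTY;
	case EV_CPU_ACCESS:
		/* write-protect A' on the device, copy A' -> A,
		 * mark A uptodate & dirty, then let the CPU proceed */
		return MIRROR_DEVICE_RO;
	}
	return s;
}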



Re: [PATCH V5 15/17] block: track request size in blk_issue_stat

2016-12-15 Thread kbuild test robot
Hi Shaohua,

[auto build test WARNING on block/for-next]
[also build test WARNING on next-20161215]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-low-limit/20161216-093257
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-s0-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   block/blk-stat.c: In function 'blk_stat_set_issue':
>> block/blk-stat.c:243:26: warning: left shift count >= width of type [-Wshift-count-overflow]
      (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
                             ^~

vim +243 block/blk-stat.c

   227  queue_for_each_hw_ctx(q, hctx, i) {
   228  hctx_for_each_ctx(hctx, ctx, j) {
   229  blk_stat_init(&ctx->stat[BLK_STAT_READ]);
   230  blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
   231  }
   232  }
   233  } else {
   234  blk_stat_init(&q->rq_stats[BLK_STAT_READ]);
   235  blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]);
   236  }
   237  }
   238  
   239  void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
   240  {
   241  stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
   242  (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
 > 243  (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
   244  }
   245  
   246  /*
   247   * Enable stat tracking, return whether it was enabled
   248   */
   249  bool blk_stat_enable(struct request_queue *q)
   250  {
   251  if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
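
The warning most likely comes from sector_t being only 32 bits wide on this
i386 config (an assumption; it depends on CONFIG_LBDAF), while
BLK_STAT_SIZE_SHIFT is 49. Widening the capped size before the shift would
avoid it, for example (untested sketch of line 243):

	/* possible fix: do the shift in 64 bits */
	stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
			((u64)blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);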

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Jens Axboe
On 12/15/2016 12:29 PM, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

I fixed this up. blk_mq_free_request() is now what we still use as the
exported interface, and that just calls blk_mq_sched_put_request(). The
old (__)blk_mq_free_request() helpers are now (__)blk_mq_finish_request()
and internal only.

-- 
Jens Axboe
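
In other words, the exported helper becomes a thin wrapper; a sketch of the
arrangement described above, not the literal commit:

/* Sketch only: blk_mq_free_request() stays exported and forwards to the
 * scheduler-aware put path, so existing callers keep doing the right thing. */
void blk_mq_free_request(struct request *rq)
{
	blk_mq_sched_put_request(rq);
}
EXPORT_SYMBOL_GPL(blk_mq_free_request);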



[PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state

2016-12-15 Thread Shaohua Li
When the queue is in the LIMIT_LOW state and all cgroups with a low limit
cross the bps/iops limitation, we will upgrade the queue's state to
LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If children have a lower low
limit than the parent, the parent's low limit is meaningless; if the
children's bps/iops cross the low limit, we can upgrade the queue state.
The other case is children having a higher low limit than the parent. Then
the children's low limit is meaningless; as long as the parent's bps/iops
cross the low limit, we can upgrade the queue state.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 101 +--
 1 file changed, 97 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e55bd36..cfd74cfc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -455,6 +455,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
td->limit_valid[LIMIT_LOW] = false;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -466,9 +467,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
blk_throtl_update_valid_limit(tg->td);
 
-   if (tg->td->limit_index == LIMIT_LOW &&
-   !tg->td->limit_valid[LIMIT_LOW])
-   tg->td->limit_index = LIMIT_MAX;
+   if (!tg->td->limit_valid[tg->td->limit_index])
+   throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+   struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;
 
spin_lock_irq(q->queue_lock);
+   if (throtl_can_upgrade(td, NULL))
+   throtl_upgrade_state(td);
+
 again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1506,6 +1511,88 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
 };
 
+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+   struct throtl_service_queue *sq = &tg->service_queue;
+   bool read_limit, write_limit;
+
+   /*
+* if cgroup reaches low/max limit (max >= low), it's ok to next
+* limit
+*/
+   read_limit = tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+tg->iops[READ][LIMIT_LOW] != UINT_MAX;
+   write_limit = tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+ tg->iops[WRITE][LIMIT_LOW] != UINT_MAX;
+   if (read_limit && sq->nr_queued[READ] &&
+   (!write_limit || sq->nr_queued[WRITE]))
+   return true;
+   if (write_limit && sq->nr_queued[WRITE] &&
+   (!read_limit || sq->nr_queued[READ]))
+   return true;
+   return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+   while (true) {
+   if (throtl_tg_can_upgrade(tg))
+   return true;
+   tg = sq_to_tg(tg->service_queue.parent_sq);
+   if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+   !tg_to_blkg(tg)->parent))
+   return false;
+   }
+   return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+   struct throtl_grp *this_tg)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+
+   if (td->limit_index != LIMIT_LOW)
+   return false;
+
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+
+   if (tg == this_tg)
+   continue;
+   if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+   continue;
+   if (!throtl_hierarchy_can_upgrade(tg)) {
+   rcu_read_unlock();
+   return false;
+   }
+   }
+   rcu_read_unlock();
+   return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+
+   td->limit_index = LIMIT_MAX;
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+   struct throtl_service_queue *sq = &tg->service_queue;
+
+   tg->disptime = jiffies - 1;
+   throtl_select_dispatch(sq);
+   throtl_schedule_next_dispatch(sq, false);
+   }
+   rcu_read_unlock();
+   throtl_select_dispatch(&td->s

[PATCH V5 06/17] blk-throttle: add downgrade logic

2016-12-15 Thread Shaohua Li
When the queue state machine is in the LIMIT_MAX state but a cgroup is
below its low limit for some time, the queue should be downgraded to the
lower state, as that cgroup's low limit isn't met.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 148 +++
 1 file changed, 148 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cfd74cfc..0f65fce 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,13 @@ struct throtl_grp {
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];
 
+   unsigned long last_low_overflow_time[2];
+
+   uint64_t last_bytes_disp[2];
+   unsigned int last_io_disp[2];
+
+   unsigned long last_check_time;
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -159,6 +166,9 @@ struct throtl_data
struct work_struct dispatch_work;
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
+
+   unsigned long low_upgrade_time;
+   unsigned long low_downgrade_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
/* Charge the bio to the group */
tg->bytes_disp[rw] += bio->bi_iter.bi_size;
tg->io_disp[rw]++;
+   tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+   tg->last_io_disp[rw]++;
 
/*
 * BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -1511,6 +1523,36 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
 };
 
+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+   unsigned long rtime = -1, wtime = -1;
+
+   if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+   tg->iops[READ][LIMIT_LOW] != UINT_MAX)
+   rtime = tg->last_low_overflow_time[READ];
+   if (tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+   tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+   wtime = tg->last_low_overflow_time[WRITE];
+   return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+   struct throtl_service_queue *parent_sq;
+   struct throtl_grp *parent = tg;
+   unsigned long ret = __tg_last_low_overflow_time(tg);
+
+   while (true) {
+   parent_sq = parent->service_queue.parent_sq;
+   parent = sq_to_tg(parent_sq);
+   if (!parent)
+   break;
+   if (time_after(__tg_last_low_overflow_time(parent), ret))
+   ret = __tg_last_low_overflow_time(parent);
+   }
+   return ret;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1555,6 +1597,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
if (td->limit_index != LIMIT_LOW)
return false;
 
+   if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+   return false;
+
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1578,6 +1623,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
struct blkcg_gq *blkg;
 
td->limit_index = LIMIT_MAX;
+   td->low_upgrade_time = jiffies;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1593,6 +1639,100 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+   td->limit_index = new;
+   td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
+{
+   struct throtl_data *td = tg->td;
+   unsigned long now = jiffies;
+
+   /*
+* If cgroup is below low limit, consider downgrade and throttle other
+* cgroups
+*/
+   if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
+   time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+   return true;
+   return false;
+}
+
+static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
+{
+   while (true) {
+   if (!throtl_tg_can_downgrade(tg))
+   return false;
+   tg = sq_to_tg(tg->service_queue.parent_sq);
+   if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+   !tg_to_blkg(tg)->parent))
+   break;
+   }
+   return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+   uint64_t bps;
+   unsigned

[PATCH V5 15/17] block: track request size in blk_issue_stat

2016-12-15 Thread Shaohua Li
Currently there is no way to know the request size when the request is
finished. The next patch will need this info, so add it to blk_issue_stat.
With this, we will have 49 bits to track time, which is still a very long
time.
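
As a side note, the resulting layout of the 64-bit field (3 reserved bits,
12 size bits, 49 time bits) can be exercised from userspace with the same
masks. A small standalone illustration, not part of the patch, with the
constants copied from the blk-stat.h hunk below:

/* Standalone illustration of the 3/12/49-bit packing; constants mirror the
 * blk-stat.h changes in this patch. */
#include <stdio.h>
#include <stdint.h>

#define BLK_STAT_RES_BITS	3
#define BLK_STAT_SIZE_BITS	12
#define BLK_STAT_RES_SHIFT	(64 - BLK_STAT_RES_BITS)		  /* 61 */
#define BLK_STAT_SIZE_SHIFT	(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS) /* 49 */
#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
#define BLK_STAT_SIZE_MASK	\
	(((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)

int main(void)
{
	uint64_t time_ns = 1234567890ULL;	/* example issue time in ns */
	uint64_t sectors = 256;			/* example request size     */
	uint64_t stat;

	/* pack, as blk_stat_set_issue() does (reserved bits left clear) */
	stat = (time_ns & BLK_STAT_TIME_MASK) |
	       ((sectors & ((1ULL << BLK_STAT_SIZE_BITS) - 1))
			<< BLK_STAT_SIZE_SHIFT);

	/* unpack, as blk_stat_time()/blk_stat_size() do */
	printf("time=%llu sectors=%llu\n",
	       (unsigned long long)(stat & BLK_STAT_TIME_MASK),
	       (unsigned long long)((stat & BLK_STAT_SIZE_MASK)
					>> BLK_STAT_SIZE_SHIFT));
	return 0;
}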

Signed-off-by: Shaohua Li 
---
 block/blk-core.c  |  2 +-
 block/blk-mq.c|  2 +-
 block/blk-stat.c  |  7 ---
 block/blk-stat.h  | 29 +++--
 block/blk-wbt.h   | 10 +-
 include/linux/blk_types.h |  2 +-
 6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..485c32d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2459,7 +2459,7 @@ void blk_start_request(struct request *req)
blk_dequeue_request(req);
 
if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
-   blk_stat_set_issue_time(&req->issue_stat);
+   blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
req->rq_flags |= RQF_STATS;
wbt_issue(req->q->rq_wb, &req->issue_stat);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4bf850e..891db62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -474,7 +474,7 @@ void blk_mq_start_request(struct request *rq)
rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-   blk_stat_set_issue_time(&rq->issue_stat);
+   blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
rq->rq_flags |= RQF_STATS;
wbt_issue(q->rq_wb, &rq->issue_stat);
}
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 9b43efb..0469855 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -236,10 +236,11 @@ void blk_stat_clear(struct request_queue *q)
}
 }
 
-void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
 {
-   stat->time = (stat->time & BLK_STAT_MASK) |
-   (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+   stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+   (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+   (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
 }
 
 /*
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a2050a0..462197f 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
 #define BLK_STAT_NSEC_MASK ~(BLK_STAT_NSEC - 1)
 
 /*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
  */
 #define BLK_STAT_RES_BITS  3
-#define BLK_STAT_SHIFT (64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK  ~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS 12
+#define BLK_STAT_RES_SHIFT (64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK \
+   (((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK  (~((1ULL << BLK_STAT_RES_SHIFT) - 1))
 
 enum {
BLK_STAT_READ   = 0,
@@ -26,7 +33,7 @@ void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
 void blk_stat_clear(struct request_queue *);
 void blk_stat_init(struct blk_rq_stat *);
 bool blk_stat_is_current(struct blk_rq_stat *);
-void blk_stat_set_issue_time(struct blk_issue_stat *);
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size);
 bool blk_stat_enable(struct request_queue *);
 
 static inline u64 __blk_stat_time(u64 time)
@@ -36,7 +43,17 @@ static inline u64 __blk_stat_time(u64 time)
 
 static inline u64 blk_stat_time(struct blk_issue_stat *stat)
 {
-   return __blk_stat_time(stat->time);
+   return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+   return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+   return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
 }
 
 #endif
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 65f1de5..7265f1f 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {
 
 static inline void wbt_clear_state(struct blk_issue_stat *stat)
 {
-   stat->time &= BLK_STAT_TIME_MASK;
+   stat->stat &= ~BLK_STAT_RES_MASK;
 }
 
 static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
 {
-   return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+   return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
 }
 
 static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
 {
-   stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+   stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
 }
 
 static inline bool wbt_is_track

[PATCH V5 07/17] blk-throttle: make sure expire time isn't too big

2016-12-15 Thread Shaohua Li
A cgroup could be throttled to a limit, but when all cgroups cross the high
limit, the queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttling and other cgroups don't dispatch IO any more. In this case,
nobody can trigger the current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wake up the cgroup if other cgroups are
idle or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Capping the cgroup's sleep time wouldn't change the cgroup's
bps/iops, but could make it wake up more frequently, which isn't a big
issue because throtl_slice * 8 is already quite big.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0f65fce..41ec72c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -588,6 +588,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
  unsigned long expires)
 {
+   unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+   if (time_after(expires, max_expire))
+   expires = max_expire;
mod_timer(&sq->pending_timer, expires);
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
   expires - jiffies, jiffies);
-- 
2.9.3



[PATCH V5 09/17] blk-throttle: detect completed idle cgroup

2016-12-15 Thread Shaohua Li
A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to the higher level, and all cgroups will
be throttled to their lower limit, so we waste bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
the bandwidth.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cd10c65..a0ba961 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -148,6 +148,8 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
+   unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -437,11 +439,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+   struct throtl_grp *tg = pd_to_tg(pd);
/*
 * We don't want new groups to escape the limits of its ancestors.
 * Update has_rules[] after a new group is brought online.
 */
-   tg_update_has_rules(pd_to_tg(pd));
+   tg_update_has_rules(tg);
+   tg->last_dispatch_time[READ] = jiffies;
+   tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1582,6 +1587,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+   if (time_after_eq(jiffies,
+tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+   time_after_eq(jiffies,
+tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+   return true;
return false;
 }
 
@@ -1660,6 +1671,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;
 
+   if (time_after_eq(now, tg->last_dispatch_time[READ] +
+   td->throtl_slice) &&
+   time_after_eq(now, tg->last_dispatch_time[WRITE] +
+   td->throtl_slice))
+   return false;
/*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
@@ -1769,6 +1785,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
while (true) {
+   tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
-- 
2.9.3



[PATCH V5 08/17] blk-throttle: make throtl_slice tunable

2016-12-15 Thread Shaohua Li
throtl_slice is important for blk-throttling. It's called a slice
internally, but it really is the time window over which blk-throttling
samples data. blk-throttling makes decisions based on the samples. An
example is bandwidth measurement: a cgroup's bandwidth is measured over
the time interval of throtl_slice.

A small throtl_slice means cgroups have smoother throughput but burn
more CPU. Its default value of 100ms is not appropriate for all disks;
a fast SSD can dispatch a lot of IOs in 100ms. This patch makes it
tunable.

Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.

Signed-off-by: Shaohua Li 
---
 Documentation/block/queue-sysfs.txt |  6 +++
 block/blk-sysfs.c   | 10 +
 block/blk-throttle.c| 77 ++---
 block/blk.h |  3 ++
 4 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 5164215..91f371d 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -192,5 +192,11 @@ scaling back writes. Writing a value of '0' to this file disables the
 feature. Writing a value of '-1' to this file resets the value to the
 default setting.
 
+throttle_sample_time (RW)
+-
+This is the time window that blk-throttle samples data, in millisecond.
+blk-throttle makes decision based on the samplings. Lower time means cgroups
+have more smooth throughput, but higher CPU overhead. This exists only when
+CONFIG_BLK_DEV_THROTTLING is enabled.
 
 Jens Axboe , February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05..0e3fb2a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -690,6 +690,13 @@ static struct queue_sysfs_entry queue_wb_lat_entry = {
.show = queue_wb_lat_show,
.store = queue_wb_lat_store,
 };
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_sample_time_entry = {
+   .attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR },
+   .show = blk_throtl_sample_time_show,
+   .store = blk_throtl_sample_time_store,
+};
+#endif
 
 static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
@@ -724,6 +731,9 @@ static struct attribute *default_attrs[] = {
&queue_stats_entry.attr,
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+   &throtl_sample_time_entry.attr,
+#endif
NULL,
 };
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 41ec72c..cd10c65 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
 static int throtl_quantum = 32;
 
 /* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -162,6 +163,8 @@ struct throtl_data
/* Total Number of queued bios on READ and WRITE lists */
unsigned int nr_queued[2];
 
+   unsigned int throtl_slice;
+
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
unsigned int limit_index;
@@ -588,7 +591,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
  unsigned long expires)
 {
-   unsigned long max_expire = jiffies + 8 * throtl_slice;
+   unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
 
if (time_after(expires, max_expire))
expires = max_expire;
@@ -649,7 +652,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
if (time_after_eq(start, tg->slice_start[rw]))
tg->slice_start[rw] = start;
 
-   tg->slice_end[rw] = jiffies + throtl_slice;
+   tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -661,7 +664,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
tg->bytes_disp[rw] = 0;
tg->io_disp[rw] = 0;
tg->slice_start[rw] = jiffies;
-   tg->slice_end[rw] = jiffies + throtl_slice;
+   tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
throtl_log(&tg->service_queue,
   "[%c] new slice start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -671,13 +674,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
  

[PATCH V5 00/17] blk-throttle: add .low limit

2016-12-15 Thread Shaohua Li
Hi,

cgroup still lacks a good IO controller. CFQ works well for hard disks, but
not so well for SSDs. This patch set tries to add a conservative limit to
blk-throttle. It isn't proportional scheduling, but it can help prioritize
cgroups. There are several reasons we chose blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
  request based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but
  it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism, which is used by
  CFQ/BFQ. That mechanism has been shown to harm performance on fast SSDs.

The patch set adds a new io.low limit to blk-throttle. It's only for cgroup2.
The existing io.max is hard-limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low, in contrast, is best-effort
throttling: cgroups with a 'low' limit can run above their 'low' limit at
appropriate times. Specifically, if all cgroups reach their 'low' limit, all
cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
limit, all other cgroups will run according to their 'low' limit. So the
'low' limit plays two roles: it allows cgroups to use free bandwidth and it
protects each cgroup's 'low' limit.

An example usage: we have a high prio cgroup with a high 'low' limit and a
low prio cgroup with a low 'low' limit. If the high prio cgroup isn't
running, the low prio cgroup can run above its 'low' limit, so we don't
waste bandwidth. When the high prio cgroup runs and is below its 'low'
limit, the low prio cgroup will run under its 'low' limit. This protects
the high prio cgroup so it gets more resources.
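
For a concrete (hypothetical) version of this example, the io.low syntax
introduced later in the series ("MAJ:MIN rbps=... wbps=...", see patch 12)
could be written from a small program; the cgroup names and device numbers
below are placeholders:

/* Illustration only: give the high prio cgroup a large 'low' limit and the
 * low prio cgroup a small one. Paths and device numbers are made up. */
#include <stdio.h>

static int write_io_low(const char *cgroup, const char *conf)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/io.low", cgroup);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", conf);
	fclose(f);
	return 0;
}

int main(void)
{
	write_io_low("high_prio", "8:16 rbps=104857600 wbps=104857600"); /* 100MB/s */
	write_io_low("low_prio",  "8:16 rbps=10485760 wbps=10485760");   /* 10MB/s  */
	return 0;
}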

The implementation is simple. The disk queue has a state machine with 2
states, LIMIT_LOW and LIMIT_MAX. In each disk state, we throttle cgroups
according to the limit of that state: the io.low limit for the LIMIT_LOW
state, the io.max limit for LIMIT_MAX. The disk state can be
upgraded/downgraded between LIMIT_LOW and LIMIT_MAX according to the rule
above. Initially the disk state is LIMIT_MAX, and if no cgroup sets io.low,
the disk state will remain in LIMIT_MAX. Systems with only io.max set will
find nothing changed with these patches.

The first 9 patches implement the basic framework: adding the interface and
handling the upgrade and downgrade logic. Patch 9 detects a special case
where a cgroup is completely idle; in this case, we ignore the cgroup's
limit. Patches 10-17 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we are not
expecting. For example, one cgroup's bandwidth will drop below its io.low
limit very soon after an upgrade. Patch 10 has more details about the issue.

2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough
IO. Under the above upgrade rule, the disk will remain in the LIMIT_LOW
state and all other cgroups can't dispatch more IO above their 'low' limit.
Hence there is waste. Patch 11 has more details about the issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth rapidly
drops under its 'low' limit. The smoothness means we could waste some
bandwidth in the transition though, but we must pay something for sharing.

Issue 2 is very hard. We introduce two mechanisms for it. One is 'idle
time' or 'think time', borrowed from CFQ. If a cgroup's average idle time is
high, we treat it as idle and its 'low' limit isn't respected. Please see
patches 11 - 13 for details. The other is 'latency target'. If a cgroup's IO
latency is low, we treat it as idle and its 'low' limit isn't respected.
Please see patches 14 - 17 for details. Both mechanisms only apply when a
cgroup runs below its 'low' limit.

The disadvantage of blk-throttle is that it exports rather low-level knobs.
Configuration would not be easy for normal users, though it would be
powerful for experienced users.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' so consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based
  calculation'
- Rebase and fix other problems

Issue pointed out by Tejun isn't fixed yet:
- .pd_offline_fn vs .pd_free_fn. .pd_free_fn seems too late to change states

V3->V4:
- Add latency target for cgroup
- Fix bugs
http://marc.info/?l=linux-block&m=147916216512915&w=2

V2->V3:
- Rebase
- Fix several bugs
- Make harddisk think tim

[PATCH V5 02/17] blk-throttle: prepare support multiple limits

2016-12-15 Thread Shaohua Li
We are going to support low/max limits; each cgroup will have 2 limits
after that. This patch prepares for the multiple-limits change.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 114 +--
 1 file changed, 73 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e45bf50..a75bfa2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)  rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+   LIMIT_MAX,
+   LIMIT_CNT,
+};
+
 struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
bool has_rules[2];
 
/* bytes per second rate limits */
-   uint64_t bps[2];
+   uint64_t bps[2][LIMIT_CNT];
 
/* IOPS limits */
-   unsigned int iops[2];
+   unsigned int iops[2][LIMIT_CNT];
 
/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data
 
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+   unsigned int limit_index;
+   bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+   return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+   return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
struct throtl_grp *tg;
-   int rw;
+   int rw, index;
 
tg = kzalloc_node(sizeof(*tg), gfp, node);
if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}
 
RB_CLEAR_NODE(&tg->rb_node);
-   tg->bps[READ] = U64_MAX;
-   tg->bps[WRITE] = U64_MAX;
-   tg->iops[READ] = UINT_MAX;
-   tg->iops[WRITE] = UINT_MAX;
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (index = 0; index < LIMIT_CNT; index++) {
+   tg->bps[rw][index] = U64_MAX;
+   tg->iops[rw][index] = UINT_MAX;
+   }
+   }
 
return &tg->pd;
 }
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+   struct throtl_data *td = tg->td;
int rw;
 
for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-   (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
+   (td->limit_valid[td->limit_index] &&
+(tg_bps_limit(tg, rw) != U64_MAX ||
+ tg_iops_limit(tg, rw) != UINT_MAX));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
if (!nr_slices)
return;
-   tmp = tg->bps[rw] * throtl_slice * nr_slices;
+   tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
do_div(tmp, HZ);
bytes_trim = tmp;
 
-   io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+   io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
 
if (!bytes_trim && !io_trim)
return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 * have been trimmed.
 */
 
-   tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+   tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
 
if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
}
 
/* Calc approx time to dispatch */
-   jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+   jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;
 
if (jiffy_wait > jiffy_elapsed)
jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-   tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+   tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;
 
@@ -736,7 +758,7 @@ s

[PATCH V5 17/17] blk-throttle: add latency target support

2016-12-15 Thread Shaohua Li
One hard problem in adding the .low limit is detecting an idle cgroup. If
one cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine whether other cgroups can dispatch more IO. We added
the think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have a mechanism to calculate a latency threshold for each IO
size. For every IO dispatched from a cgroup, we compare its latency
against its threshold and record the result. If most IO latency is below
the threshold (in the code I use 75%), the cgroup can be treated as idle
and other cgroups can dispatch more IO.

Currently this latency target check is only for SSDs, as we can't
calculate the latency target for hard disks. And this is only for cgroup
leaf nodes so far.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1dc707a..915ebf5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -162,6 +162,10 @@ struct throtl_grp {
u64 checked_last_finish_time;
u64 avg_ttime;
u64 idle_ttime_threshold;
+
+   unsigned int bio_cnt; /* total bios */
+   unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+   unsigned long bio_cnt_reset_time;
 };
 
 /* We measure latency for request size from <= 4k to >= 1M */
@@ -1688,11 +1692,14 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 * - single idle is too long, longer than a fixed value (in case user
 *   configure a too big threshold) or 4 times of slice
 * - average think time is more than threshold
+* - IO latency is largely below threshold
 */
u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
time = min_t(u64, MAX_IDLE_TIME, time);
return ktime_get_ns() - tg->last_finish_time > time ||
-  tg->avg_ttime > tg->idle_ttime_threshold;
+  tg->avg_ttime > tg->idle_ttime_threshold ||
+  (tg->latency_target && tg->bio_cnt &&
+   tg->bad_bio_cnt * 5 < tg->bio_cnt);
 }
 
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2170,12 +2177,36 @@ void blk_throtl_bio_endio(struct bio *bio)
 
start_time = blk_stat_time(&bio->bi_issue_stat);
finish_time = __blk_stat_time(finish_time);
-   if (start_time && finish_time > start_time &&
-   tg->td->track_bio_latency == 1 &&
-   !(bio->bi_issue_stat.stat & SKIP_TRACK)) {
-   lat = finish_time - start_time;
+   if (!start_time || finish_time <= start_time)
+   return;
+
+   lat = finish_time - start_time;
+   if (tg->td->track_bio_latency == 1 &&
+   !(bio->bi_issue_stat.stat & SKIP_TRACK))
throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
bio_op(bio), lat);
+
+   if (tg->latency_target) {
+   int bucket;
+   unsigned int threshold;
+
+   bucket = request_bucket_index(
+   blk_stat_size(&bio->bi_issue_stat));
+   threshold = tg->td->avg_buckets[bucket].latency +
+   tg->latency_target;
+   if (lat > threshold)
+   tg->bad_bio_cnt++;
+   /*
+* Not race free, could get wrong count, which means cgroups
+* will be throttled
+*/
+   tg->bio_cnt++;
+   }
+
+   if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+   tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+   tg->bio_cnt /= 2;
+   tg->bad_bio_cnt /= 2;
}
 }
 
-- 
2.9.3



[PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold

2016-12-15 Thread Shaohua Li
Add an interface to configure the threshold. The io.low interface will
look like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

idle is in microseconds.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3cd8400..0218ff9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -180,6 +180,8 @@ struct throtl_data
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
 
+   u64 dft_idle_ttime_threshold;
+
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
 
@@ -1427,6 +1429,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
+   char idle_time[26] = "";
 
if (!dname)
return 0;
@@ -1434,7 +1437,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->bps_conf[READ][off] == U64_MAX &&
tg->bps_conf[WRITE][off] == U64_MAX &&
tg->iops_conf[READ][off] == UINT_MAX &&
-   tg->iops_conf[WRITE][off] == UINT_MAX)
+   tg->iops_conf[WRITE][off] == UINT_MAX &&
+   (off != LIMIT_LOW || tg->idle_ttime_threshold ==
+ tg->td->dft_idle_ttime_threshold))
return 0;
 
if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1449,9 +1454,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
if (tg->iops_conf[WRITE][off] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
tg->iops_conf[WRITE][off]);
+   if (off == LIMIT_LOW) {
+   if (tg->idle_ttime_threshold == U64_MAX)
+   strcpy(idle_time, " idle=max");
+   else
+   snprintf(idle_time, sizeof(idle_time), " idle=%llu",
+   div_u64(tg->idle_ttime_threshold, 1000));
+   }
 
-   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
-  dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+  dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
return 0;
 }
 
@@ -1469,6 +1481,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct blkg_conf_ctx ctx;
struct throtl_grp *tg;
u64 v[4];
+   u64 idle_time;
int ret;
int index = of_cft(of)->private;
 
@@ -1483,6 +1496,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = tg->iops_conf[READ][index];
v[3] = tg->iops_conf[WRITE][index];
 
+   idle_time = tg->idle_ttime_threshold;
while (true) {
char tok[27];   /* wiops=18446744073709551616 */
char *p;
@@ -1514,6 +1528,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = min_t(u64, val, UINT_MAX);
else if (!strcmp(tok, "wiops"))
v[3] = min_t(u64, val, UINT_MAX);
+   else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+   idle_time = val;
else
goto out_finish;
}
@@ -1542,6 +1558,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
blk_throtl_update_valid_limit(tg->td);
if (tg->td->limit_valid[LIMIT_LOW])
tg->td->limit_index = LIMIT_LOW;
+   tg->idle_ttime_threshold = (idle_time == U64_MAX) ?
+   U64_MAX : idle_time * 1000;
}
tg_conf_updated(tg);
ret = 0;
@@ -1862,6 +1880,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
else
tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+   if (tg->td->dft_idle_ttime_threshold == 0)
+   tg->td->dft_idle_ttime_threshold =
+   tg->idle_ttime_threshold;
}
 
/* see throtl_charge_bio() */
-- 
2.9.3



[PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1

2016-12-15 Thread Shaohua Li
Clean up the code to avoid using -1.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6bb4fe..e45bf50 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -334,10 +334,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
}
 
RB_CLEAR_NODE(&tg->rb_node);
-   tg->bps[READ] = -1;
-   tg->bps[WRITE] = -1;
-   tg->iops[READ] = -1;
-   tg->iops[WRITE] = -1;
+   tg->bps[READ] = U64_MAX;
+   tg->bps[WRITE] = U64_MAX;
+   tg->iops[READ] = UINT_MAX;
+   tg->iops[WRITE] = UINT_MAX;
 
return &tg->pd;
 }
@@ -380,7 +380,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-   (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+   (tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -771,7 +771,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
   bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
/* If tg->bps = -1, then BW is unlimited */
-   if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+   if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1110,7 +1110,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
u64 v = *(u64 *)((void *)tg + off);
 
-   if (v == -1)
+   if (v == U64_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1121,7 +1121,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd,
struct throtl_grp *tg = pd_to_tg(pd);
unsigned int v = *(unsigned int *)((void *)tg + off);
 
-   if (v == -1)
+   if (v == UINT_MAX)
return 0;
return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1195,7 +1195,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
if (sscanf(ctx.body, "%llu", &v) != 1)
goto out_finish;
if (!v)
-   v = -1;
+   v = U64_MAX;
 
tg = blkg_to_tg(ctx.blkg);
 
@@ -1270,17 +1270,17 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
if (!dname)
return 0;
-   if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-   tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+   if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
+   tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
return 0;
 
-   if (tg->bps[READ] != -1)
+   if (tg->bps[READ] != U64_MAX)
snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-   if (tg->bps[WRITE] != -1)
+   if (tg->bps[WRITE] != U64_MAX)
snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-   if (tg->iops[READ] != -1)
+   if (tg->iops[READ] != UINT_MAX)
snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-   if (tg->iops[WRITE] != -1)
+   if (tg->iops[WRITE] != UINT_MAX)
snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
 
seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
@@ -1318,7 +1318,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
while (true) {
char tok[27];   /* wiops=18446744073709551616 */
char *p;
-   u64 val = -1;
+   u64 val = U64_MAX;
int len;
 
if (sscanf(ctx.body, "%26s%n", tok, &len) != 1)
-- 
2.9.3



[PATCH V5 03/17] blk-throttle: add .low interface

2016-12-15 Thread Shaohua Li
Add a low limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 134 ---
 1 file changed, 106 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a75bfa2..fcc4199 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
 #define rb_entry_tg(node)  rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
+   LIMIT_LOW,
LIMIT_MAX,
LIMIT_CNT,
 };
@@ -124,11 +125,15 @@ struct throtl_grp {
/* are there any throtl rules between this group and td? */
bool has_rules[2];
 
-   /* bytes per second rate limits */
+   /* internally used bytes per second rate limits */
uint64_t bps[2][LIMIT_CNT];
+   /* user configured bps limits */
+   uint64_t bps_conf[2][LIMIT_CNT];
 
-   /* IOPS limits */
+   /* internally used IOPS limits */
unsigned int iops[2][LIMIT_CNT];
+   /* user configured IOPS limits */
+   unsigned int iops_conf[2][LIMIT_CNT];
 
/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -355,6 +360,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
for (index = 0; index < LIMIT_CNT; index++) {
tg->bps[rw][index] = U64_MAX;
tg->iops[rw][index] = UINT_MAX;
+   tg->bps_conf[rw][index] = U64_MAX;
+   tg->iops_conf[rw][index] = UINT_MAX;
}
}
 
@@ -414,6 +421,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+   bool low_valid = false;
+
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+
+   if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+   tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+   tg->iops[READ][LIMIT_LOW] != UINT_MAX ||
+   tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+   low_valid = true;
+   }
+   rcu_read_unlock();
+
+   if (low_valid)
+   td->limit_valid[LIMIT_LOW] = true;
+   else
+   td->limit_valid[LIMIT_LOW] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+   struct throtl_grp *tg = pd_to_tg(pd);
+
+   tg->bps[READ][LIMIT_LOW] = U64_MAX;
+   tg->bps[WRITE][LIMIT_LOW] = U64_MAX;
+   tg->iops[READ][LIMIT_LOW] = UINT_MAX;
+   tg->iops[WRITE][LIMIT_LOW] = UINT_MAX;
+
+   blk_throtl_update_valid_limit(tg->td);
+
+   if (tg->td->limit_index == LIMIT_LOW &&
+   !tg->td->limit_valid[LIMIT_LOW])
+   tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1284,7 +1331,7 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 int off)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1294,38 +1341,38 @@ static u64 tg_prfill_max(struct seq_file *sf, struct 
blkg_policy_data *pd,
if (!dname)
return 0;
 
-   if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
-   tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
-   tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
-   tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+   if (tg->bps_conf[READ][off] == U64_MAX &&
+   tg->bps_conf[WRITE][off] == U64_MAX &&
+   tg->iops_conf[READ][off] == UINT_MAX &&
+   tg->iops_conf[WRITE][off] == UINT_MAX)
return 0;
 
-   if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+   if (tg->bps_conf[READ][off] != U64_MAX)
snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-   tg->bps[READ][LIMIT_MAX]);
-   if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+   tg->bps_conf[READ][off]);
+   if (tg->bps_conf[WRITE][off] != U64_MAX)
snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-   tg->bps[WRITE][LIMIT_MAX]);
-   if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+   tg->bps_conf[WRITE][off]);
+   if (tg->iops_conf[READ][off] != UINT_MAX)
snprintf(bufs[2], sizeof(bufs[2]), "%u",
-   tg->iops[READ][LIMIT_MAX]);
-   if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+   tg->iops_conf[READ][off]);
+   if (tg->iops_conf[WRITE][off] != UINT_MAX)
snprintf(bufs[3], siz

[PATCH V5 10/17] blk-throttle: make bandwidth change smooth

2016-12-15 Thread Shaohua Li
When all cgroups reach their low limit, they can dispatch more IO. This
could make some cgroups dispatch more IO while others cannot, and some
cgroups could even dispatch less IO than their low limit. For example,
assume cg1 has a low limit of 10MB/s, cg2 a low limit of 80MB/s, and the
disk's maximum bandwidth is 120MB/s for the workload. Their bps could
look something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO
later. Then cg1 dispatches more IO and cg2 has no room to dispatch
enough IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2
dispatches less IO than its low limit of 80MB/s, we downgrade the queue
from LIMIT_MAX to LIMIT_LOW, and then all cgroups are throttled to their
low limit (T3). cg2 will have bandwidth below its low limit most of the
time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes the cgroup bandwidth change smooth. After the disk upgrades
from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth
up to their max limit immediately. Their bandwidth limit is increased
gradually to avoid the above situation. So the above example becomes
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way the cgroups' bandwidth will be above their low limit most
of the time. This still doesn't fully utilize disk bandwidth, but that's
the price we pay for sharing.

Note this doesn't completely avoid a cgroup running under its low
limit. The best way to guarantee that a cgroup doesn't run under its low
limit is to set a max limit. For example, if we set cg1's max limit to
40, cg2 will never run under its low limit.
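
For illustration only, here is a small stand-alone C sketch of the
ramp-up rule implemented in tg_bps_limit()/tg_iops_limit() below: the
scale factor grows by one for every throtl_slice elapsed since the
upgrade (capped at 4096), and the effective limit is the low limit plus
scale * low/2, capped at the max limit. The wrapper function and the
numbers in main() (cg1 with low=10, max=40) are mine, not part of the
patch.

#include <stdio.h>
#include <stdint.h>

/* effective limit = min(max, low + scale * low/2), scale capped at 4096 */
static uint64_t effective_limit(uint64_t low, uint64_t max,
                                unsigned int elapsed_slices)
{
        unsigned int scale = elapsed_slices < 4096 ? elapsed_slices : 4096;
        uint64_t increase = (low >> 1) * scale;

        return low + increase < max ? low + increase : max;
}

int main(void)
{
        /* cg1 from the example above: low limit 10MB/s, max limit 40MB/s */
        for (unsigned int s = 0; s <= 8; s++)
                printf("slice %u: %llu MB/s\n", s,
                       (unsigned long long)effective_limit(10, 40, s));
        /* prints 10, 15, 20, 25, 30, 35, 40, 40, 40 */
        return 0;
}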

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0ba961..6b2f365 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -174,6 +174,8 @@ struct throtl_data
 
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
+
+   unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -228,21 +230,58 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
uint64_t ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return U64_MAX;
-   return tg->bps[rw][tg->td->limit_index];
+
+   td = tg->td;
+   ret = tg->bps[rw][td->limit_index];
+   if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] !=
+   tg->bps[rw][LIMIT_MAX]) {
+   uint64_t increase;
+
+   if (td->scale < 4096 && time_after_eq(jiffies,
+   td->low_upgrade_time + td->scale * td->throtl_slice)) {
+   unsigned int time = jiffies - td->low_upgrade_time;
+
+   td->scale = time / td->throtl_slice;
+   }
+   increase = (tg->bps[rw][LIMIT_LOW] >> 1) * td->scale;
+   ret = min(tg->bps[rw][LIMIT_MAX],
+   tg->bps[rw][LIMIT_LOW] + increase);
+   }
+   return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
unsigned int ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return UINT_MAX;
-   return tg->iops[rw][tg->td->limit_index];
+
+   td = tg->td;
+   ret = tg->iops[rw][td->limit_index];
+   if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] !=
+   tg->iops[rw][LIMIT_MAX]) {
+   uint64_t increase;
+
+   if (td->scale < 4096 && time_after_eq(jiffies,
+   td->low_upgrade_time + td->scale * td->throtl_slice)) {
+   unsigned int time = jiffies - td->low_upgrade_time;
+
+   td->scale = time / td->throtl_slice;
+   }
+
+   increase = (tg->iops[rw][LIMIT_LOW] >> 1) * td->scale;
+   ret = min(tg->iops[rw][LIMIT_MAX],
+   tg->iops[rw][LIMIT_LOW] + (unsigned int)increase);
+   }
+   return ret;
 }
 
 /**
@@ -1645,6 +1684,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
td->limit_index = LIMIT_MAX;
td->low_upgrade_time = jiffies;
+   td->scale = 0;
rcu_read_lock();
blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1662,6 +1702,13 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 static void throtl_downgrade_state(struct throtl_data *td, int new)
 {
+   td->scale /= 2;
+
+   if (td->scale) {
+  

[PATCH V5 13/17] blk-throttle: ignore idle cgroup limit

2016-12-15 Thread Shaohua Li
The last patch introduces a way to detect an idle cgroup. We use it to
make the upgrade/downgrade decision. The new algorithm can detect a
completely idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 40 ++--
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0218ff9..62fe72ea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,8 +151,6 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
-   unsigned long last_dispatch_time[2];
-
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -495,8 +493,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 * Update has_rules[] after a new group is brought online.
 */
tg_update_has_rules(tg);
-   tg->last_dispatch_time[READ] = jiffies;
-   tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1669,9 +1665,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
return true;
 
if (time_after_eq(jiffies,
-tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-   time_after_eq(jiffies,
-tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+   tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+   throtl_tg_is_idle(tg))
return true;
return false;
 }
@@ -1718,6 +1713,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+   unsigned long now = jiffies;
+
+   if (tg->td->limit_index != LIMIT_LOW)
+   return;
+
+   if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+   return;
+
+   tg->last_check_time = now;
+
+   if (!time_after_eq(now,
+__tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+   return;
+
+   if (throtl_can_upgrade(tg->td, NULL))
+   throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
struct cgroup_subsys_state *pos_css;
@@ -1759,18 +1774,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
*tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;
 
-   if (time_after_eq(now, tg->last_dispatch_time[READ] +
-   td->throtl_slice) &&
-   time_after_eq(now, tg->last_dispatch_time[WRITE] +
-   td->throtl_slice))
-   return false;
/*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
 */
if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_low_overflow_time(tg) +
-   td->throtl_slice))
+   td->throtl_slice) &&
+   (!throtl_tg_is_idle(tg) ||
+!list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
 }
@@ -1904,10 +1916,10 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
 
 again:
while (true) {
-   tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+   throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
-- 
2.9.3



[PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency

2016-12-15 Thread Shaohua Li
The user configures a latency target, but the latency threshold for
each request size isn't fixed. For an SSD, the IO latency highly depends
on the request size. To calculate the latency threshold, we sample some
data, e.g., the average latency for request sizes 4k, 8k, 16k, 32k ..
1M. The latency threshold of each request size will be the sampled
latency (I'll call it the base latency) plus the latency target. For
example, if the base latency for request size 4k is 80us and the user
configures a latency target of 60us, the 4k latency threshold will be
80 + 60 = 140us.

To sample data, we calculate the order base 2 of the rounded-up IO
sector count. If the IO size is bigger than 1M, it is accounted as 1M.
Since the calculation rounds up, the base latency will be slightly
smaller than the actual value. Also, if there isn't any IO dispatched
for a specific IO size, we use the base latency of the next smaller IO
size for that IO size.
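
As a worked example (the numbers are illustrative, not from the patch
text): a 32KB request is 64 sectors, order_base_2(64) = 6, so it falls
into bucket 6 - 3 = 3; a 4KB request (8 sectors, order 3) falls into
bucket 0; and anything of 1MB (2048 sectors, order 11) or more is
clamped to the last bucket, index 8 (see request_bucket_index() below).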

But we shouldn't sample data at arbitrary times. The base latency is
supposed to be the latency when the disk isn't congested, because we use
the latency threshold to schedule IOs between cgroups. If the disk is
congested, the latency is higher and using it for scheduling is
meaningless. Hence we only do the sampling when block throttling is at
the LOW limit, with the assumption that the disk isn't congested in that
state. If the assumption isn't true, e.g., the low limit is too high,
the calculated latency threshold will be higher.

A hard disk is completely different: its latency depends on spindle
seeks instead of request size. Currently this feature is SSD only; we
could probably use a fixed threshold like 4ms for hard disks though.

Signed-off-by: Shaohua Li 
---
 block/blk-stat.c  |   4 ++
 block/blk-throttle.c  | 159 +-
 block/blk.h   |   2 +
 include/linux/blk_types.h |   9 +--
 4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 0469855..15c1621 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -7,6 +7,7 @@
 #include 
 
 #include "blk-stat.h"
+#include "blk.h"
 #include "blk-mq.h"
 
 static void blk_stat_flush_batch(struct blk_rq_stat *stat)
@@ -204,6 +205,9 @@ void blk_stat_add(struct blk_rq_stat *stat, struct request 
*rq)
__blk_stat_init(stat, now);
 
value = now - blk_stat_time(&rq->issue_stat);
+
+   blk_throtl_stat_add(rq, value);
+
if (value > stat->max)
stat->max = value;
if (value < stat->min)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3431b1d..1dc707a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
 #define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
 #define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
 
+#define SKIP_TRACK (((u64)1) << BLK_STAT_RES_SHIFT)
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -162,6 +164,19 @@ struct throtl_grp {
u64 idle_ttime_threshold;
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+   u64 total_latency;
+   int samples;
+};
+
+struct avg_latency_bucket {
+   u64 latency;
+   bool valid;
+};
+
 struct throtl_data
 {
/* service tree for active throtl groups */
@@ -185,6 +200,13 @@ struct throtl_data
unsigned long low_downgrade_time;
 
unsigned int scale;
+
+   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets;
+   unsigned long last_calculate_time;
+
+   unsigned int track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -293,6 +315,13 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, 
int rw)
return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+   int order = order_base_2(sectors);
+
+   return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1896,12 +1925,74 @@ static void blk_throtl_update_ttime(struct throtl_grp 
*tg)
tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+   int i, cpu;
+   u64 last_latency = 0;
+   u64 latency;
+
+   if (!blk_queue_nonrot(td->queue))
+   return;
+   if (time_before(jiffies, td->last_calculate_time + HZ))
+   return;
+   td->last_calculate_time = jiffies;
+
+   memset(avg_latency, 0, sizeof(avg_latency));
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+   for_each_possible_cpu(cpu) {
+   struct latency_bucket *bucket;
+
+   

[PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency

2016-12-15 Thread Shaohua Li
Here we introduce a per-cgroup latency target. The target determines
how much latency increase a cgroup can afford. We use the target latency
to calculate a threshold and use it to schedule IO for cgroups. If a
cgroup's bandwidth is below its low limit but its average latency is
below the threshold, other cgroups can safely dispatch more IO even if
their bandwidth is higher than their low limits. On the other hand, if
the first cgroup's latency is higher than the threshold, other cgroups
are throttled to their low limits. So the target latency determines how
efficiently we utilize free disk resources without sacrificing the
workload's IO latency.

For example, assume the 4k IO average latency is 50us when the disk
isn't congested. A cgroup sets the target latency to 30us. Then the
cgroup can accept 50 + 30 = 80us IO latency. If the cgroup's average IO
latency is 90us and its bandwidth is below its low limit, other cgroups
are throttled to their low limit. If the cgroup's average IO latency is
60us, other cgroups are allowed to dispatch more IO. When other cgroups
dispatch more IO, the first cgroup's IO latency will increase. If it
increases to 81us, we then throttle the other cgroups.

The user configures the interface in this way:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

The latency value is in microseconds.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 62fe72ea..3431b1d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,6 +151,7 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
+   u64 latency_target;
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -438,6 +439,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, 
int node)
}
tg->idle_ttime_threshold = U64_MAX;
 
+   /*
+* target latency default 0, eg, latency threshold is 0, which means
+* cgroup's latency is always higher than threshold
+*/
+
return &tg->pd;
 }
 
@@ -1426,6 +1432,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
char idle_time[26] = "";
+   char latency_time[26] = "";
 
if (!dname)
return 0;
@@ -1434,8 +1441,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
tg->bps_conf[WRITE][off] == U64_MAX &&
tg->iops_conf[READ][off] == UINT_MAX &&
tg->iops_conf[WRITE][off] == UINT_MAX &&
-   (off != LIMIT_LOW || tg->idle_ttime_threshold ==
- tg->td->dft_idle_ttime_threshold))
+   (off != LIMIT_LOW ||
+(tg->idle_ttime_threshold == tg->td->dft_idle_ttime_threshold &&
+ tg->latency_target == 0)))
return 0;
 
if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1456,10 +1464,18 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
else
snprintf(idle_time, sizeof(idle_time), " idle=%llu",
div_u64(tg->idle_ttime_threshold, 1000));
+
+   if (tg->latency_target == U64_MAX)
+   strcpy(latency_time, " latency=max");
+   else
+   snprintf(latency_time, sizeof(latency_time),
+   " latency=%llu",
+   div_u64(tg->latency_target, 1000));
}
 
-   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
-  dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+  dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+  latency_time);
return 0;
 }
 
@@ -1478,6 +1494,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct throtl_grp *tg;
u64 v[4];
u64 idle_time;
+   u64 latency_time;
int ret;
int index = of_cft(of)->private;
 
@@ -1493,6 +1510,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = tg->iops_conf[WRITE][index];
 
idle_time = tg->idle_ttime_threshold;
+   latency_time = tg->latency_target;
while (true) {
char tok[27];   /* wiops=18446744073709551616 */
char *p;
@@ -1526,6 +1544,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[3] = min_t(u64, val, UINT_MAX);
else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
idle_time = val;
+   else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+   latency_time = val;
 

[PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit

2016-12-15 Thread Shaohua Li
Each queue will have a state machine. Initially the queue is in
LIMIT_LOW state, which means all cgroups will be throttled according to
their low limit. After all cgroups with a low limit cross that limit,
the queue state gets upgraded to LIMIT_MAX state.
For the max limit, a cgroup uses the limit configured by the user.
For the low limit, a cgroup uses the minimum of the low limit and the
max limit configured by the user. The last patch already did that
conversion.
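
As an illustration of the limit selection described above (a
stand-alone sketch, not code from the patch; the helper name is mine):

enum { LIMIT_LOW, LIMIT_MAX };

/* Effective cap for one direction: min(low, max) while the queue is in
 * LIMIT_LOW state, the configured max once it has been upgraded. */
static unsigned long long effective_cap(unsigned long long low_conf,
                                        unsigned long long max_conf,
                                        int limit_index)
{
        if (limit_index == LIMIT_LOW)
                return low_conf < max_conf ? low_conf : max_conf;
        return max_conf;
}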

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fcc4199..e55bd36 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,11 +212,21 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   uint64_t ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return U64_MAX;
return tg->bps[rw][tg->td->limit_index];
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   unsigned int ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return UINT_MAX;
return tg->iops[rw][tg->td->limit_index];
 }
 
-- 
2.9.3



[PATCH V5 11/17] blk-throttle: add a simple idle detection

2016-12-15 Thread Shaohua Li
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such a case, the queue state
machine will remain in LIMIT_LOW state and all other cgroups will be
throttled according to their low limit. This is unfair for the other
cgroups. We should treat the cgroup as idle and upgrade the state
machine to the LIMIT_MAX state.

We also have downgrade logic. If the state machine upgrades because the
cgroup is idle (real idle), the state machine will downgrade soon
because the cgroup is below its low limit. This isn't what we want. A
more complicated case is when the cgroup isn't idle while the queue is
in LIMIT_LOW. But when the queue gets upgraded to LIMIT_MAX, other
cgroups could dispatch more IO and this cgroup can't dispatch enough IO,
so the cgroup falls below its low limit and looks idle (fake idle). In
this case, the queue should downgrade soon. The key to deciding whether
we should downgrade is to detect whether the cgroup is truly idle.

Unfortunately it's very hard to determine whether a cgroup is really
idle. This patch uses the 'think time check' idea from CFQ for that
purpose. Please note the idea doesn't work for all workloads. For
example, a workload with IO depth 8 has 100% disk utilization, hence its
think time is 0, i.e., not idle. But the workload could run at higher
bandwidth with IO depth 16. Compared to IO depth 16, the IO depth 8
workload is idle. We use the idea to roughly determine whether a cgroup
is idle.

We treat a cgroup as idle if its think time is above a threshold (by
default 50us for an SSD and 1ms for a HD). The idea is that think time
above the threshold starts to harm performance. A HD is much slower, so
a longer think time is acceptable.
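
A minimal user-space sketch of the think-time bookkeeping this patch
adds (mirroring blk_throtl_update_ttime() and the think-time half of
throtl_tg_is_idle() below; the 50us value is the SSD default mentioned
above, everything else here, including the names, is illustrative):

#include <stdint.h>
#include <stdbool.h>

#define IDLE_TTIME_THRESHOLD_SSD_NS (50ULL * 1000) /* 50 us */

struct think_time {
        uint64_t avg_ttime;            /* EWMA of gaps between IOs, in ns */
        uint64_t last_finish_time;     /* completion time of the last IO, ns */
};

/* On arrival of a new bio: fold the gap since the last completion into
 * the running average with weight 1/8, i.e. avg = (7*avg + sample) / 8. */
static void update_ttime(struct think_time *tt, uint64_t now_ns)
{
        if (tt->last_finish_time && now_ns > tt->last_finish_time)
                tt->avg_ttime = (tt->avg_ttime * 7 +
                                 (now_ns - tt->last_finish_time)) >> 3;
}

/* The group is considered idle once its average think time exceeds the
 * threshold (the real patch also checks how long ago the last IO finished). */
static bool group_is_idle(const struct think_time *tt)
{
        return tt->avg_ttime > IDLE_TTIME_THRESHOLD_SSD_NS;
}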

Signed-off-by: Shaohua Li 
---
 block/bio.c   |  2 ++
 block/blk-throttle.c  | 65 ++-
 block/blk.h   |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 2b37502..948ebc1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include 
 
 #include 
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1813,6 +1814,7 @@ void bio_endio(struct bio *bio)
goto again;
}
 
+   blk_throtl_bio_endio(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6b2f365..3cd8400 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,9 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 #define DFL_THROTL_SLICE (HZ / 10)
 #define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
+#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -153,6 +156,11 @@ struct throtl_grp {
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
+
+   u64 last_finish_time;
+   u64 checked_last_finish_time;
+   u64 avg_ttime;
+   u64 idle_ttime_threshold;
 };
 
 struct throtl_data
@@ -428,6 +436,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, 
int node)
tg->iops_conf[rw][index] = UINT_MAX;
}
}
+   tg->idle_ttime_threshold = U64_MAX;
 
return &tg->pd;
 }
@@ -1607,6 +1616,20 @@ static unsigned long tg_last_low_overflow_time(struct 
throtl_grp *tg)
return ret;
 }
 
+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+   /*
+* cgroup is idle if:
+* - single idle is too long, longer than a fixed value (in case user
+*   configure a too big threshold) or 4 times of slice
+* - average think time is more than threshold
+*/
+   u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
+   time = min_t(u64, MAX_IDLE_TIME, time);
+   return ktime_get_ns() - tg->last_finish_time > time ||
+  tg->avg_ttime > tg->idle_ttime_threshold;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1808,6 +1831,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
tg->last_io_disp[WRITE] = 0;
 }
 
+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+   u64 now = ktime_get_ns();
+   u64 last_finish_time = tg->last_finish_time;
+
+   if (now <= last_finish_time || last_finish_time == 0 ||
+   last_finish_time == tg->checked_last_finish_time)
+   return;
+
+   tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+   tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
 {
@@ -1816,9 +1852,18 @@ bool blk_throtl_bio

Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Jens Axboe
On Thu, Dec 15 2016, Omar Sandoval wrote:
> Hey, Jens, a couple of minor nits below.
> 
> One bigger note: adding the blk_mq_sched_*() helpers and keeping the
> blk_mq_*() helpers that they replaced seems risky. For example,
> blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
> we kept blk_mq_free_request(). There are definitely some codepaths that
> are still using blk_mq_free_request() that are now wrong
> (__nvme_submit_user_cmd() is the most obvious one I saw).
> 
> Can we get rid of the old, non-sched functions? Or maybe even make old
> interface do the sched stuff instead of adding blk_mq_sched_*()?

You are right, that is a concern. I'll think of some ways to lessen
that risk; it's much better to have build-time breakage than runtime
breakage.

> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 8d1cec8e25d1..d10a246a3bc7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int 
> > cpu)
> >   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
> >   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
> >   *
> > - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> > - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> > - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> > - * is ignored.
> > + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is 
> > set
> > + * in pending bitmap and tries to retrieve requests in 
> > hctx->ctxs[0]->rq_list.
> > + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list 
> > is
> > + * ignored.
> >   */
> 
> This belongs in patch 4 where flush_busy_ctxs() got renamed.

Fixed, thanks!

> >  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
> >  {
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index e59f5ca520a2..a5ddc860b220 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct 
> > blk_mq_tag_set *set,
> >   */
> >  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request 
> > *rq,
> > bool at_head);
> > +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx 
> > *ctx,
> > +   struct list_head *list);
> > +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
> 
> Looks like the declaration of blk_mq_process_sw_list() survived a rebase
> even though it doesn't exist anymore.

Indeed, now killed.

-- 
Jens Axboe



Re: [PATCH 5/7] blk-mq-sched: add framework for MQ capable IO schedulers

2016-12-15 Thread Omar Sandoval
Hey, Jens, a couple of minor nits below.

One bigger note: adding the blk_mq_sched_*() helpers and keeping the
blk_mq_*() helpers that they replaced seems risky. For example,
blk_mq_free_request() is superseded by blk_mq_sched_put_request(), but
we kept blk_mq_free_request(). There are definitely some codepaths that
are still using blk_mq_free_request() that are now wrong
(__nvme_submit_user_cmd() is the most obvious one I saw).

Can we get rid of the old, non-sched functions? Or maybe even make old
interface do the sched stuff instead of adding blk_mq_sched_*()?

On Wed, Dec 14, 2016 at 10:26:06PM -0700, Jens Axboe wrote:
> Signed-off-by: Jens Axboe 
> ---
>  block/Makefile   |   2 +-
>  block/blk-core.c |   7 +-
>  block/blk-exec.c |   3 +-
>  block/blk-flush.c|   7 +-
>  block/blk-mq-sched.c | 375 
> +++
>  block/blk-mq-sched.h | 190 
>  block/blk-mq-tag.c   |   1 +
>  block/blk-mq.c   | 192 ++--
>  block/blk-mq.h   |   3 +
>  block/elevator.c | 186 +--
>  include/linux/blk-mq.h   |   3 +-
>  include/linux/elevator.h |  29 
>  12 files changed, 833 insertions(+), 165 deletions(-)
>  create mode 100644 block/blk-mq-sched.c
>  create mode 100644 block/blk-mq-sched.h
> 

[snip]

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8d1cec8e25d1..d10a246a3bc7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2267,10 +2229,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
>   * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
>   * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
>   *
> - * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
> - * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
> - * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
> - * is ignored.
> + * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is 
> set
> + * in pending bitmap and tries to retrieve requests in 
> hctx->ctxs[0]->rq_list.
> + * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
> + * ignored.
>   */

This belongs in patch 4 where flush_busy_ctxs() got renamed.

>  static int blk_mq_queue_reinit_prepare(unsigned int cpu)
>  {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e59f5ca520a2..a5ddc860b220 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -47,6 +47,9 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct 
> blk_mq_tag_set *set,
>   */
>  void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   bool at_head);
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx 
> *ctx,
> + struct list_head *list);
> +void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);

Looks like the declaration of blk_mq_process_sw_list() survived a rebase
even though it doesn't exist anymore.

[snip]


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-15 Thread Jerome Glisse
On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote:
> On Wed 14-12-16 12:15:14, Jerome Glisse wrote:
> <snip: discussion of device page handling>
> 
> > > So won't it be easier to leave the pagecache page where it is and *copy* 
> > > it
> > > to the device? Can the device notify us *before* it is going to modify a
> > > page, not just after it has modified it? Possibly if we just give it the
> > > page read-only and it will have to ask CPU to get write permission? If 
> > > yes,
> > > then I belive this could work and even fs support should be doable.
> > 
> > Well yes and no. Device obey the same rule as CPU so if a file back page is
> > map read only in the process it must first do a write fault which will call
> > in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> > there is no way to be notify by hardware on every write. First the hardware
> > do not have the capability. Second we are talking thousand (10 000 is upper
> > range in today device) of concurrent thread, each can possibly write to page
> > under consideration.
> 
> Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
> notification which apparently it is. OK.
> 
> > We really want the device page to behave just like regular page. Most fs 
> > code
> > path never map file content, it only happens during read/write and i believe
> > this can be handled either by migrating back or by using bounce page. I want
> > to provide the choice between the two solutions as one will be better for 
> > some
> > workload and the other for different workload.
> 
> I agree with keeping page used by the device behaving as similar as
> possible as any other page. I'm just exploring different possibilities how
> to make that happen. E.g. the scheme I was aiming at is:
> 
> When you want page A to be used by the device, you set up page A' in the
> device but make sure any access to it will fault.
> 
> When the device wants to access A', it notifies the CPU, that writeprotects
> all mappings of A, copy A to A' and map A' read-only for the device.
> 
> When the device wants to write to A', it notifies CPU, that will clear all
> mappings of A and mark A as not-uptodate & dirty. When the CPU will then
> want to access the data in A again - we need to catch ->readpage,
> ->readpages, ->writepage, ->writepages - it will writeprotect A' in
> the device, copy data to A, mark A as uptodate & dirty, and off we go.
> 
> When we want to write to the page on CPU - we get either wp fault if it was
> via mmap, or we have to catch that in places using kmap() - we just remove
> access to A' from the device.
> 
> This scheme makes the device mapping functionality transparent to the
> filesystem (you actually don't need to hook directly into ->readpage etc.
> handlers, you can just have wrappers around them for this functionality)
> and fairly straightforward... It is so transparent that even direct IO works
> with this since the page cache invalidation pass we do before actually doing
> the direct IO will make sure to pull all the pages from the device and write
> them to disk if needed. What do you think?

This is doable but i think it will require the same amount of changes as
what i had in mind (excluding the block bounce code), with one drawback:
doing it that way we cannot free page A.

On some workloads this probably does not hurt much, but on a workload
where you read a big dataset from disk and then use it only on the GPU
for a long period of time (minutes/hours) you will waste GBs of system
memory.

Right now i am working on some other patchset; i intend to take a stab
at this in the January/February time frame, before the summit, so i can
post an RFC and have a clear picture of every code path that needs
modifications. I expect this would provide a better frame for the
discussion.

I assume i will have to change ->readpage, ->readpages, ->writepage and
->writepages, but i think that the only places i really need to change
are do_generic_file_read() and generic_perform_write() (or
iov_iter_copy_*). Of course this only applies to fs that use those
generic helpers.

I also probably will change ->mmap, or rather the helper it uses to set
the pte, depending on what looks better.

Note that i don't think wrapping is an easy task. I would need to
replace page A's mapping (struct page.mapping) to point to a wrapping
address_space, but there are enough places in the kernel that directly
dereference that and expect to hit the right (real) address_space. I
would need to replace all dereferences of page->mapping with a helper
function and possibly would need to change some of the call site logic
accordingly. This might prove a bigger change than just having to use
bounce pages in do_generic_file_read() and generic_perform_write().

Cheers,
Jérôme


[PATCH RESEND] block: allow WRITE_SAME commands with the SG_IO ioctl

2016-12-15 Thread Mauricio Faria de Oliveira
The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]
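
For context, this is roughly the filter logic that rejects the command
(a toy user-space model, not the kernel source; the function and
parameter names here are made up): the ioctl passes if the caller has
CAP_SYS_RAWIO, if the opcode is marked read-safe, or if it is marked
write-safe and the device was opened for writing, and fails with -EPERM
otherwise. The patch below simply adds the WRITE SAME opcodes to the
write-safe set.

#include <stdbool.h>

static int verify_opcode(bool has_cap_sys_rawio, bool opcode_read_ok,
                         bool opcode_write_ok, bool opened_for_write)
{
        if (has_cap_sys_rawio)
                return 0;       /* raw I/O capability bypasses the filter */
        if (opcode_read_ok)
                return 0;       /* read-safe opcodes always pass */
        if (opcode_write_ok && opened_for_write)
                return 0;       /* write-safe opcodes need a writable open */
        return -1;              /* the kernel returns -EPERM at this point */
}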

The problem can be reproduced with the sg_write_same command

  # sg_write_same --num 1 --xferlen 512 /dev/sda
  #

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
Write same: pass through os error: Operation not permitted
  #

For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
  #

So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:

  # capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
  #

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

  [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
  [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
  [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
  [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 
00
  [...] blk_update_request: I/O error, dev sda, sector 17096824

Links:
[1] 
http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

Signed-off-by: Mauricio Faria de Oliveira 
Signed-off-by: Brahadambal Srinivasan 
Reported-by: Manjunatha H R 
---
 block/scsi_ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 0774799..c6fee74 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -182,6 +182,9 @@ static void blk_set_cmd_filter_defaults(struct 
blk_cmd_filter *filter)
__set_bit(WRITE_16, filter->write_ok);
__set_bit(WRITE_LONG, filter->write_ok);
__set_bit(WRITE_LONG_2, filter->write_ok);
+   __set_bit(WRITE_SAME, filter->write_ok);
+   __set_bit(WRITE_SAME_16, filter->write_ok);
+   __set_bit(WRITE_SAME_32, filter->write_ok);
__set_bit(ERASE, filter->write_ok);
__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
__set_bit(MODE_SELECT, filter->write_ok);
-- 
2.7.4



RE: [PATCH RFC] block: fix bio merge checks when virt_boundary is set

2016-12-15 Thread Dexuan Cui
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Vitaly Kuznetsov
> Sent: Wednesday, April 20, 2016 21:48
> To: Ming Lei 
> Cc: Keith Busch ; linux-block@vger.kernel.org; Linux
> Kernel Mailing List ; Jens Axboe
> ; Dan Williams ; Martin K.
> Petersen ; Sagi Grimberg
> ; Mike Snitzer ; KY Srinivasan
> ; Cathy Avery 
> Subject: Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
> 
> Ming Lei  writes:
> 
> > On Fri, Mar 18, 2016 at 10:59 AM, Ming Lei  wrote:
> >> On Fri, Mar 18, 2016 at 12:39 AM, Keith Busch 
> wrote:
> >>> On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
>  Keith Busch  writes:
>  > been combined. In any case, I think you can get what you're after just
>  > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the
> following
>  > look ok to you?
>  >
> 
>  Thanks, it does.
> >>>
> >>> Cool, thanks for confirming.
> >>>
>  Will you send it or would you like me to do that with your Suggested-by?
> >>>
> >>> I'm not confident yet this doesn't break anything, particularly since
> >>> we moved the gap check after the length check. Just wanted to confirm
> >>> the concept addressed your concern, but still need to take a closer look
> >>> and test before submitting.
> >>
> >> IMO, the change on blk_bio_segment_split() is correct, because actually it
> >> is a sg gap and the check should have been done between segments
> >> instead of bvecs. So it is reasonable to move the check just before 
> >> populating
> >> a new segment.
> >
> > Thinking of the 1st part change further, looks it is just correct in 
> > concept,
> > but wrong from current implementation. Because of bios/reqs merge,
> > blk_rq_map_sg() may end one segment in any bvec in theroy, so I guess
> > that is why each non-1st bvec need the check to make sure no sg gap.
> > Looks a very crazy limit, :-)
> >
> >>
> >> But for the 2nd change in bio_will_gap(), which should fix Vitaly's 
> >> problem, I
> >> am still not sure if it is completely correct. bio_will_gap() is used
> >> to check if two
> >> bios may be merged. Suppose two bios are continues physically, the last 
> >> bvec
> >> in 1st bio and the first bvec in 2nd bio might not be in one same segment
> >> because of segment size limit.
> >
> > How about the attached patch?
> >
> 
> I just wanted to revive the discussion as the issue persists. I
> re-tested your patch against 4.6-rc4 and it efficiently solves the
> issue.
> 
> pre-patch:
> # time mkfs.ntfs /dev/sdb1
> Cluster size has been automatically set to 4096 bytes.
> Initializing device with zeroes: 100% - Done.
> Creating NTFS volume structures.
> mkntfs completed successfully. Have a nice day.
> 
> real8m10.977s
> user0m0.115s
> sys0m12.672s
> 
> post-patch:
> # time mkfs.ntfs /dev/sdb1
> Cluster size has been automatically set to 4096 bytes.
> Initializing device with zeroes: 100% - Done.
> Creating NTFS volume structures.
> mkntfs completed successfully. Have a nice day.
> 
> real0m42.430s
> user0m0.171s
> sys0m7.675s
> 
> Will you send this patch? Please let me know if I can further
> assist. Thanks!
> 
> --
>   Vitaly

Hi, I'm reviving the thread because I'm suffering from exactly the same issue.
This is the thread I created today: 
"Big I/O requests are split into small ones due to unaligned ext4 partition 
boundary?"
http://marc.info/?t=14818034612&r=1&w=2

Ming's patch can fix this issue for me. 

Stable 4.4 and later are affected too.
I didn't check 4.3.x kernels, but for Linux guest on Hyper-V, any kernel with 
the
patch "storvsc: get rid of bounce buffer"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=81988a0e6b031bc80da15257201810ddcf989e64
should be affected.

Thanks,
-- Dexuan


Re: [PATCH 2/7] blk-mq: make mq_ops a const pointer

2016-12-15 Thread Bart Van Assche
On 12/15/2016 06:26 AM, Jens Axboe wrote:
> We never change it, make that clear.

Reviewed-by: Bart Van Assche 


Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

2016-12-15 Thread Jan Kara
On Wed 14-12-16 12:15:14, Jerome Glisse wrote:


> > So won't it be easier to leave the pagecache page where it is and *copy* it
> > to the device? Can the device notify us *before* it is going to modify a
> > page, not just after it has modified it? Possibly if we just give it the
> > page read-only and it will have to ask CPU to get write permission? If yes,
> > then I belive this could work and even fs support should be doable.
> 
> Well yes and no. Device obey the same rule as CPU so if a file back page is
> map read only in the process it must first do a write fault which will call
> in the fs (page_mkwrite() of vm_ops). But once a page has write permission
> there is no way to be notify by hardware on every write. First the hardware
> do not have the capability. Second we are talking thousand (10 000 is upper
> range in today device) of concurrent thread, each can possibly write to page
> under consideration.

Sure, I meant whether the device is able to do equivalent of ->page_mkwrite
notification which apparently it is. OK.

> We really want the device page to behave just like regular page. Most fs code
> path never map file content, it only happens during read/write and i believe
> this can be handled either by migrating back or by using bounce page. I want
> to provide the choice between the two solutions as one will be better for some
> workload and the other for different workload.

I agree with keeping page used by the device behaving as similar as
possible as any other page. I'm just exploring different possibilities how
to make that happen. E.g. the scheme I was aiming at is:

When you want page A to be used by the device, you set up page A' in the
device but make sure any access to it will fault.

When the device wants to access A', it notifies the CPU, that writeprotects
all mappings of A, copy A to A' and map A' read-only for the device.

When the device wants to write to A', it notifies CPU, that will clear all
mappings of A and mark A as not-uptodate & dirty. When the CPU will then
want to access the data in A again - we need to catch ->readpage,
->readpages, ->writepage, ->writepages - it will writeprotect A' in
the device, copy data to A, mark A as uptodate & dirty, and off we go.

When we want to write to the page on CPU - we get either wp fault if it was
via mmap, or we have to catch that in places using kmap() - we just remove
access to A' from the device.

This scheme makes the device mapping functionality transparent to the
filesystem (you actually don't need to hook directly into ->readpage etc.
handlers, you can just have wrappers around them for this functionality)
and fairly straightforward... It is so transparent that even direct IO works
with this since the page cache invalidation pass we do before actually doing
the direct IO will make sure to pull all the pages from the device and write
them to disk if needed. What do you think?

Honza
-- 
Jan Kara 
SUSE Labs, CR


RE: Big I/O requests are split into small ones due to unaligned ext4 partition boundary?

2016-12-15 Thread Dexuan Cui
> From: Ming Lei [mailto:tom.leim...@gmail.com]
> Sent: Thursday, December 15, 2016 20:43
> 
> On Thu, Dec 15, 2016 at 7:47 PM, Dexuan Cui  wrote:
> > Hi, when I run "mkfs.ext4 /dev/sdc2" in a Linux virtual machine on Hyper-V,
> > where a disk IOPS=500 limit is applied by me [0],  the command takes much
> > more time, if the ext4 partition boundary is not properly aligned:
> >
> > Example 1 [1]: it takes ~7 minutes with average wMB/s = 0.3   (slow)
> > Example 2 [2]: it takes ~3.5 minutes with average wMB/s = 0.6 (slow)
> > Example 3 [3]: it takes ~0.5 minute with average wMB/s = 4 (expected)
> >
> > strace shows the mkfs.ext3 program calls seek()/write() a lot and most of
> > the writes use 32KB buffers (this should be big enough), and the program
> > only invokes fsync() once, after it issues all the writes -- the fsync() 
> > takes
> >>99% of the time.
> >
> > By logging SCSI commands, the SCSI Write(10) command is used here for the
> > userspace 32KB write:
> > in example 1, *each* command writes 1 or 2 sectors only (1 sector = 512
> bytes);
> > in example 2, *each* command writes 2 or 4 sectors only;
> > in example 3, each command writes 1024 sectors.
> >
> > It looks the kernel block I/O layer can somehow split big user-space buffers
> > into really small write requests (1, 2, and 4 sectors)?
> > This looks really strange to me.
> >
> > Note: in my test, this strange issue happens to 4.4 and the mainline 4.9 
> > kernels,
> > but the stable 3.18.45 kernel doesn't have the issue, i.e. all the 3 above 
> > test
> > examples can finish in ~0.5 minute.
> >
> > Any comment?
> 
> I remember that we discussed this kind of issue, please see the discussion[1]
> and check if the patch[2] can fix your issue.
> 
> [1] http://marc.info/?t=14580552552&r=1&w=2
> [2] http://marc.info/?l=linux-kernel&m=145934325429152&w=2
> 
> Ming
 
Thank you very much, Ming! The patch can fix my issue!
It looks your patch was not merged into the upstream somehow.
Would you please submit the patch again?

Thanks,
-- Dexuan


Re: Big I/O requests are split into small ones due to unaligned ext4 partition boundary?

2016-12-15 Thread Ming Lei
On Thu, Dec 15, 2016 at 7:47 PM, Dexuan Cui  wrote:
> Hi, when I run "mkfs.ext4 /dev/sdc2" in a Linux virtual machine on Hyper-V,
> where a disk IOPS=500 limit is applied by me [0],  the command takes much
> more time, if the ext4 partition boundary is not properly aligned:
>
> Example 1 [1]: it takes ~7 minutes with average wMB/s = 0.3   (slow)
> Example 2 [2]: it takes ~3.5 minutes with average wMB/s = 0.6 (slow)
> Example 3 [3]: it takes ~0.5 minute with average wMB/s = 4 (expected)
>
> strace shows the mkfs.ext3 program calls seek()/write() a lot and most of
> the writes use 32KB buffers (this should be big enough), and the program
> only invokes fsync() once, after it issues all the writes -- the fsync() takes
>>99% of the time.
>
> By logging SCSI commands, the SCSI Write(10) command is used here for the
> userspace 32KB write:
> in example 1, *each* command writes 1 or 2 sectors only (1 sector = 512 
> bytes);
> in example 2, *each* command writes 2 or 4 sectors only;
> in example 3, each command writes 1024 sectors.
>
> It looks the kernel block I/O layer can somehow split big user-space buffers
> into really small write requests (1, 2, and 4 sectors)?
> This looks really strange to me.
>
> Note: in my test, this strange issue happens to 4.4 and the mainline 4.9 
> kernels,
> but the stable 3.18.45 kernel doesn't have the issue, i.e. all the 3 above 
> test
> examples can finish in ~0.5 minute.
>
> Any comment?

I remember that we discussed this kind of issue, please see the discussion[1]
and check if the patch[2] can fix your issue.

[1] http://marc.info/?t=14580552552&r=1&w=2
[2] http://marc.info/?l=linux-kernel&m=145934325429152&w=2


Thanks,
Ming


>
> Thanks!
> -- Dexuan
>
>
> [0] The max IOPS are measured in 8KB increments, meaning the max
>  throughput is 8KB * 500 = 4000KB.
>
> [1] This is the partition info of my 20GB disk:
> # fdisk -l /dev/sdc
> Disk /dev/sdc: 20 GiB, 21474836480 bytes, 41943040 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> Disklabel type: dos
> Disk identifier: 0x
>
> Device BootStart  End  Sectors  Size Id Type
> /dev/sdc1  1 14281784 14281784  6.8G 82 Linux swap / Solaris
> /dev/sdc2   14281785 41929649 27647865 13.2G 83 Linux
>
> Here, start_sector = 14281785, end_sector = 41929649.
>
> [2] start_sector = 14282752, end_sector = 41929649
>
> [3] start_sector = 14282752, end_sector = 41943039



-- 
Ming Lei


Big I/O requests are split into small ones due to unaligned ext4 partition boundary?

2016-12-15 Thread Dexuan Cui
Hi, when I run "mkfs.ext4 /dev/sdc2" in a Linux virtual machine on Hyper-V,
where a disk IOPS=500 limit is applied by me [0],  the command takes much
more time, if the ext4 partition boundary is not properly aligned:

Example 1 [1]: it takes ~7 minutes with average wMB/s = 0.3   (slow)
Example 2 [2]: it takes ~3.5 minutes with average wMB/s = 0.6 (slow)
Example 3 [3]: it takes ~0.5 minute with average wMB/s = 4 (expected)

strace shows the mkfs.ext3 program calls seek()/write() a lot and most of
the writes use 32KB buffers (this should be big enough), and the program
only invokes fsync() once, after it issues all the writes -- the fsync() takes
>99% of the time.

By logging SCSI commands, the SCSI Write(10) command is used here for the
userspace 32KB write:
in example 1, *each* command writes 1 or 2 sectors only (1 sector = 512 bytes);
in example 2, *each* command writes 2 or 4 sectors only;
in example 3, each command writes 1024 sectors.

It looks the kernel block I/O layer can somehow split big user-space buffers
into really small write requests (1, 2, and 4 sectors)?
This looks really strange to me.

Note: in my test, this strange issue happens to 4.4 and the mainline 4.9 
kernels,
but the stable 3.18.45 kernel doesn't have the issue, i.e. all the 3 above test
examples can finish in ~0.5 minute.

Any comment?

Thanks!
-- Dexuan


[0] The max IOPS are measured in 8KB increments, meaning the max
 throughput is 8KB * 500 = 4000KB.

[1] This is the partition info of my 20GB disk:
# fdisk -l /dev/sdc
Disk /dev/sdc: 20 GiB, 21474836480 bytes, 41943040 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x

Device BootStart  End  Sectors  Size Id Type
/dev/sdc1  1 14281784 14281784  6.8G 82 Linux swap / Solaris
/dev/sdc2   14281785 41929649 27647865 13.2G 83 Linux

Here, start_sector = 14281785, end_sector = 41929649.

[2] start_sector = 14282752, end_sector = 41929649

[3] start_sector = 14282752, end_sector = 41943039


Re: [RFC] block: check partition alignment

2016-12-15 Thread Christoph Hellwig
On Thu, Dec 15, 2016 at 10:33:47AM +0900, Damien Le Moal wrote:
> For a regular block device, I agree. But in Stephan case, I think that
> the check really needs to be against the physical block size, with the
> added condition that the bdev is a DASD device (similarly to the zone
> alignment check for zoned block devices).

Then they need to expose a chunk_size. The physical block size is
defined as not having a meaning for the kernel.


Re: [RFC] block: check partition alignment

2016-12-15 Thread Damien Le Moal


On 12/15/16 17:45, Christoph Hellwig wrote:
> On Thu, Dec 15, 2016 at 10:33:47AM +0900, Damien Le Moal wrote:
>> For a regular block device, I agree. But in Stephan case, I think that
>> the check really needs to be against the physical block size, with the
>> added condition that the bdev is a DASD device (similarly to the zone
>> alignment check for zoned block devices).
> 
> Then they need to expose a chunk_size.  physical block size is defined
> as not having a meaning for the kernel.

Or create the block device with the logical block size set to the
physical sector size. That would be even simpler, and in fact correct in
this case, since only physically aligned accesses make sense for DASD.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com