Re: [Nbd] NBD: exported files over something around 1 TiB get an insane device size on the client side and are actually empty

2017-01-14 Thread Alex Bligh

> On 15 Jan 2017, at 01:03, Josef Bacik  wrote:
> 
> Yeah I noticed this in testing, there's a bug with NBD since its
> inception where it uses a 32 bit number for keeping track of the size
> of the device. I fixed it with
> 
> nbd: use loff_t for blocksize and nbd_set_size args
> 
> It'll be in 4.10.  Thanks,

Interestingly enough, this is a regression, as the random 3.x kernel I tried 
did not show the issue.

Thanks for the fix Josef.

-- 
Alex Bligh






[PATCH V6 05/18] blk-throttle: add upgrade logic for LIMIT_LOW state

2017-01-14 Thread Shaohua Li
When the queue is in the LIMIT_LOW state and all cgroups with a low limit
cross their bps/iops limits, we upgrade the queue's state to LIMIT_MAX. To
determine whether a cgroup exceeds its limit, we check if the cgroup has
pending requests. Since the cgroup is throttled according to the limit,
pending requests mean the cgroup has reached the limit.

If a cgroup has limits set for both read and write, we consider the
combination of them for the upgrade. The reason is that read IO and write
IO can interfere with each other; if we based the upgrade on one direction
of IO, the other direction could be severely harmed.

For a cgroup hierarchy there are two cases. If the children have a lower
low limit than the parent, the parent's low limit is meaningless: once the
children's bps/iops cross their low limits, we can upgrade the queue state.
The other case is that the children have a higher low limit than the
parent. Then the children's low limits are meaningless: as long as the
parent's bps/iops (which is the sum of the children's bps/iops) cross its
low limit, we can upgrade the queue state.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 100 ---
 1 file changed, 96 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3bc6deb..4cc066f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -457,6 +457,7 @@ static void blk_throtl_update_limit_valid(struct 
throtl_data *td)
td->limit_valid[LIMIT_LOW] = low_valid;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -468,9 +469,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
blk_throtl_update_limit_valid(tg->td);
 
-   if (tg->td->limit_index == LIMIT_LOW &&
-   !tg->td->limit_valid[LIMIT_LOW])
-   tg->td->limit_index = LIMIT_MAX;
+   if (!tg->td->limit_valid[tg->td->limit_index])
+   throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1079,6 +1079,8 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+   struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1105,6 +1107,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
int ret;
 
spin_lock_irq(q->queue_lock);
+   if (throtl_can_upgrade(td, NULL))
+   throtl_upgrade_state(td);
+
 again:
parent_sq = sq->parent_sq;
dispatched = false;
@@ -1518,6 +1523,87 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
 };
 
+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+   struct throtl_service_queue *sq = &tg->service_queue;
+   bool read_limit, write_limit;
+
+   /*
+* if cgroup reaches low limit (if low limit is 0, the cgroup always
+* reaches), it's ok to upgrade to next limit
+*/
+   read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW];
+   write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW];
+   if (!read_limit && !write_limit)
+   return true;
+   if (read_limit && sq->nr_queued[READ] &&
+   (!write_limit || sq->nr_queued[WRITE]))
+   return true;
+   if (write_limit && sq->nr_queued[WRITE] &&
+   (!read_limit || sq->nr_queued[READ]))
+   return true;
+   return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+   while (true) {
+   if (throtl_tg_can_upgrade(tg))
+   return true;
+   tg = sq_to_tg(tg->service_queue.parent_sq);
+   if (!tg || !tg_to_blkg(tg)->parent)
+   return false;
+   }
+   return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+   struct throtl_grp *this_tg)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+
+   if (td->limit_index != LIMIT_LOW)
+   return false;
+
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+
+   if (tg == this_tg)
+   continue;
+   if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+   continue;
+   if (!throtl_hierarchy_can_upgrade(tg)) {
+   rcu_read_unlock();
+   return false;
+   }
+   }
+   rcu_read_unlock();
+   return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+
+   td->limit_index = LIMIT_MAX;
+

[PATCH V6 04/18] blk-throttle: configure bps/iops limit for cgroup in low limit

2017-01-14 Thread Shaohua Li
Each queue will have a state machine. Initially the queue is in the
LIMIT_LOW state, which means all cgroups will be throttled according to
their low limit. After all cgroups with a low limit cross that limit, the
queue state gets upgraded to the LIMIT_MAX state.
For the max limit, a cgroup will use the limit configured by the user.
For the low limit, a cgroup will use the minimum of the low limit and the
max limit configured by the user. If that minimum is 0, which means the
cgroup doesn't configure a low limit, we will use the max limit to throttle
the cgroup and the cgroup is ready to upgrade to LIMIT_MAX.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d3ad43c..3bc6deb 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,12 +212,28 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
-   return tg->bps[rw][tg->td->limit_index];
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   uint64_t ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return U64_MAX;
+   ret = tg->bps[rw][tg->td->limit_index];
+   if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+   return tg->bps[rw][LIMIT_MAX];
+   return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
-   return tg->iops[rw][tg->td->limit_index];
+   struct blkcg_gq *blkg = tg_to_blkg(tg);
+   unsigned int ret;
+
+   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+   return UINT_MAX;
+   ret = tg->iops[rw][tg->td->limit_index];
+   if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+   return tg->iops[rw][LIMIT_MAX];
+   return ret;
 }
 
 /**
-- 
2.9.3



[PATCH V6 03/18] blk-throttle: add .low interface

2017-01-14 Thread Shaohua Li
Add a low limit for cgroups and the corresponding cgroup interface. To be
consistent with memcg, we allow users to configure a .low limit higher than
the .max limit. But the internal logic always assumes the .low limit is
lower than the .max limit. So we add extra bps/iops_conf fields in
throtl_grp for the userspace configuration. The existing bps/iops fields in
throtl_grp hold the actual limits we use for throttling.
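
For illustration, the mapping from the configured to the internal limits
boils down to something like the sketch below. This is only a sketch: the
helper name is made up, and only the conf fields added in this patch are
assumed.

/*
 * Illustrative sketch only: derive the internal limits from the
 * user-configured ones so the internal .low never exceeds .max.
 * The helper name is hypothetical; field names follow the patch.
 */
static void tg_update_limits(struct throtl_grp *tg, int rw)
{
	/* .max is used as configured */
	tg->bps[rw][LIMIT_MAX] = tg->bps_conf[rw][LIMIT_MAX];
	tg->iops[rw][LIMIT_MAX] = tg->iops_conf[rw][LIMIT_MAX];

	/* .low is clamped to .max; 0 still means "no low limit configured" */
	tg->bps[rw][LIMIT_LOW] = min(tg->bps_conf[rw][LIMIT_LOW],
				     tg->bps_conf[rw][LIMIT_MAX]);
	tg->iops[rw][LIMIT_LOW] = min(tg->iops_conf[rw][LIMIT_LOW],
				      tg->iops_conf[rw][LIMIT_MAX]);
}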

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 142 +--
 1 file changed, 114 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b08369b..d3ad43c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
 #define rb_entry_tg(node)  rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
+   LIMIT_LOW,
LIMIT_MAX,
LIMIT_CNT,
 };
@@ -124,11 +125,15 @@ struct throtl_grp {
/* are there any throtl rules between this group and td? */
bool has_rules[2];
 
-   /* bytes per second rate limits */
+   /* internally used bytes per second rate limits */
uint64_t bps[2][LIMIT_CNT];
+   /* user configured bps limits */
+   uint64_t bps_conf[2][LIMIT_CNT];
 
-   /* IOPS limits */
+   /* internally used IOPS limits */
unsigned int iops[2][LIMIT_CNT];
+   /* user configured IOPS limits */
+   unsigned int iops_conf[2][LIMIT_CNT];
 
/* Number of bytes disptached in current slice */
uint64_t bytes_disp[2];
@@ -355,6 +360,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, 
int node)
tg->bps[WRITE][LIMIT_MAX] = U64_MAX;
tg->iops[READ][LIMIT_MAX] = UINT_MAX;
tg->iops[WRITE][LIMIT_MAX] = UINT_MAX;
+   tg->bps_conf[READ][LIMIT_MAX] = U64_MAX;
+   tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX;
+   tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX;
+   tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
+   /* LIMIT_LOW will have default value 0 */
 
	return &tg->pd;
 }
@@ -412,6 +422,41 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_limit_valid(struct throtl_data *td)
+{
+   struct cgroup_subsys_state *pos_css;
+   struct blkcg_gq *blkg;
+   bool low_valid = false;
+
+   rcu_read_lock();
+   blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+   struct throtl_grp *tg = blkg_to_tg(blkg);
+
+   if (tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
+   tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+   low_valid = true;
+   }
+   rcu_read_unlock();
+
+   td->limit_valid[LIMIT_LOW] = low_valid;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+   struct throtl_grp *tg = pd_to_tg(pd);
+
+   tg->bps[READ][LIMIT_LOW] = 0;
+   tg->bps[WRITE][LIMIT_LOW] = 0;
+   tg->iops[READ][LIMIT_LOW] = 0;
+   tg->iops[WRITE][LIMIT_LOW] = 0;
+
+   blk_throtl_update_limit_valid(tg->td);
+
+   if (tg->td->limit_index == LIMIT_LOW &&
+   !tg->td->limit_valid[LIMIT_LOW])
+   tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
struct throtl_grp *tg = pd_to_tg(pd);
@@ -1282,48 +1327,58 @@ static struct cftype throtl_legacy_files[] = {
{ } /* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 int off)
 {
struct throtl_grp *tg = pd_to_tg(pd);
const char *dname = blkg_dev_name(pd->blkg);
char bufs[4][21] = { "max", "max", "max", "max" };
+   u64 bps_dft;
+   unsigned int iops_dft;
 
if (!dname)
return 0;
 
-   if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
-   tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
-   tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
-   tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+   if (off == LIMIT_LOW) {
+   bps_dft = 0;
+   iops_dft = 0;
+   } else {
+   bps_dft = U64_MAX;
+   iops_dft = UINT_MAX;
+   }
+
+   if (tg->bps_conf[READ][off] == bps_dft &&
+   tg->bps_conf[WRITE][off] == bps_dft &&
+   tg->iops_conf[READ][off] == iops_dft &&
+   tg->iops_conf[WRITE][off] == iops_dft)
return 0;
 
-   if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+   if (tg->bps_conf[READ][off] != bps_dft)
snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-   tg->bps[READ][LIMIT_MAX]);
-   if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+   tg->bps_conf[READ][off]);
+   if (tg->bps_conf[WRITE][off] != bps_dft)
snprintf(bufs[1], sizeof(bufs[1]), "%llu",
- 

[PATCH V6 00/18] blk-throttle: add .low limit

2017-01-14 Thread Shaohua Li
Hi,

cgroup still lacks a good IO controller. CFQ works well for hard disks, but
not so well for SSDs. This patch set tries to add a conservative limit to
blk-throttle. It isn't proportional scheduling, but it can help prioritize
cgroups. There are several advantages to choosing blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
  request based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but
  it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism used by CFQ/BFQ. That
  mechanism has been shown to harm performance on fast SSDs.

The patch set adds a new io.low limit to blk-throttle. It's only for cgroup2.
The existing io.max is hard-limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low, by contrast, is best-effort
throttling: cgroups with a 'low' limit can run above their 'low' limit at
appropriate times. Specifically, if all cgroups reach their 'low' limit, all
cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
limit, all other cgroups will run according to their 'low' limit. So the
'low' limit plays two roles: it allows cgroups to use free bandwidth, and it
guarantees cgroups their 'low' limit.

An example usage: we have a high prio cgroup with a high 'low' limit and a
low prio cgroup with a low 'low' limit. If the high prio cgroup isn't
running, the low prio cgroup can run above its 'low' limit, so we don't
waste the bandwidth. When the high prio cgroup runs and is below its 'low'
limit, the low prio cgroup will run under its 'low' limit. This protects
the high prio cgroup so it gets more resources.

The implementation is simple. The disk queue has a state machine. We have 2
states, LIMIT_LOW and LIMIT_MAX. In each disk state, we throttle cgroups
according to the limit of that state: the io.low limit for the LIMIT_LOW
state, the io.max limit for LIMIT_MAX. The disk state can be
upgraded/downgraded between LIMIT_LOW and LIMIT_MAX according to the rule
above. Initially the disk state is LIMIT_MAX, and if no cgroup sets io.low,
the disk state will remain in the LIMIT_MAX state. Systems with only io.max
set will find nothing changed with the patches.

The first 10 patches implement the basic framework: add the interface and
handle the upgrade and downgrade logic. Patch 10 detects a special case
where a cgroup is completely idle; in this case, we ignore the cgroup's
limit. Patches 11-18 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we do not
expect. For example, one cgroup's bandwidth may drop below its io.low limit
very soon after an upgrade. Patch 10 has more details about the issue.

2. Idle cgroups. A cgroup with an io.low limit doesn't always dispatch
enough IO. Under the above upgrade rule, the disk will remain in the
LIMIT_LOW state and all other cgroups can't dispatch more IO above their
'low' limit. Hence there is waste. Patch 11 has more details about the issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth drops
below its 'low' limit rapidly. The smoothness means we could waste some
bandwidth in the transition though. But we must pay something for sharing.

Issue 2 is very hard. We introduce two mechanisms for it. One is 'idle
time' or 'think time', borrowed from CFQ. If a cgroup's average idle time is
high, we treat it as idle and its 'low' limit isn't respected. Please see
patches 12 - 14 for details. The other is 'latency target'. If a cgroup's IO
latency is low, we treat it as idle and its 'low' limit isn't respected.
Please see patches 15 - 18 for details. Both mechanisms only kick in when a
cgroup runs below its 'low' limit.

The disadvantage of blk-throttle is that it exports rather low-level knobs.
Configuration would not be easy for normal users. It would be powerful for
experienced users though.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V5->V6:
- Change default setting for io.low limit. It's 0 now, which makes more sense
- The default setting for latency is still 0, the default setting for idle time
  becomes bigger. So with the default settings, cgroups have small latency but
  disk sharing could be harmed
- Addressed other issues pointed out by Tejun

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' so consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change the interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based
  

[PATCH V6 13/18] blk-throttle: add interface to configure idle time threshold

2017-01-14 Thread Shaohua Li
Add an interface to configure the threshold. The io.low interface will look
like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

idle is in microseconds.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 41 -
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 974ec9d..88c0d1e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -181,6 +181,8 @@ struct throtl_data
unsigned int limit_index;
bool limit_valid[LIMIT_CNT];
 
+   unsigned long dft_idletime_threshold; /* us */
+
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
 
@@ -477,10 +479,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
tg->td = td;
 
-   if (blk_queue_nonrot(td->queue))
-   tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
-   else
-   tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+   tg->idletime_threshold = td->dft_idletime_threshold;
 }
 
 /*
@@ -1447,6 +1446,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
char bufs[4][21] = { "max", "max", "max", "max" };
u64 bps_dft;
unsigned int iops_dft;
+   char idle_time[26] = "";
 
if (!dname)
return 0;
@@ -1462,7 +1462,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
if (tg->bps_conf[READ][off] == bps_dft &&
tg->bps_conf[WRITE][off] == bps_dft &&
tg->iops_conf[READ][off] == iops_dft &&
-   tg->iops_conf[WRITE][off] == iops_dft)
+   tg->iops_conf[WRITE][off] == iops_dft &&
+   (off != LIMIT_LOW || tg->idletime_threshold ==
+ tg->td->dft_idletime_threshold))
return 0;
 
if (tg->bps_conf[READ][off] != bps_dft)
@@ -1477,9 +1479,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct 
blkg_policy_data *pd,
if (tg->iops_conf[WRITE][off] != iops_dft)
snprintf(bufs[3], sizeof(bufs[3]), "%u",
tg->iops_conf[WRITE][off]);
+   if (off == LIMIT_LOW) {
+   if (tg->idletime_threshold == ULONG_MAX)
+   strcpy(idle_time, " idle=max");
+   else
+   snprintf(idle_time, sizeof(idle_time), " idle=%lu",
+   tg->idletime_threshold);
+   }
 
-   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
-  dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+   seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+  dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
return 0;
 }
 
@@ -1497,6 +1506,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct blkg_conf_ctx ctx;
struct throtl_grp *tg;
u64 v[4];
+   unsigned long idle_time;
int ret;
int index = of_cft(of)->private;
 
@@ -1511,6 +1521,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = tg->iops_conf[READ][index];
v[3] = tg->iops_conf[WRITE][index];
 
+   idle_time = tg->idletime_threshold;
while (true) {
char tok[27];   /* wiops=18446744073709551616 */
char *p;
@@ -1542,6 +1553,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
v[2] = min_t(u64, val, UINT_MAX);
else if (!strcmp(tok, "wiops"))
v[3] = min_t(u64, val, UINT_MAX);
+   else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+   idle_time = val;
else
goto out_finish;
}
@@ -1570,6 +1583,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
blk_throtl_update_limit_valid(tg->td);
if (tg->td->limit_valid[LIMIT_LOW])
tg->td->limit_index = LIMIT_LOW;
+   tg->idletime_threshold = (idle_time == ULONG_MAX) ?
+   ULONG_MAX : idle_time;
}
tg_conf_updated(tg);
ret = 0;
@@ -2114,10 +2129,13 @@ void blk_throtl_register_queue(struct request_queue *q)
 
td = q->td;
BUG_ON(!td);
-   if (blk_queue_nonrot(q))
+   if (blk_queue_nonrot(q)) {
td->throtl_slice = DFL_THROTL_SLICE_SSD;
-   else
+   td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+   } else {
td->throtl_slice = DFL_THROTL_SLICE_HD;
+   td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+   }
 
/*
 * some tg are created before queue is fully initialized, eg, nonrot
@@ -2127,10 +2145,7 @@ void blk_throtl_register_queue(struct request_queue *q)
blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
 

[PATCH V6 10/18] blk-throttle: detect completed idle cgroup

2017-01-14 Thread Shaohua Li
A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to the higher level and all cgroups will
be throttled to their low limit, so we waste bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
the bandwidth.

Please note this will be replaced with a more sophisticated algorithm
later, but it demonstrates the idea of how we handle idle cgroups, so I
leave it here.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2d05c91..b3ce176 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -149,6 +149,8 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
+   unsigned long last_dispatch_time[2];
+
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -445,11 +447,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+   struct throtl_grp *tg = pd_to_tg(pd);
/*
 * We don't want new groups to escape the limits of its ancestors.
 * Update has_rules[] after a new group is brought online.
 */
-   tg_update_has_rules(pd_to_tg(pd));
+   tg_update_has_rules(tg);
+   tg->last_dispatch_time[READ] = jiffies;
+   tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1611,6 +1616,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
if (write_limit && sq->nr_queued[WRITE] &&
(!read_limit || sq->nr_queued[READ]))
return true;
+
+   if (time_after_eq(jiffies,
+tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+   time_after_eq(jiffies,
+tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+   return true;
return false;
 }
 
@@ -1688,6 +1699,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
*tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;
 
+   if (time_after_eq(now, tg->last_dispatch_time[READ] +
+   td->throtl_slice) &&
+   time_after_eq(now, tg->last_dispatch_time[WRITE] +
+   td->throtl_slice))
+   return false;
/*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
@@ -1796,6 +1812,7 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
 
 again:
while (true) {
+   tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
-- 
2.9.3



[PATCH V6 11/18] blk-throttle: make bandwidth change smooth

2017-01-14 Thread Shaohua Li
When all cgroups reach their low limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and even some
cgroups could dispatch less IO than their low limit. For example, cg1 has a
low limit of 10MB/s, cg2 of 80MB/s, and assume the disk's maximum bandwidth
is 120MB/s for the workload. Their bps could be something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO
later. Then cg1 dispatches more IO and cg2 has no room to dispatch enough
IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches
less IO than its low limit of 80MB/s, we downgrade the queue from LIMIT_MAX
to LIMIT_LOW, and all cgroups are throttled to their low limit (T3). cg2
will have bandwidth below its low limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes cgroup bandwidth change smoothly. After the disk upgrades from
LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth up to
their max limit immediately. Their bandwidth limit will be increased
gradually to avoid the above situation. So the above example will become
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way the cgroups' bandwidth will be above their low limit the
majority of the time. This still doesn't fully utilize the disk bandwidth,
but that's something we pay for sharing.

Scale up is linear: the limit scales up by 1/2 of the .low limit every
throtl_slice after an upgrade, and the scale up stops once the adjusted
limit hits the .max limit. Scale down is exponential: we cut the scale
value in half if a cgroup doesn't hit its .low limit. If the scale becomes
0, we fully downgrade the queue to the LIMIT_LOW state.
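
For illustration, the adjusted limit boils down to the arithmetic below, a
standalone restatement of throtl_adjusted_limit() from the diff (the helper
name here is made up). With low = 10MB/s it gives 10, 15, 20, 25, ... MB/s
as the scale grows, matching the example progression above, until the
caller caps it at the .max limit.

/* Sketch mirroring throtl_adjusted_limit() in the diff below */
static uint64_t adjusted_low_limit(uint64_t low, unsigned int scale)
{
	/* scale 0 -> low, scale 1 -> 1.5*low, scale 2 -> 2*low, ... */
	return low + (low >> 1) * scale;
}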

Note this doesn't completely prevent a cgroup from running under its low
limit. The best way to guarantee a cgroup doesn't run under its low limit
is to set a max limit. For example, if we set cg1's max limit to 40, cg2
will never run under its low limit.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 57 +---
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b3ce176..a9eb03b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -175,6 +175,8 @@ struct throtl_data
 
unsigned long low_upgrade_time;
unsigned long low_downgrade_time;
+
+   unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -226,29 +228,70 @@ static struct throtl_data *sq_to_td(struct 
throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
 }
 
+/*
+ * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to
+ * make the IO dispatch more smooth.
+ * Scale up: linearly scale up according to lapsed time since upgrade. For
+ *   every throtl_slice, the limit scales up 1/2 .low limit till the
+ *   limit hits .max limit
+ * Scale down: exponentially scale down if a cgroup doesn't hit its .low limit
+ */
+static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td)
+{
+   /* arbitrary value to avoid too big scale */
+   if (td->scale < 4096 && time_after_eq(jiffies,
+   td->low_upgrade_time + td->scale * td->throtl_slice))
+   td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice;
+
+   return low + (low >> 1) * td->scale;
+}
+
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
uint64_t ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return U64_MAX;
-   ret = tg->bps[rw][tg->td->limit_index];
-   if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+
+   td = tg->td;
+   ret = tg->bps[rw][td->limit_index];
+   if (ret == 0 && td->limit_index == LIMIT_LOW)
return tg->bps[rw][LIMIT_MAX];
+
+   if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] &&
+   tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) {
+   uint64_t adjusted;
+
+   adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td);
+   ret = min(tg->bps[rw][LIMIT_MAX], adjusted);
+   }
return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
struct blkcg_gq *blkg = tg_to_blkg(tg);
+   struct throtl_data *td;
unsigned int ret;
 
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
return UINT_MAX;
-   ret = tg->iops[rw][tg->td->limit_index];
+   td = tg->td;
+   ret = tg->iops[rw][td->limit_index];
if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
return tg->iops[rw][LIMIT_MAX];
+
+   if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] &&
+ 

[PATCH V6 09/18] blk-throttle: choose a small throtl_slice for SSD

2017-01-14 Thread Shaohua Li
The throtl_slice is 100ms by default. This is a long time for an SSD; a lot
of IO can run in that window. To give cgroups smoother throughput, we
choose a smaller value (20ms) for SSDs.

Signed-off-by: Shaohua Li 
---
 block/blk-sysfs.c|  2 ++
 block/blk-throttle.c | 18 +++---
 block/blk.h  |  2 ++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0e3fb2a..7285f74 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -907,6 +907,8 @@ int blk_register_queue(struct gendisk *disk)
 
blk_wb_init(q);
 
+   blk_throtl_register_queue(q);
+
if (!q->request_fn)
return 0;
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 49cad9a..2d05c91 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,8 +18,9 @@ static int throtl_grp_quantum = 8;
 /* Total max dispatch from all groups in one round */
 static int throtl_quantum = 32;
 
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-#define DFL_THROTL_SLICE (HZ / 10)
+/* Throttling is performed over a slice and after that slice is renewed */
+#define DFL_THROTL_SLICE_HD (HZ / 10)
+#define DFL_THROTL_SLICE_SSD (HZ / 50)
 #define MAX_THROTL_SLICE (HZ)
 
 static struct blkcg_policy blkcg_policy_throtl;
@@ -1957,7 +1958,6 @@ int blk_throtl_init(struct request_queue *q)
 
q->td = td;
td->queue = q;
-   td->throtl_slice = DFL_THROTL_SLICE;
 
td->limit_valid[LIMIT_MAX] = true;
td->limit_index = LIMIT_MAX;
@@ -1978,6 +1978,18 @@ void blk_throtl_exit(struct request_queue *q)
kfree(q->td);
 }
 
+void blk_throtl_register_queue(struct request_queue *q)
+{
+   struct throtl_data *td;
+
+   td = q->td;
+   BUG_ON(!td);
+   if (blk_queue_nonrot(q))
+   td->throtl_slice = DFL_THROTL_SLICE_SSD;
+   else
+   td->throtl_slice = DFL_THROTL_SLICE_HD;
+}
+
 ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
 {
if (!q->td)
diff --git a/block/blk.h b/block/blk.h
index e83e757..186c67d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -293,10 +293,12 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char 
*page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
const char *page, size_t count);
+extern void blk_throtl_register_queue(struct request_queue *q);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_register_queue(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
-- 
2.9.3



[PATCH V6 14/18] blk-throttle: ignore idle cgroup limit

2017-01-14 Thread Shaohua Li
The last patch introduced a way to detect an idle cgroup. We use it to make
the upgrade/downgrade decision. The new algorithm can detect a completely
idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c | 40 ++--
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 88c0d1e..cf8d386 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -152,8 +152,6 @@ struct throtl_grp {
 
unsigned long last_check_time;
 
-   unsigned long last_dispatch_time[2];
-
/* When did we start a new slice */
unsigned long slice_start[2];
unsigned long slice_end[2];
@@ -508,8 +506,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 * Update has_rules[] after a new group is brought online.
 */
tg_update_has_rules(tg);
-   tg->last_dispatch_time[READ] = jiffies;
-   tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1704,9 +1700,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
return true;
 
if (time_after_eq(jiffies,
-tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-   time_after_eq(jiffies,
-tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+   tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+   throtl_tg_is_idle(tg))
return true;
return false;
 }
@@ -1752,6 +1747,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+   unsigned long now = jiffies;
+
+   if (tg->td->limit_index != LIMIT_LOW)
+   return;
+
+   if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+   return;
+
+   tg->last_check_time = now;
+
+   if (!time_after_eq(now,
+__tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+   return;
+
+   if (throtl_can_upgrade(tg->td, NULL))
+   throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
struct cgroup_subsys_state *pos_css;
@@ -1793,18 +1808,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
*tg)
struct throtl_data *td = tg->td;
unsigned long now = jiffies;
 
-   if (time_after_eq(now, tg->last_dispatch_time[READ] +
-   td->throtl_slice) &&
-   time_after_eq(now, tg->last_dispatch_time[WRITE] +
-   td->throtl_slice))
-   return false;
/*
 * If cgroup is below low limit, consider downgrade and throttle other
 * cgroups
 */
if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
time_after_eq(now, tg_last_low_overflow_time(tg) +
-   td->throtl_slice))
+   td->throtl_slice) &&
+   (!throtl_tg_is_idle(tg) ||
+!list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
return true;
return false;
 }
@@ -1926,10 +1938,10 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
 
 again:
while (true) {
-   tg->last_dispatch_time[rw] = jiffies;
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
+   throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
-- 
2.9.3



[PATCH V6 17/18] blk-throttle: add a mechanism to estimate IO latency

2017-01-14 Thread Shaohua Li
The user configures a latency target, but the latency threshold for each
request size isn't fixed. For an SSD, the IO latency highly depends on the
request size. To calculate the latency threshold, we sample some data, e.g.
the average latency for request sizes 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sampled latency (I'll call it
the base latency) plus the latency target. For example, if the base latency
for request size 4k is 80us and the user configures a latency target of
60us, the 4k latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the base-2 order of the rounded-up IO size in
sectors. If the IO size is bigger than 1M, it is accounted as 1M. Since the
calculation rounds up, the base latency will be slightly smaller than the
actual value. Also, if there isn't any IO dispatched for a specific IO
size, we use the base latency of a smaller IO size for that IO size.
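
For illustration, the bucket mapping works out like this; the function body
restates request_bucket_index() from the diff below, and the comment table
is just worked examples of it.

/*
 * Sketch of the bucket mapping (same as request_bucket_index() below):
 * bucket = order_base_2(sectors) - 3, clamped to [0, LATENCY_BUCKET_SIZE - 1]
 *
 *   4k =    8 sectors -> order 3  -> bucket 0
 *   8k =   16 sectors -> order 4  -> bucket 1
 *   ...
 *   1M = 2048 sectors -> order 11 -> bucket 8
 *   >1M                           -> clamped to bucket 8
 */
static int request_bucket_index(sector_t sectors)
{
	int order = order_base_2(sectors);

	return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
}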

But we shouldn't sample data at just any time. The base latency is supposed
to be the latency when the disk isn't congested, because we use the latency
threshold to schedule IOs between cgroups. If the disk is congested, the
latency is higher and using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is at the LOW limit, with the
assumption that the disk isn't congested in that state. If the assumption
isn't true, e.g. the low limit is too high, the calculated latency
threshold will be higher.

Hard disks are completely different: latency depends on the spindle seek
instead of the request size. Currently this feature is SSD only; we could
probably use a fixed threshold like 4ms for hard disks though.

Signed-off-by: Shaohua Li 
---
 block/blk-stat.c  |   4 ++
 block/blk-throttle.c  | 162 --
 block/blk.h   |   2 +
 include/linux/blk_types.h |   9 +--
 4 files changed, 168 insertions(+), 9 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 7e9df17..073eff4 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -7,6 +7,7 @@
 #include 
 
 #include "blk-stat.h"
+#include "blk.h"
 #include "blk-mq.h"
 
 static void blk_stat_flush_batch(struct blk_rq_stat *stat)
@@ -204,6 +205,9 @@ void blk_stat_add(struct blk_rq_stat *stat, struct request 
*rq)
__blk_stat_init(stat, now);
 
value = now - blk_stat_time(&rq->issue_stat);
+
+   blk_throtl_stat_add(rq, value);
+
if (value > stat->max)
stat->max = value;
if (value < stat->min)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b2fab58..c0eb100 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,8 @@ static int throtl_quantum = 32;
 /* default latency target is 0, eg, guarantee IO latency by default */
 #define DFL_LATENCY_TARGET (0)
 
+#define SKIP_LATENCY (((u64)1) << BLK_STAT_RES_SHIFT)
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -165,6 +167,19 @@ struct throtl_grp {
unsigned long idletime_threshold; /* us */
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+   unsigned long total_latency; /* ns / 1024 */
+   int samples;
+};
+
+struct avg_latency_bucket {
+   unsigned long latency; /* ns / 1024 */
+   bool valid;
+};
+
 struct throtl_data
 {
/* service tree for active throtl groups */
@@ -188,6 +203,13 @@ struct throtl_data
unsigned long low_downgrade_time;
 
unsigned int scale;
+
+   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets;
+   unsigned long last_calculate_time;
+
+   bool track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +328,13 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, 
int rw)
return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+   int order = order_base_2(sectors);
+
+   return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1927,6 +1956,67 @@ static void blk_throtl_update_idletime(struct throtl_grp 
*tg)
tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+   int i, cpu;
+   unsigned long last_latency = 0;
+   unsigned long latency;
+
+   if (!blk_queue_nonrot(td->queue))
+   return;
+   if (time_before(jiffies, td->last_calculate_time + HZ))
+   return;
+   td->last_calculate_time = jiffies;
+
+   memset(avg_latency, 0, sizeof(avg_latency));
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+   

Re: NBD: exported files over something around 1 TiB get an insane device size on the client side and are actually empty

2017-01-14 Thread Christoph Anton Mitterer
On Sat, 2017-01-14 at 20:03 -0500, Josef Bacik wrote:
> nbd: use loff_t for blocksize and nbd_set_size args

I'm just trying your patch, which however does not apply to 4.9.2 (but
I've adapted it)... I'll report later whether it worked for me.

Cheers,
Chris.



Re: [PATCH] nbd: use an idr to keep track of nbd devices

2017-01-14 Thread Josef Bacik

On Sat, Jan 14, 2017 at 4:10 PM, Sagi Grimberg  wrote:



Hey Josef,

To prepare for dynamically adding new nbd devices to the system switch
from using an array for the nbd devices and instead use an idr.  This
copies what loop does for keeping track of its devices.


I think ida_simple_* is simpler and sufficient here isn't it?


I use more of the IDR stuff in later patches, I just haven't posted 
those yet because meetings.  Thanks,


Can you elaborate on the usage? What do you intend to do
that a simple ida cannot satisfy?


I'm going to use it the same way loop does, there will be a 
/dev/nbd-control where you can say ADD, REMOVE, and GET_NEXT. I need 
the search functionality to see if we are adding something that already 
exists, and to see what is the next unused device that can be used for 
a connection.  Looking at the ida api it does not appear I can do that. 
If I'm wrong then please point out an example I can look at, because I 
haven't been able to find one.  Thanks,


Josef
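
A rough sketch of the kind of idr usage being described above, for
illustration only: this is not the forthcoming patches, and the names,
helper, and locking are all hypothetical.

#include <linux/idr.h>
#include <linux/mutex.h>
#include <linux/gfp.h>

struct nbd_device;

static DEFINE_IDR(nbd_index_idr);
static DEFINE_MUTEX(nbd_index_mutex);

/*
 * ADD: claim a specific index (fails with -ENOSPC if it already exists),
 * or pass a negative index for GET_NEXT-style "first unused device".
 */
static int nbd_index_add(struct nbd_device *nbd, int index)
{
	int id;

	mutex_lock(&nbd_index_mutex);
	if (index >= 0)
		/* only succeeds if 'index' is not already in use */
		id = idr_alloc(&nbd_index_idr, nbd, index, index + 1,
			       GFP_KERNEL);
	else
		/* allocate the lowest unused index */
		id = idr_alloc(&nbd_index_idr, nbd, 0, 0, GFP_KERNEL);
	mutex_unlock(&nbd_index_mutex);

	return id;
}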



Re: NBD: exported files over something around 1 TiB get an insane device size on the client side and are actually empty

2017-01-14 Thread Josef Bacik
On Sat, Jan 14, 2017 at 6:31 PM, Christoph Anton Mitterer 
 wrote:

Hi.

On advice from Alex Bligh I'd like to ping linux-block and nbd-general
about the issue described here:
https://github.com/NetworkBlockDevice/nbd/issues/44

What basically happens is that with a recent kernel (Linux heisenberg
4.9.0-1-amd64 #1 SMP Debian 4.9.2-2 (2017-01-12) x86_64 GNU/Linux),
when a device larger than something between 1 TiB and 2 TiB is exported
via nbd-server and connected to via nbd-client (which uses the kernel
driver, AFAIU), then the device size gets insanely large while reading
from /dev/nbd0 actually gives nothing.

The bug does not seem to happen when the kernel is not involved (we
tried using nbd-server and qemu-img as the client).

I don't think it makes all too much sense to copy & paste everything
that was already tried for testing here, so please have a look at the
issue on github.


Yeah I noticed this in testing, there's a bug with NBD since its
inception where it uses a 32 bit number for keeping track of the size
of the device. I fixed it with


nbd: use loff_t for blocksize and nbd_set_size args

It'll be in 4.10.  Thanks,

Josef
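
A plain userspace illustration of the class of bug being described here,
not the driver code itself: a size that needs 64 bits gets mangled once it
passes through a 32-bit variable, which is consistent with very large
exports showing a bogus size and reading back nothing.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t bytes = 2ULL << 40;		/* a 2 TiB export */
	uint32_t as_32bit = (uint32_t)bytes;	/* what a 32-bit size field keeps */

	/* prints "2199023255552 truncated to 0" - the upper bits are lost */
	printf("%llu truncated to %u\n",
	       (unsigned long long)bytes, as_32bit);
	return 0;
}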



[GIT PULL] Block fixes for 4.10-rc

2017-01-14 Thread Jens Axboe
Hi Linus,

Here's a set of fixes for the current series. This pull request
contains:

- The virtio_blk stack DMA corruption fix from Christoph, fixing an
  issue with VMAP stacks.

- O_DIRECT blkbits calculation fix from Chandan.

- Discard regression fix from Christoph.

- Queue init error handling fixes for nbd and virtio_blk, from Omar and
  Jeff.

- Two small nvme fixes, from Christoph and Guilherme.

- Rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
  to more closely follow what we do in other places in the block layer.
  This interface is new for this series, so let's get the naming right
  before releasing a kernel with this feature. From Damien.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus



Chandan Rajendra (1):
  do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

Christoph Hellwig (7):
  virtio_blk: avoid DMA to stack for the sense buffer
  nvme-rdma: fix nvme_rdma_queue_is_ready
  block: add blk_rq_payload_bytes
  scsi: use blk_rq_payload_bytes
  nvme: use blk_rq_payload_bytes
  sd: remove __data_len hack for WRITE SAME
  block: don't try to discard from __blkdev_issue_zeroout

Damien Le Moal (1):
  block: Rename blk_queue_zone_size and bdev_zone_size

Guilherme G. Piccoli (1):
  nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too

Jeff Moyer (1):
  nbd: blk_mq_init_queue returns an error code on failure, not NULL

Omar Sandoval (1):
  virtio_blk: fix panic in initialization error path

 block/blk-lib.c| 13 ++---
 block/blk-zoned.c  |  4 ++--
 block/partition-generic.c  | 14 +++---
 drivers/block/nbd.c|  6 --
 drivers/block/virtio_blk.c |  7 +--
 drivers/nvme/host/core.c   |  7 +--
 drivers/nvme/host/fc.c |  5 ++---
 drivers/nvme/host/nvme.h   |  8 
 drivers/nvme/host/pci.c| 19 ---
 drivers/nvme/host/rdma.c   | 15 ++-
 drivers/scsi/scsi_lib.c|  2 +-
 drivers/scsi/sd.c  | 17 +
 fs/direct-io.c |  3 ++-
 fs/f2fs/segment.c  |  4 ++--
 fs/f2fs/super.c|  6 +++---
 include/linux/blkdev.h | 19 ---
 16 files changed, 66 insertions(+), 83 deletions(-)

-- 
Jens Axboe



NBD: exported files over something around 1 TiB get an insane device size on the client side and are actually empty

2017-01-14 Thread Christoph Anton Mitterer
Hi.

On advice from Alex Bligh I'd like to ping linux-block and nbd-general
about the issue described here:
https://github.com/NetworkBlockDevice/nbd/issues/44

What basically happens is that with a recent kernel (Linux heisenberg
4.9.0-1-amd64 #1 SMP Debian 4.9.2-2 (2017-01-12) x86_64 GNU/Linux),
when a device larger than something between 1 TiB and 2 TiB is exported
via nbd-server and connected to via nbd-client (which uses the kernel
driver, AFAIU), then the device size gets insanely large while reading
from /dev/nbd0 actually gives nothing.


The bug does not seem to happen when the kernel is not involved (we
tried using nbd-server and qemu-img as the client).

I don't think it makes all too much sense to copy & paste everything
that was already tried for testing here, so please have a look at the
issue on github.


Best,
Chris.



Re: [PATCH] nbd: create a recv workqueue per nbd device

2017-01-14 Thread Josef Bacik

> On Jan 14, 2017, at 4:15 PM, Sagi Grimberg  wrote:
> 
> 
>>> Hey Josef,
>>> 
 Since we are in the memory reclaim path we need our recv work to be on a
 workqueue that has WQ_MEM_RECLAIM set so we can avoid deadlocks.  Also
 set WQ_HIGHPRI since we are in the completion path for IO.
>>> 
>>> Really a workqueue per device?? Did this really give performance
>>> advantage? Can this really scale with number of devices?
>> 
>> I don't see why not, especially since these things run the whole time the 
>> device is active.  I have patches forthcoming to make device creation 
>> dynamic so we don't have a bunch all at once.  That being said I'm not 
>> married to the idea, just seemed like a good idea at the time and not 
>> particularly harmful.  Thanks,
> 
> I just don't see how having a workqueue per device helps anything? There
> are plenty of active workers per workqueue and even if it's not enough
> you can specify more with max_active.
> 
> I guess what I'm trying to say is that I don't understand what this is
> solving. The commit message explains why you need WQ_MEM_RECLAIM and why
> you want WQ_HIGHPRI, but does not explain why workqueue per device is
> helping/solving anything.

There's no reason for it, that's just the way I did it. I will test both ways 
on Tuesday and if there's no measurable difference then I'll do a global one.  
Thanks,

Josef


Re: [PATCH] nbd: create a recv workqueue per nbd device

2017-01-14 Thread Sagi Grimberg



Hey Josef,


Since we are in the memory reclaim path we need our recv work to be on a
workqueue that has WQ_MEM_RECLAIM set so we can avoid deadlocks.  Also
set WQ_HIGHPRI since we are in the completion path for IO.


Really a workqueue per device?? Did this really give performance
advantage? Can this really scale with number of devices?


I don't see why not, especially since these things run the whole time the 
device is active.  I have patches forthcoming to make device creation dynamic 
so we don't have a bunch all at once.  That being said I'm not married to the 
idea, just seemed like a good idea at the time and not particularly harmful.  
Thanks,


I just don't see how having a workqueue per device helps anything? There
are plenty of active workers per workqueue and even if it's not enough
you can specify more with max_active.

I guess what I'm trying to say is that I don't understand what this is
solving. The commit message explains why you need WQ_MEM_RECLAIM and why
you want WQ_HIGHPRI, but does not explain why workqueue per device is
helping/solving anything.


Re: [PATCH] nbd: use an idr to keep track of nbd devices

2017-01-14 Thread Sagi Grimberg



Hey Josef,


To prepare for dynamically adding new nbd devices to the system switch
from using an array for the nbd devices and instead use an idr.  This
copies what loop does for keeping track of its devices.


I think ida_simple_* is simpler and sufficient here isn't it?


I use more of the IDR stuff in later patches, I just haven't posted those yet 
because meetings.  Thanks,


Can you elaborate on the usage? What do you intend to do
that a simple ida cannot satisfy?