Re: [PATCH] direct-io: don't introduce another read of inode->i_blkbits

2017-01-09 Thread Chandan Rajendra
On Monday, January 09, 2017 04:42:58 PM Jeff Moyer wrote:
> Commit 20ce44d545844 ("do_direct_IO: Use inode->i_blkbits to compute
> block count to be cleaned") introduced a regression: if the block size
> of the block device is changed while a direct I/O request is being
> setup, it can result in a panic.  See commit ab73857e354ab ("direct-io:
> don't read inode->i_blkbits multiple times") for the reasoning, and
> commit b87570f5d3496 ("Fix a crash when block device is read and block
> size is changed at the same time") for a more detailed problem
> description and reproducer.
> 
> Fixes: 20ce44d545844
> Signed-off-by: Jeff Moyer 
> 
> ---
> Chandan, can you please test this to ensure this still fixes your problem?

This patch fixes the failure.

Tested-by: Chandan Rajendra 

-- 
chandan



Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

2017-01-09 Thread Theodore Ts'o
On Tue, Jan 10, 2017 at 10:42:45AM +0900, Damien Le Moal wrote:
> Thank you for the information. I will check this out. Is it the
> optimization that aggressively delays metadata updates by allowing
> reading of metadata blocks directly from the journal (for blocks that
> are not yet updated in place)?

Essentially, yes.  In some cases, the metadata might never be written
back to its permanent location on disk.  Instead, we might copy a
metadata block from the tail of the journal to the head, if there is
enough space.  So effectively, it turns the journal into a log
structured store for metadata blocks.  Over time, if metadata blocks
become cold we can evict them from the journal to make room for more
frequently updated metadata blocks (given the currently running
workload).  Eliminating the random 4k updates to the allocation
bitmaps, inode table blocks, etc., really helps with host-aware SMR
drives, without requiring the massive amount of changes needed to make
a file system compatible with the host-managed model.

When you consider that most files are never updated in place, but
are usually just replaced, I suspect that with some
additional adjustments to a traditional file system's block allocator,
we can get even further wins.  But that's for future work...

> Emulators may indeed be very useful for development. But we could also
> go further and implement the different models using device mappers too.
> Doing so, the same device could be used with different FTL through the
> same DM interface. And this may also simplify the implementation of
> complex models using DM stacking (e.g. the host-aware model can be
> implemented on top of a host-managed model).

Yes, indeed.  That would also allow people to experiment with how much
benefit can be derived if we were to give additional side channel
information to STL / FTL's --- since it's much easier to adjust kernel
code than to go through negotiations with an HDD/SSD vendor to make
firmware changes!

This may be an area where, if we can create the right framework and
fund some research work, we could get researchers and their graduate
students interested in figuring out what division of responsibilities,
and which hints passed back and forth between the storage device and
the host, would have the most benefit.

Cheers,

- Ted


Re: [PATCH] virtio_blk: fix panic in initialization error path

2017-01-09 Thread Jens Axboe
On 01/09/2017 09:10 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 09, 2017 at 11:44:12AM -0800, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> If blk_mq_init_queue() returns an error, it gets assigned to
>> vblk->disk->queue. Then, when we call put_disk(), we end up calling
>> blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
>> only assigning to vblk->disk->queue on success.
>>
>> Signed-off-by: Omar Sandoval 
> 
> Acked-by: Michael S. Tsirkin 
> 
> Jens, do you mind picking this one up as well, since
> you have one virtio-blk patch already?

No problem, in fact I already queued it up.

-- 
Jens Axboe



Re: [PATCH] virtio_blk: fix panic in initialization error path

2017-01-09 Thread Michael S. Tsirkin
On Mon, Jan 09, 2017 at 11:44:12AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> If blk_mq_init_queue() returns an error, it gets assigned to
> vblk->disk->queue. Then, when we call put_disk(), we end up calling
> blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
> only assigning to vblk->disk->queue on success.
> 
> Signed-off-by: Omar Sandoval 

Acked-by: Michael S. Tsirkin 

Jens, do you mind picking this one up as well, since
you have one virtio-blk patch already?


> ---
>  drivers/block/virtio_blk.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 5545a679abd8..8587361e5356 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -628,11 +628,12 @@ static int virtblk_probe(struct virtio_device *vdev)
>   if (err)
>   goto out_put_disk;
>  
> - q = vblk->disk->queue = blk_mq_init_queue(&vblk->tag_set);
> + q = blk_mq_init_queue(&vblk->tag_set);
>   if (IS_ERR(q)) {
>   err = -ENOMEM;
>   goto out_free_tags;
>   }
> + vblk->disk->queue = q;
>  
>   q->queuedata = vblk;
>  
> -- 
> 2.11.0


Re: [PATCH] virtio_blk: fix panic in initialization error path

2017-01-09 Thread Jason Wang



On 2017-01-10 03:44, Omar Sandoval wrote:

From: Omar Sandoval 

If blk_mq_init_queue() returns an error, it gets assigned to
vblk->disk->queue. Then, when we call put_disk(), we end up calling
blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
only assigning to vblk->disk->queue on success.

Signed-off-by: Omar Sandoval 
---
  drivers/block/virtio_blk.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a679abd8..8587361e5356 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -628,11 +628,12 @@ static int virtblk_probe(struct virtio_device *vdev)
if (err)
goto out_put_disk;
  
-	q = vblk->disk->queue = blk_mq_init_queue(&vblk->tag_set);
+	q = blk_mq_init_queue(&vblk->tag_set);
if (IS_ERR(q)) {
err = -ENOMEM;
goto out_free_tags;
}
+   vblk->disk->queue = q;
  
  	q->queuedata = vblk;
  


Acked-by: Jason Wang 


Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue

2017-01-09 Thread Dan Williams
On Sun, Jan 8, 2017 at 12:50 PM, Dan Williams  wrote:
> On Sun, Jan 8, 2017 at 11:46 AM, Jan Kara  wrote:
>> On Fri 06-01-17 09:45:45, Dan Williams wrote:
>>> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara  wrote:
>>> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
>>> >> The ->bd_queue member of struct block_device was added in commit
>>> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>>> >> v3.3. However, blk_get_backing_dev_info() has been using
>>> >> bdev_get_queue() and grabbing the request_queue through the gendisk
>>> >> since before the git era.
>>> >>
>>> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>>> >> not. The queue remains valid until the final put of the parent disk.
>>> >>
>>> >> The following crash signature results from blk_get_backing_dev_info()
>>> >> trying to lookup the queue through ->bd_disk after the final put of the
>>> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>>> >> which is guaranteed to still be valid at invalidate_partition() time.
>>> >>
>>> >>  BUG: unable to handle kernel NULL pointer dereference at 0568
>>> >>  IP: blk_get_backing_dev_info+0x10/0x20
>>> >>  [..]
>>> >>  Call Trace:
>>> >>   __inode_attach_wb+0x3a7/0x5d0
>>> >>   __filemap_fdatawrite_range+0xf8/0x100
>>> >>   filemap_write_and_wait+0x40/0x90
>>> >>   fsync_bdev+0x54/0x60
>>> >>   ? bdget_disk+0x30/0x40
>>> >>   invalidate_partition+0x24/0x50
>>> >>   del_gendisk+0xfa/0x230
>>> >
>>> > So we have similar reports of the same problem. E.g.:
>>> >
>>> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>>> >
>>> > However I kind of miss how your patch would fix all those cases. The
>>> > principal problem is that inode_to_bdi() called on a block device inode
>>> > wants to get the backing_dev_info, however on the last close of a block device
>>> > we do put_disk() and thus the request queue containing backing_dev_info
>>> > does not have to be around at that time. In your case you are lucky enough
>>> > to have the containing disk still around but that's not the case for all
>>> > inode_to_bdi() users (see e.g. the report I referenced) and your patch
>>> > would change a relatively straightforward NULL pointer dereference into a rather
>>> > subtle use-after-free issue
>>>
>>> True. If there are other cases that don't hold their own queue
>>> reference this patch makes things worse.
>>>
>>> > so I disagree with going down this path.
>>>
>>> I still think this patch is the right thing to do, but it needs to
>>> come after the wider guarantee that having an active bdev reference
>>> guarantees that the queue and backing_dev_info are still allocated.
>>>
>>> > So what I think needs to be done is that we make backing_dev_info
>>> > independently allocated structure with different lifetime rules to gendisk
>>> > or request_queue - definitely I want it to live as long as block device
>>> > inode exists. However it needs more thought what the exact lifetime rules
>>> > will be.
>>>
>>> Hmm, why does it need to be separately allocated?
>>>
>>> Something like this, passes the libnvdimm unit tests: (non-whitespace
>>> damaged version attached)
>>
>> So the problem with this approach is that the request queue will be pinned while
>> the bdev inode exists. And how long that is, is impossible to predict or influence
>> from userspace, so e.g. you cannot remove a device driver from memory, and even
>> unplugging a USB device after it has been unmounted would suddenly go via a path of
>> "device removed while it is used" which can have unexpected consequences. I
>> guess Jens or Christoph will know more about the details...
>
> We do have the "block, fs: reliably communicate bdev end-of-life"
> effort that I need to revisit:
>
>http://www.spinics.net/lists/linux-fsdevel/msg93312.html
>
> ...but I don't immediately see how keeping the request_queue around
> longer makes the situation worse?
>

Well, the 0day kbuild robot and further testing with the libnvdimm unit
tests show that the bdi code is not ready for this release timing change.
So I retract it, sorry for the noise.


Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

2017-01-09 Thread Damien Le Moal

Ted,

On 1/5/17 01:57, Theodore Ts'o wrote:
> I agree with Damien, but I'd also add that in the future there may
> very well be some new Zone types added to the ZBC model.  So we
> shouldn't assume that the ZBC model is a fixed one.  And who knows?
> Perhaps T10 standards body will come up with a simpler model for
> interfacing with SCSI/SATA-attached SSD's that might leverage the ZBC
> model --- or not.

Totally agree. There is already some activity in T10 for a ZBC V2
standard which indeed may include new zone types (for instance a
"circular buffer" zone type that can be sequentially rewritten without a
reset, preserving previously written data for reads after the write
pointer). Such a zone type could be a perfect match for an FS journal
log space, for instance.

> Either way, that's not really relevant as far as the Linux block layer
> is concerned, since the Linux block layer is designed to be an
> abstraction on top of hardware --- and in some cases we can use a
> similar abstraction on top of eMMC's, SCSI's, and SATA's
> implementation definition of TRIM/DISCARD/WRITE SAME/SECURE
> TRIM/QUEUED TRIM, even though they are different in some subtle ways,
> and may have different performance characteristics and semantics.
> 
> The trick is to expose similarities where the differences won't matter
> to the upper layers, but also to expose the fine distinctions and
> allow the file system and/or user space to use the protocol-specific
> differences when it matters to them.

Absolutely. The initial zoned block device support was written to match
what ZBC/ZAC defines. It was simple this way and there were no other
users of the zone concept. But the device models and zone types are just
numerical values reported to the device user. The block I/O stack
currently does not use these values beyond the device initialization. It
is up to the users (e.g. an FS) of the device to determine how to
correctly use the device according to the types reported. So this basic
design is definitely extensible to new zone types and device models.

> 
> Designing that is going to be important, and I can guarantee we won't
> get it right at first.  Which is why it's a good thing that internal
> kernel interfaces aren't cast into concrete, and can be subject to
> change as new revisions to ZBC, or new interfaces (like perhaps
> OCSSD's) get promulgated by various standards bodies or by various
> vendors.

Indeed. The ZBC case was simple as we matched the standard-defined
models, which in any case are not really used directly by the
block I/O stack itself. Only upper layers use them.

In the case of OCSSDs, this adds one hardware-defined model set by the
standard, plus a potential collection of software-defined models through
different FTL implementations on the host. Getting these models and
their API right will indeed be tricky. As a first step, providing a
ZBC-like host-aware model and a host-managed model may be a good idea, as
upper-layer code that is already ready for ZBC disks will work out-of-the-box
for OCSSDs too. From there, I can see a lot of possibilities for more
SSD-optimized models though.

>>> Another point is that QLC devices could have trickier erase-block
>>> management features. Also, we have to apply the erase operation on a NAND
>>> flash erase block, but that is not mandatory for the case of an SMR zone.
>>
>> Incorrect: host-managed devices require a zone "reset" (equivalent to
>> discard/trim) to be reused after being written once. So again, the
>> "tricky features" you mention will depend on the device "model",
>> whatever this ends up being for an open channel SSD.
> 
> ... and this is exposed by having different zone types (sequential
> write required vs sequential write preferred vs conventional).  And if
> OCSSD's "zones" don't fit into the current ZBC zone types, we can
> easily add new ones.  I would suggest however, that we explicitly
> disclaim that the block device layer's code points for zone types is
> an exact match with the ZBC zone types numbering, precisely so we can
> add new zone types that correspond to abstractions from different
> hardware types, such as OCSSD.

The struct blk_zone type is 64B in size but currently only uses 32B. So
there is room for new fields, and existing fields can take newly defined
values too, as the ZBC standard uses only a few of the possible values in
the structure fields.

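For reference, the uapi definition at the time (include/uapi/linux/blkzoned.h)
looks roughly like this, which is where the 64B size and the reserved room
come from (comments paraphrased):

struct blk_zone {
	__u64	start;		/* zone start sector */
	__u64	len;		/* zone length in sectors */
	__u64	wp;		/* write pointer position */
	__u8	type;		/* zone type */
	__u8	cond;		/* zone condition */
	__u8	non_seq;	/* non-sequential write resources active */
	__u8	reset;		/* reset write pointer recommended */
	__u8	reserved[36];	/* room for new fields */
};
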
>> Not necessarily. Again think in terms of device "model" and associated
>> feature set. An FS implementation may decide to support all possible
>> models, with likely a resulting incredible complexity. More likely,
>> similarly to what is happening with SMR, only models that make sense
>> will be supported by FS implementations that can be easily modified.
>> The example again here is f2fs: changes to support SMR were rather simple,
>> whereas the initial effort to support SMR with ext4 was pretty much
>> abandoned as it was too complex to integrate into the existing code while
>> keeping the existing on-disk format.

Re: [PATCH V5 00/17] blk-throttle: add .low limit

2017-01-09 Thread Shaohua Li
On Mon, Jan 09, 2017 at 04:46:35PM -0500, Tejun Heo wrote:
> Hello,
> 
> Sorry about the long delay.  Generally looks good to me.  Overall,
> there are only a few things that I think should be addressed.

Thanks for your time!
 
> * Low limit should default to zero.

I forgot to change it after changing the name from high to low.

> * The default values and roles of idle timeout vs. latency target.
> 
> I don't think either would be difficult to address.  Oh and a bit more
> documentation overall would be nice.

Will fix these.

> Thanks a lot for the great work and happy new year!

Thanks and happy new year!

Best regards,
Shaohua


Re: [PATCH V5 00/17] blk-throttle: add .low limit

2017-01-09 Thread Tejun Heo
Hello,

Sorry about the long delay.  Generally looks good to me.  Overall,
there are only a few things that I think should be addressed.

* Low limit should default to zero.

* The default values and roles of idle timeout vs. latency target.

I don't think either would be difficult to address.  Oh and a bit more
documentation overall would be nice.

Thanks a lot for the great work and happy new year!

-- 
tejun


Re: [PATCH] direct-io: don't introduce another read of inode->i_blkbits

2017-01-09 Thread Jens Axboe
On 01/09/2017 02:42 PM, Jeff Moyer wrote:
> Commit 20ce44d545844 ("do_direct_IO: Use inode->i_blkbits to compute
> block count to be cleaned") introduced a regression: if the block size
> of the block device is changed while a direct I/O request is being
> setup, it can result in a panic.  See commit ab73857e354ab ("direct-io:
> don't read inode->i_blkbits multiple times") for the reasoning, and
> commit b87570f5d3496 ("Fix a crash when block device is read and block
> size is changed at the same time") for a more detailed problem
> description and reproducer.
> 
> Fixes: 20ce44d545844
> Signed-off-by: Jeff Moyer 
> 
> ---
> Chandan, can you please test this to ensure this still fixes your problem?

Please, that would be great. And if so, then let's fold the two.

-- 
Jens Axboe



[PATCH] direct-io: don't introduce another read of inode->i_blkbits

2017-01-09 Thread Jeff Moyer
Commit 20ce44d545844 ("do_direct_IO: Use inode->i_blkbits to compute
block count to be cleaned") introduced a regression: if the block size
of the block device is changed while a direct I/O request is being
setup, it can result in a panic.  See commit ab73857e354ab ("direct-io:
don't read inode->i_blkbits multiple times") for the reasoning, and
commit b87570f5d3496 ("Fix a crash when block device is read and block
size is changed at the same time") for a more detailed problem
description and reproducer.

Fixes: 20ce44d545844
Signed-off-by: Jeff Moyer 

---
Chandan, can you please test this to ensure this still fixes your problem?

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b20adf9..c87bae4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -905,8 +905,8 @@ static inline void dio_zero_block(struct dio *dio, struct dio_submit *sdio,
 static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
struct buffer_head *map_bh)
 {
-   const unsigned i_blkbits = dio->inode->i_blkbits;
const unsigned blkbits = sdio->blkbits;
+   const unsigned i_blkbits = blkbits + sdio->blkfactor;
int ret = 0;
 
while (sdio->block_in_file < sdio->final_block_in_request) {
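
For context, a rough sketch of the relationship this fix relies on, as set up
in do_blockdev_direct_IO() (treat the exact lines as approximate):

/* i_blkbits is read from the inode once at setup; blkbits may be lowered
 * to the bdev's logical block size for misaligned I/O: */
sdio.blkbits = blkbits;
sdio.blkfactor = i_blkbits - blkbits;
/* hence blkbits + sdio->blkfactor recovers the i_blkbits value sampled at
 * setup, without re-reading inode->i_blkbits in do_direct_IO(). */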


Re: [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:33:07PM -0800, Shaohua Li wrote:
> User configures latency target, but the latency threshold for each
> request size isn't fixed. For a SSD, the IO latency highly depends on
> request size. To calculate latency threshold, we sample some data, eg,
> average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
> threshold of each request size will be the sample latency (I'll call it
> base latency) plus latency target. For example, the base latency for
> request size 4k is 80us and user configures latency target 60us. The 4k
> latency threshold will be 80 + 60 = 140us.

Ah okay, the user configures the extra latency.  Yeah, this is way
better than treating what the user configures as the target latency
for 4k IOs.

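A minimal sketch of the computation being described, assuming the per-size
base latencies have already been sampled (the helper name is illustrative,
not the patch's actual code):

/* per-size threshold = sampled base latency + user-configured target,
 * e.g. 80us base for 4k requests + 60us target = 140us threshold */
static u64 latency_threshold_ns(u64 base_latency_ns, u64 latency_target_ns)
{
	return base_latency_ns + latency_target_ns;
}
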
> @@ -25,6 +25,8 @@ static int throtl_quantum = 32;
>  #define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
>  #define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
>  
> +#define SKIP_TRACK (((u64)1) << BLK_STAT_RES_SHIFT)

SKIP_LATENCY?

> +static void throtl_update_latency_buckets(struct throtl_data *td)
> +{
> + struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
> + int i, cpu;
> + u64 last_latency = 0;
> + u64 latency;
> +
> + if (!blk_queue_nonrot(td->queue))
> + return;
> + if (time_before(jiffies, td->last_calculate_time + HZ))
> + return;
> + td->last_calculate_time = jiffies;
> +
> + memset(avg_latency, 0, sizeof(avg_latency));
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + struct latency_bucket *tmp = &td->tmp_buckets[i];
> +
> + for_each_possible_cpu(cpu) {
> + struct latency_bucket *bucket;
> +
> + /* this isn't race free, but ok in practice */
> + bucket = per_cpu_ptr(td->latency_buckets, cpu);
> + tmp->total_latency += bucket[i].total_latency;
> + tmp->samples += bucket[i].samples;

Heh, this *can* lead to surprising results (like reading zero for a
value larger than 2^32) on 32bit machines due to split updates, and if
we're using nanosecs, those surprises have a chance, albeit low, of
happening every four secs, which is a bit unsettling.  If we have to
use nanosecs, let's please use u64_stats_sync.  If we're okay with
microsecs, ulongs should be fine.

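A minimal sketch of the u64_stats_sync pattern referred to above (struct and
function names here are illustrative, not the patch's):

#include <linux/u64_stats_sync.h>

struct latency_bucket {
	u64 total_latency;
	u64 samples;
	struct u64_stats_sync sync;	/* initialized with u64_stats_init() */
};

/* per-cpu writer side */
static void bucket_add(struct latency_bucket *b, u64 lat)
{
	u64_stats_update_begin(&b->sync);
	b->total_latency += lat;
	b->samples++;
	u64_stats_update_end(&b->sync);
}

/* reader side: retries if a writer was mid-update, so no torn 64-bit reads */
static void bucket_read(struct latency_bucket *b, u64 *total, u64 *samples)
{
	unsigned int seq;

	do {
		seq = u64_stats_fetch_begin(&b->sync);
		*total = b->total_latency;
		*samples = b->samples;
	} while (u64_stats_fetch_retry(&b->sync, seq));
}
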
>  void blk_throtl_bio_endio(struct bio *bio)
>  {
>   struct throtl_grp *tg;
> + u64 finish_time;
> + u64 start_time;
> + u64 lat;
>  
>   tg = bio->bi_cg_private;
>   if (!tg)
>   return;
>   bio->bi_cg_private = NULL;
>  
> - tg->last_finish_time = ktime_get_ns();
> + finish_time = ktime_get_ns();
> + tg->last_finish_time = finish_time;
> +
> + start_time = blk_stat_time(&bio->bi_issue_stat);
> + finish_time = __blk_stat_time(finish_time);
> + if (start_time && finish_time > start_time &&
> + tg->td->track_bio_latency == 1 &&
> + !(bio->bi_issue_stat.stat & SKIP_TRACK)) {

Heh, can't we collapse some of the conditions?  e.g. flip SKIP_TRACK
to TRACK_LATENCY and set it iff the td has track_bio_latency set and
also the bio has start time set?

> @@ -2106,6 +2251,12 @@ int blk_throtl_init(struct request_queue *q)
>   td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
>   if (!td)
>   return -ENOMEM;
> + td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
> + LATENCY_BUCKET_SIZE, __alignof__(u64));
> + if (!td->latency_buckets) {
> + kfree(td);
> + return -ENOMEM;
> + }
>  
>   INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
>   throtl_service_queue_init(&td->service_queue);
> @@ -2119,10 +2270,13 @@ int blk_throtl_init(struct request_queue *q)
>   td->low_upgrade_time = jiffies;
>   td->low_downgrade_time = jiffies;
>  
> + td->track_bio_latency = UINT_MAX;

I don't think using 0, 1, UINT_MAX as enums is good for readability.

Thanks.

-- 
tejun


Re: [PATCH V5 15/17] block: track request size in blk_issue_stat

2017-01-09 Thread Tejun Heo
On Thu, Dec 15, 2016 at 12:33:06PM -0800, Shaohua Li wrote:
> Currently there is no way to know the request size when the request is
> finished. Next patch will need this info, so add to blk_issue_stat. With
> this, we will have 49 bits to track time, which is still a very long time.

Not necessarily an objection but do we really need to overload the
size field?  Would a normal extra field hurt too much?

Thanks.

-- 
tejun


Re: [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:33:05PM -0800, Shaohua Li wrote:
> @@ -438,6 +439,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t 
> gfp, int node)
>   }
>   tg->idle_ttime_threshold = U64_MAX;
>  
> + /*
> +  * target latency default 0, eg, latency threshold is 0, which means
> +  * cgroup's latency is always higher than threshold
> +  */
> +
>   return &tg->pd;
>  }

So, this is something which bothers me regarding the default settings.
I suspect the reason why the earlier patch went for tight idle time
was because we're setting default latency to zero, so to achieve good
utilization, the idle timeout must be shortened so that it neutralizes
the 0 latency target here.

I don't think this is a good default configuration.  Latency target
should be the mechanism which determines how shareable an active
cgroup which is under its low limit is.  That's the only thing it can
do anyway.  Idle time mechanism should serve a different purpose, not
an overlapping one.

If we want to default to latency guarantee, we can go for 0 latency
and a long idle timeout, even infinity.  If we want to default to good
utilization, we should pick a reasonable latency target (which is tied
to the device latency) with a reasonable idle timeout (which is tied
to how human perceives something to be idle).

Please note that it's kinda clear that we're misconfiguring it in the
previous patch in that we're altering idle timeout on device type.
Idle timeout is about the application behavior.  This isn't really
decided by request completion latency.  On the other hand, latency
target is the parameter which is device dependent.  The fact that it
was picking different idle time depending on device type means that
the roles of idle timeout and latency target are overlapping.  They
shouldn't.  It gets really confusing.

Thanks.

-- 
tejun


Re: [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit

2017-01-09 Thread Tejun Heo
On Thu, Dec 15, 2016 at 12:33:04PM -0800, Shaohua Li wrote:
> The last patch introduces a way to detect an idle cgroup. We use it to make
> the upgrade/downgrade decision. And the new algorithm can detect a completely
> idle cgroup too, so we can delete the corresponding code.

Ah, okay, the slice based idle detection gets completely removed here.
Might be better to not add it in the first place but no biggie either
way.

Thanks.

-- 
tejun


Re: blk_queue_bounce_limit() broken for mask=0xffffffff on 64bit archs

2017-01-09 Thread Christoph Hellwig
On Mon, Jan 09, 2017 at 11:48:11PM +0300, Nikita Yushchenko wrote:
> Hi
> 
> There are use cases where the architecture is 64-bit but the hardware supports
> DMA only to the lower 4G of the address space. E.g. an NVMe device on an RCar PCIe host.

The solution is to shoot the SOC designer.  If that doesn't work use
swiotlb.


Re: [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold

2017-01-09 Thread Tejun Heo
On Thu, Dec 15, 2016 at 12:33:03PM -0800, Shaohua Li wrote:
> @@ -180,6 +180,8 @@ struct throtl_data
>   unsigned int limit_index;
>   bool limit_valid[LIMIT_CNT];
>  
> + u64 dft_idle_ttime_threshold;

BTW, wouldn't idle_time be a better name for these?  Currently, it's
"idle thinktime" and I'm not sure what that means.

Thanks.

-- 
tejun


Re: [PATCH V5 11/17] blk-throttle: add a simple idle detection

2017-01-09 Thread Tejun Heo
On Thu, Dec 15, 2016 at 12:33:02PM -0800, Shaohua Li wrote:
>  /* Throttling is performed over 100ms slice and after that slice is renewed 
> */
>  #define DFL_THROTL_SLICE (HZ / 10)
>  #define MAX_THROTL_SLICE (HZ / 5)
> +#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
> +#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
> +#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */

Hmm... why are we capping idle time so low?  This is a value to be
configured by userland.  Does it make sense to cap it this low?  Also,
wouldn't it make sense to start with higher default value given that
the user has to explicitly enable low limit for it to be effective and
thus explicitly requesting best effort latency target which will be
added later?

I'm really uncomfortable with pitting these two knobs against each
other in similar time ranges.  It's really difficult to tell what a
latency target of 25us means and predict its behavior when the
idle timeout is 50us.  It's fine if some people fiddle with them but
it'd be great if the defaults clearly indicate that they're operating
in mostly separate time scales.

Thanks.

-- 
tejun


Re: [patch] nbd: blk_mq_init_queue returns an error code on failure, not NULL

2017-01-09 Thread Josef Bacik

On Mon, Jan 9, 2017 at 3:41 PM, Jens Axboe  wrote:

On 01/09/2017 01:27 PM, Omar Sandoval wrote:

 On Mon, Jan 09, 2017 at 03:20:31PM -0500, Jeff Moyer wrote:

 Additionally, don't assign directly to disk->queue, otherwise
 blk_put_queue (called via put_disk) will choke (panic) on the errno
 stored there.

 Bug found by code inspection after Omar found a similar issue in
 virtio_blk.  Compile-tested only.

 Signed-off-by: Jeff Moyer 


 Reviewed-by: Omar Sandoval 

 Compile-reviewed only :) Josef can probably test it if he cares enough,
 but it looks right.


Looks good to me, too. Josef, you want me to queue this up directly?


Y'all are hilarious.

Reviewed-by: Josef Bacik 

Thanks,

Josef



blk_queue_bounce_limit() broken for mask=0xffffffff on 64bit archs

2017-01-09 Thread Nikita Yushchenko
Hi

There are use cases where the architecture is 64-bit but the hardware
supports DMA only to the lower 4G of the address space, e.g. an NVMe
device on an RCar PCIe host.

For such cases, it looks proper to call blk_queue_bounce_limit() with the
mask set to 0xffffffff - thus making the block layer use bounce buffers
for any addresses beyond 4G.  To support that, the architecture provides a
GFP_DMA zone that covers exactly the low 4G on arm64.

However, setting this limit does not work:

  if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
          dma = 1;

When the mask is 0xffffffff that condition is false, so

  q->limits.bounce_pfn = max(max_low_pfn, b_pfn);

is executed and replaces any limit with the end of memory (on a
64-bit arch all memory is low).

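To make the failure concrete, a sketch of the arithmetic with mask =
0xffffffff and 4K pages (assuming BLK_BOUNCE_HIGH covers at least 4G on
64-bit):

	b_pfn = 0xffffffffULL >> PAGE_SHIFT;	/* = 0xfffff */
	/* min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT is also
	 * 0xfffff, so "b_pfn < ..." compares 0xfffff < 0xfffff and is false:
	 * dma stays 0, and bounce_pfn is then raised to
	 * max(max_low_pfn, b_pfn), i.e. the 4G limit is effectively
	 * discarded. */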

Not sure how to fix this properly. Any hints?


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Johannes Weiner
On Mon, Jan 09, 2017 at 09:30:05PM +0100, Jan Kara wrote:
> On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> > On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > > >  wrote:
> > > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > > workload again and about a minute in this fired:
> > > > > > > > >
> > > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > > 
> > > > > > > > Well, part of the changes during the merge window were the 
> > > > > > > > shadow
> > > > > > > > entry tracking changes that came in through Andrew's tree. 
> > > > > > > > Adding
> > > > > > > > Johannes Weiner to the participants.
> > 
> > Okay, the below patch should address this problem. Dave Jones managed
> > to reproduce it with the added WARN_ONs, and they made it obvious. He
> > cannot trigger it anymore with this fix applied. Thanks Dave!
> 
> FWIW the patch looks good to me. I'd just note that the need to pass the
> callback to deletion function and the fact that we do it only in cases
> where we think it is needed appears error-prone. With the warning you've
> added it should at least catch the cases where we got it wrong but more
> robust would be if the radix tree root contained a pointer to the callback
> function so that we would not rely on passing the callback to every place
> which can possibly free a node. Also conceptually this would make more
> sense to me since the fact that we may need to do some cleanup on node
> deletion is a property of the particular radix tree and how we use it.
> OTOH that would mean growing radix tree root by one pointer which would be
> unpopular I guess.

The last sentence is the crux, unfortunately. The first iteration of
the shadow shrinker linked up mappings that contained shadow entries,
rather than nodes. The code would have been drastically simpler in
pretty much all regards, without changes to the radix tree API. But
Dave Chinner objected to adding a pointer to the inode when we could
stick them into empty space (slab fragmentation) inside the nodes. A
fair point, I guess, especially when you consider metadata-intense
workloads. But now we're kind of stuck with the complexity of this.


Re: [patch] nbd: blk_mq_init_queue returns an error code on failure, not NULL

2017-01-09 Thread Jens Axboe
On 01/09/2017 01:27 PM, Omar Sandoval wrote:
> On Mon, Jan 09, 2017 at 03:20:31PM -0500, Jeff Moyer wrote:
>> Additionally, don't assign directly to disk->queue, otherwise
>> blk_put_queue (called via put_disk) will choke (panic) on the errno
>> stored there.
>>
>> Bug found by code inspection after Omar found a similar issue in
>> virtio_blk.  Compile-tested only.
>>
>> Signed-off-by: Jeff Moyer 
> 
> Reviewed-by: Omar Sandoval 
> 
> Compile-reviewed only :) Josef can probably test it if he cares enough,
> but it looks right.

Looks good to me, too. Josef, you want me to queue this up directly?

-- 
Jens Axboe



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Jan Kara
On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > >  wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to deletion function and the fact that we do it only in cases
where we think it is needed appears error-prone. With the warning you've
added it should at least catch the cases where we got it wrong but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing radix tree root by one pointer which would be
unpopular I guess.

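A rough sketch of the idea, purely hypothetical (the node_free field does not
exist in the kernel's radix_tree_root):

struct radix_tree_root {
	gfp_t			gfp_mask;
	struct radix_tree_node	__rcu *rnode;
	/* hypothetical: set once per tree and invoked by the radix tree code
	 * whenever a node is freed, instead of being passed by each caller */
	void (*node_free)(struct radix_tree_node *node);
};
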
Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH V5 10/17] blk-throttle: make bandwidth change smooth

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:33:01PM -0800, Shaohua Li wrote:
>  static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
>  {
>   struct blkcg_gq *blkg = tg_to_blkg(tg);
> + struct throtl_data *td;
>   uint64_t ret;
>  
>   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
>   return U64_MAX;
> - return tg->bps[rw][tg->td->limit_index];
> +
> + td = tg->td;
> + ret = tg->bps[rw][td->limit_index];
> + if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] !=
> + tg->bps[rw][LIMIT_MAX]) {
> + uint64_t increase;
> +
> + if (td->scale < 4096 && time_after_eq(jiffies,

Hmm... why do we need to limit scale to 4096?  As 4096 is a big
number, this is only theoretical, but this means that if max is more
than 2048 times low, that will never be reached, right?

> + td->low_upgrade_time + td->scale * td->throtl_slice)) {
> + unsigned int time = jiffies - td->low_upgrade_time;
> +
> + td->scale = time / td->throtl_slice;
> + }
> + increase = (tg->bps[rw][LIMIT_LOW] >> 1) * td->scale;
> + ret = min(tg->bps[rw][LIMIT_MAX],
> + tg->bps[rw][LIMIT_LOW] + increase);
> + }
> + return ret;
>  }

I think the code can use some comments.

>  static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
>  {
>   struct blkcg_gq *blkg = tg_to_blkg(tg);
> + struct throtl_data *td;
>   unsigned int ret;
>  
>   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
>   return UINT_MAX;
> - return tg->iops[rw][tg->td->limit_index];
> +
> + td = tg->td;
> + ret = tg->iops[rw][td->limit_index];
> + if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] !=
> + tg->iops[rw][LIMIT_MAX]) {
> + uint64_t increase;
> +
> + if (td->scale < 4096 && time_after_eq(jiffies,
> + td->low_upgrade_time + td->scale * td->throtl_slice)) {
> + unsigned int time = jiffies - td->low_upgrade_time;
> +
> + td->scale = time / td->throtl_slice;
> + }
> +
> + increase = (tg->iops[rw][LIMIT_LOW] >> 1) * td->scale;
> + ret = min(tg->iops[rw][LIMIT_MAX],
> + tg->iops[rw][LIMIT_LOW] + (unsigned int)increase);

Would it be worthwhile to factor the common part into a helper?

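Something like the following rough sketch, based only on the fields visible
in the quoted hunks (the helper name is hypothetical):

static u64 throtl_scaled_limit(u64 low, u64 max, struct throtl_data *td)
{
	u64 increase;

	/* ramp td->scale up linearly, one step per throtl_slice since the
	 * last upgrade, capped at 4096 */
	if (td->scale < 4096 && time_after_eq(jiffies,
	    td->low_upgrade_time + td->scale * td->throtl_slice)) {
		unsigned int time = jiffies - td->low_upgrade_time;

		td->scale = time / td->throtl_slice;
	}

	increase = (low >> 1) * td->scale;
	return min_t(u64, max, low + increase);
}
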
> @@ -1662,6 +1702,13 @@ static void throtl_upgrade_state(struct throtl_data 
> *td)
>  
>  static void throtl_downgrade_state(struct throtl_data *td, int new)
>  {
> + td->scale /= 2;
> +
> + if (td->scale) {
> + td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
> + return;
> + }

Cool, so linear increase and exponential backdown.  Yeah, that makes
sense to me but let's please document it.

Thanks.

-- 
tejun


Re: [patch] nbd: blk_mq_init_queue returns an error code on failure, not NULL

2017-01-09 Thread Omar Sandoval
On Mon, Jan 09, 2017 at 03:20:31PM -0500, Jeff Moyer wrote:
> Additionally, don't assign directly to disk->queue, otherwise
> blk_put_queue (called via put_disk) will choke (panic) on the errno
> stored there.
> 
> Bug found by code inspection after Omar found a similar issue in
> virtio_blk.  Compile-tested only.
> 
> Signed-off-by: Jeff Moyer 

Reviewed-by: Omar Sandoval 

Compile-reviewed only :) Josef can probably test it if he cares enough,
but it looks right.

> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 38c576f..50a2020 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -1042,6 +1042,7 @@ static int __init nbd_init(void)
>   return -ENOMEM;
>  
>   for (i = 0; i < nbds_max; i++) {
> + struct request_queue *q;
>   struct gendisk *disk = alloc_disk(1 << part_shift);
>   if (!disk)
>   goto out;
> @@ -1067,12 +1068,13 @@ static int __init nbd_init(void)
>* every gendisk to have its very own request_queue struct.
>* These structs are big so we dynamically allocate them.
>*/
> - disk->queue = blk_mq_init_queue(&nbd_dev[i].tag_set);
> - if (!disk->queue) {
> + q = blk_mq_init_queue(&nbd_dev[i].tag_set);
> + if (IS_ERR(q)) {
>   blk_mq_free_tag_set(&nbd_dev[i].tag_set);
>   put_disk(disk);
>   goto out;
>   }
> + disk->queue = q;
>  
>   /*
>* Tell the block layer that we are not a rotational device


[patch] nbd: blk_mq_init_queue returns an error code on failure, not NULL

2017-01-09 Thread Jeff Moyer
Additionally, don't assign directly to disk->queue, otherwise
blk_put_queue (called via put_disk) will choke (panic) on the errno
stored there.

Bug found by code inspection after Omar found a similar issue in
virtio_blk.  Compile-tested only.

Signed-off-by: Jeff Moyer 

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 38c576f..50a2020 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1042,6 +1042,7 @@ static int __init nbd_init(void)
return -ENOMEM;
 
for (i = 0; i < nbds_max; i++) {
+   struct request_queue *q;
struct gendisk *disk = alloc_disk(1 << part_shift);
if (!disk)
goto out;
@@ -1067,12 +1068,13 @@ static int __init nbd_init(void)
 * every gendisk to have its very own request_queue struct.
 * These structs are big so we dynamically allocate them.
 */
-   disk->queue = blk_mq_init_queue(&nbd_dev[i].tag_set);
-   if (!disk->queue) {
+   q = blk_mq_init_queue(&nbd_dev[i].tag_set);
+   if (IS_ERR(q)) {
blk_mq_free_tag_set(&nbd_dev[i].tag_set);
put_disk(disk);
goto out;
}
+   disk->queue = q;
 
/*
 * Tell the block layer that we are not a rotational device


Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jan Kara
On Sun 08-01-17 20:17:10, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Ah, good catch. You can add:

Reviewed-by: Jan Kara 

Honza

> Signed-off-by: Chandan Rajendra 
> ---
>  fs/direct-io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index aeae8c0..b20adf9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -905,6 +905,7 @@ static inline void dio_zero_block(struct dio *dio, struct 
> dio_submit *sdio,
>  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
>   struct buffer_head *map_bh)
>  {
> + const unsigned i_blkbits = dio->inode->i_blkbits;
>   const unsigned blkbits = sdio->blkbits;
>   int ret = 0;
>  
> @@ -949,7 +950,7 @@ static int do_direct_IO(struct dio *dio, struct 
> dio_submit *sdio,
>   clean_bdev_aliases(
>   map_bh->b_bdev,
>   map_bh->b_blocknr,
> - map_bh->b_size >> blkbits);
> + map_bh->b_size >> i_blkbits);
>   }
>  
>   if (!sdio->blkfactor)
> -- 
> 2.5.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH V5 09/17] blk-throttle: detect completed idle cgroup

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:33:00PM -0800, Shaohua Li wrote:
> @@ -1660,6 +1671,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp 
> *tg)
>   struct throtl_data *td = tg->td;
>   unsigned long now = jiffies;
>  
> + if (time_after_eq(now, tg->last_dispatch_time[READ] +
> + td->throtl_slice) &&
> + time_after_eq(now, tg->last_dispatch_time[WRITE] +
> + td->throtl_slice))
> + return false;

So, the duration used here is gonna be made explicitly configurable by
a future patch, right?  Might be worthwhile to note that in the
description.

Thanks.

-- 
tejun


Re: [PATCH V5 08/17] blk-throttle: make throtl_slice tunable

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:32:59PM -0800, Shaohua Li wrote:
> throtl_slice is important for blk-throttling. It's called slice
> internally but it really is the time window over which blk-throttling samples data.
> blk-throttling will make decisions based on the samples. An example is
> bandwidth measurement. A cgroup's bandwidth is measured in the time
> interval of throtl_slice.
> 
> A small throtl_slice means cgroups have smoother throughput but burn
> more CPU. It has a 100ms default value, which is not appropriate for all
> disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
> it tunable.
> 
> Since throtl_slice isn't a time slice, the sysfs name
> 'throttle_sample_time' reflects its character better.

Generally looks good.  I think some documentation would be great.
Also, do we want to set it to a lower number if blk_queue_nonrot()?

Thanks.

-- 
tejun


Re: [PATCH] virtio_blk: fix panic in initialization error path

2017-01-09 Thread Jeff Moyer
Omar Sandoval  writes:

> From: Omar Sandoval 
>
> If blk_mq_init_queue() returns an error, it gets assigned to
> vblk->disk->queue. Then, when we call put_disk(), we end up calling
> blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
> only assigning to vblk->disk->queue on success.
>
> Signed-off-by: Omar Sandoval 

Reviewed-by: Jeff Moyer 


Re: [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big

2017-01-09 Thread Tejun Heo
On Thu, Dec 15, 2016 at 12:32:58PM -0800, Shaohua Li wrote:
> A cgroup could be throttled to a limit, but when all cgroups cross the high
> limit, the queue enters a higher state and so the group should be throttled
> to a higher limit. It's possible the cgroup is sleeping because of
> throttling and other cgroups don't dispatch IO any more. In this case,
> nobody can trigger the current downgrade/upgrade logic. To fix this issue,
> we could either set up a timer to wake up the cgroup if other cgroups are
> idle, or make sure this cgroup doesn't sleep too long. Setting up a timer
> means we must change the timer very frequently. This patch chooses the
> latter. Capping the cgroup sleep time wouldn't change the cgroup's
> bps/iops, but could make it wake up more frequently, which isn't a big
> issue because throtl_slice * 8 is already quite big.
> 
> Signed-off-by: Shaohua Li 
> ---
>  block/blk-throttle.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0f65fce..41ec72c 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -588,6 +588,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
>  static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
> unsigned long expires)
>  {
> + unsigned long max_expire = jiffies + 8 * throtl_slice;
> +
> + if (time_after(expires, max_expire))
> + expires = max_expire;

A comment explaining why we need this would be nice.

Thanks.

-- 
tejun


Re: [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state

2017-01-09 Thread Tejun Heo
Hello, again.

On Mon, Jan 09, 2017 at 01:40:53PM -0500, Tejun Heo wrote:
> I think it'd be great to explain the above.  It was a bit difficult
> for me to follow.  It's also interesting because we're tying state
> transitions for both read and write together.  blk-throtl has been
> handling reads and writes independently, now the mode switching from
> low to max is shared across reads and writes.  I suppose it could be
> fine but would it be complex to separate them out?  It's weird to make
> this one state shared across reads and writes while not for others or
> was this sharing intentional?

I thought more about it and as the low limit is regulated by latency,
it makes sense to make the state shared across reads and writes;
otherwise, IOs in one direction could easily mess up the other
direction.  Can you please document that this is an intentional design
and explain the rationale tho?

Thanks.

-- 
tejun


[PATCH] virtio_blk: fix panic in initialization error path

2017-01-09 Thread Omar Sandoval
From: Omar Sandoval 

If blk_mq_init_queue() returns an error, it gets assigned to
vblk->disk->queue. Then, when we call put_disk(), we end up calling
blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
only assigning to vblk->disk->queue on success.

Signed-off-by: Omar Sandoval 
---
 drivers/block/virtio_blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a679abd8..8587361e5356 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -628,11 +628,12 @@ static int virtblk_probe(struct virtio_device *vdev)
if (err)
goto out_put_disk;
 
-   q = vblk->disk->queue = blk_mq_init_queue(&vblk->tag_set);
+   q = blk_mq_init_queue(&vblk->tag_set);
if (IS_ERR(q)) {
err = -ENOMEM;
goto out_free_tags;
}
+   vblk->disk->queue = q;
 
q->queuedata = vblk;
 
-- 
2.11.0



Re: [PATCH 2/2] nvme: improve cmb sysfs reporting

2017-01-09 Thread Stephen Bates
> Minor nit below
>
>
>> +
>> +for (i = NVME_CMB_CAP_SQS; i <= NVME_CMB_CAP_WDS; i++)
>>
> I'd prefer seeing (i = 0; i < ARRAY_SIZE(..); i++) because it provides
> automatic bounds checking against future code.
>

Thanks Jon, I will take a look at doing this in a V1.

Stephen


Re: [PATCH 0/2] nvme: Improvements in sysfs entry for NVMe CMBs

2017-01-09 Thread Stephen Bates
>
> I have added 1/2, since that one is a no-brainer. For 2/2, not so sure.
> Generally we try to avoid having sysfs files that aren't single-value
> output. That isn't a super hard rule, but it is preferable.
>
> --
> Jens Axboe
>

Thanks Jens and sorry for the delay (extended vacation). Thanks for
pulling 1/2. I will go take a look at 2/2.

Stephen


Re: [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state

2017-01-09 Thread Tejun Heo
Hello, Shaohua.

On Thu, Dec 15, 2016 at 12:32:56PM -0800, Shaohua Li wrote:
> For a cgroup hierarchy, there are two cases. The children have a lower low
> limit than the parent; the parent's low limit is meaningless. If the children's
> bps/iops cross the low limit, we can upgrade the queue state. The other case is
> that the children have a higher low limit than the parent; the children's low
> limit is meaningless. As long as the parent's bps/iops cross the low limit, we
> can upgrade the queue state.

The above isn't completely accurate as the parent should consider the
sum of what's currently being used in the children.

> +static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
> +{
> + struct throtl_service_queue *sq = &tg->service_queue;
> + bool read_limit, write_limit;
> +
> + /*
> +  * if cgroup reaches low/max limit (max >= low), it's ok to next
> +  * limit
> +  */
> + read_limit = tg->bps[READ][LIMIT_LOW] != U64_MAX ||
> +  tg->iops[READ][LIMIT_LOW] != UINT_MAX;
> + write_limit = tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
> +   tg->iops[WRITE][LIMIT_LOW] != UINT_MAX;
> + if (read_limit && sq->nr_queued[READ] &&
> + (!write_limit || sq->nr_queued[WRITE]))
> + return true;
> + if (write_limit && sq->nr_queued[WRITE] &&
> + (!read_limit || sq->nr_queued[READ]))
> + return true;

I think it'd be great to explain the above.  It was a bit difficult
for me to follow.  It's also interesting because we're tying state
transitions for both read and write together.  blk-throtl has been
handling reads and writes independently; now the mode switching from
low to max is shared across reads and writes.  I suppose it could be
fine, but would it be complex to separate them out?  It's weird to make
this one state shared across reads and writes while not for the others,
or was this sharing intentional?

> + return false;
> +}
> +
> +static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
> +{
> + while (true) {
> + if (throtl_tg_can_upgrade(tg))
> + return true;
> + tg = sq_to_tg(tg->service_queue.parent_sq);
> + if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
> + !tg_to_blkg(tg)->parent))
> + return false;

Isn't the low limit v2 only?  Do we need the on_dfl test this deep?

> + }
> + return false;
> +}
> +
> +static bool throtl_can_upgrade(struct throtl_data *td,
> + struct throtl_grp *this_tg)
> +{
> + struct cgroup_subsys_state *pos_css;
> + struct blkcg_gq *blkg;
> +
> + if (td->limit_index != LIMIT_LOW)
> + return false;
> +
> + rcu_read_lock();
> + blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
> + struct throtl_grp *tg = blkg_to_tg(blkg);
> +
> + if (tg == this_tg)
> + continue;
> + if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
> + continue;
> + if (!throtl_hierarchy_can_upgrade(tg)) {
> + rcu_read_unlock();
> + return false;
> + }
> + }
> + rcu_read_unlock();
> + return true;
> +}

So, if all cgroups with a low limit are over their limits (have
commands queued in the delay queue), the state can be upgraded, right?
Yeah, that seems correct to me.  The patch description didn't seem to
match that, though.  Can you please update the description accordingly?

Thanks.

-- 
tejun


Re: [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit

2017-01-09 Thread Tejun Heo
Hello,

On Thu, Dec 15, 2016 at 12:32:55PM -0800, Shaohua Li wrote:
> each queue will have a state machine. Initially queue is in LIMIT_LOW
> state, which means all cgroups will be throttled according to their low
> limit. After all cgroups with low limit cross the limit, the queue state
> gets upgraded to LIMIT_MAX state.

As with the previous patch, I don't think it's correct to default to
MAX for LIMIT_LOW and then enforce it.  Without specific
configuration, a cgroup should always be above low (so, no latency
requirement) and below max (no hard limiting).

Thanks.

-- 
tejun


[LSF/MM TOPIC] [LSF/MM ATTEND] md raid general discussion

2017-01-09 Thread Coly Li
Hi Folks,

I'd like to propose a general md raid discussion; it is quite necessary
for most of the active md raid developers to sit together and discuss
the current challenges of Linux software raid and development trends.

In the last few years, we have had many development activities in md
raid, e.g. raid5 cache, raid1 clustering, partial parity log, fast fail
upstreaming, and some effort on raid1 & raid0 performance improvement.

I see there is some functionality overlap between r5cache (raid5 cache)
and PPL (partial parity log); currently I have no idea where we will go
with these two development activities.
Also, I have received reports from users that better raid1 performance
is desired when it is built on NVMe SSDs as a cache (maybe bcache or
dm-cache). I am working on some raid1 performance improvements (e.g. a
new raid1 I/O barrier and lockless raid1 I/O submission), and have some
more ideas to discuss.

Therefore, if md raid developers have a chance to sit together and
discuss how to collaborate efficiently in the coming year, it will be
much more productive than communicating on the mailing list.

Finally, let me introduce myself for people who don't know me. My name
is Coly Li. I used to work on OCFS2 and Ext4 for SUSE Linux; now I work
with Neil Brown and Hannes Reinecke to maintain block layer code for
SUSE Linux, mostly focusing on drivers/md/*.

Thanks.

Coly Li



Re: [PATCH V5 03/17] blk-throttle: add .low interface

2017-01-09 Thread Tejun Heo
Happy new year, Shaohua.

Sorry about the long delay.

On Thu, Dec 15, 2016 at 12:32:54PM -0800, Shaohua Li wrote:
> Add low limit for cgroup and corresponding cgroup interface.

It'd be nice to explain why we're adding separate _conf fields.

> +static void blk_throtl_update_valid_limit(struct throtl_data *td)

I think blk_throtl_update_limit_valid() would be more consistent.

> +{
> + struct cgroup_subsys_state *pos_css;
> + struct blkcg_gq *blkg;
> + bool low_valid = false;
> +
> + rcu_read_lock();
> + blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
> + struct throtl_grp *tg = blkg_to_tg(blkg);
> +
> + if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
> + tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
> + tg->iops[READ][LIMIT_LOW] != UINT_MAX ||
> + tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
> + low_valid = true;

It's weird that it's defaulting to MAX.  Shouldn't it default to 0?
When we enable these limits on a cgroup, we want it to not affect the
operation without further configuration.  For the max limit, MAX does
that, as being over the limit is what changes the behavior.  For low,
it's the other way around.  We enforce the latency target if cgroups
are under the low limit, and thus 0 should be the no-op default value,
the same as in memcg.

> + }
> + rcu_read_unlock();
> +
> + if (low_valid)
> + td->limit_valid[LIMIT_LOW] = true;
> + else
> + td->limit_valid[LIMIT_LOW] = false;

Maybe
td->limit_valid[LIMIT_LOW] = low_valid;

Thanks.

-- 
tejun


Re: [PATCH] virtio_blk: avoid DMA to stack for the sense buffer

2017-01-09 Thread Michael S. Tsirkin
On Wed, Jan 04, 2017 at 08:25:05AM +0300, Christoph Hellwig wrote:
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel.  Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/virtio_blk.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 5545a67..3c3b8f6 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -56,6 +56,7 @@ struct virtblk_req {
>   struct virtio_blk_outhdr out_hdr;
>   struct virtio_scsi_inhdr in_hdr;
>   u8 status;
> + u8 sense[SCSI_SENSE_BUFFERSIZE];
>   struct scatterlist sg[];
>  };
>  
> @@ -102,7 +103,8 @@ static int __virtblk_add_req(struct virtqueue *vq,
>   }
>  
>   if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
> - sg_init_one(&sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
> + memcpy(vbr->sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
> + sg_init_one(&sense, vbr->sense, SCSI_SENSE_BUFFERSIZE);

I would prefer sizeof(vbr->sense) here to avoid duplication.
Otherwise:

Acked-by: Michael S. Tsirkin 


>   sgs[num_out + num_in++] = &sense;
>   sg_init_one(&inhdr, &vbr->in_hdr, sizeof(vbr->in_hdr));
>   sgs[num_out + num_in++] = &inhdr;
> -- 
> 2.1.4
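
Concretely, the change being asked for is small.  A sketch only, using
the field names from the quoted patch (the final form is of course up
to the author):

    /* sizeof(vbr->sense) is tied to the field itself, so the copy
     * length and the scatterlist length cannot drift from the actual
     * buffer size if it ever changes. */
    memcpy(vbr->sense, vbr->req->sense, sizeof(vbr->sense));
    sg_init_one(&sense, vbr->sense, sizeof(vbr->sense));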


Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jeff Moyer
Chandan Rajendra  writes:

> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
>
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.
>
> Signed-off-by: Chandan Rajendra 
> ---
>  fs/direct-io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index aeae8c0..b20adf9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -905,6 +905,7 @@ static inline void dio_zero_block(struct dio *dio, struct 
> dio_submit *sdio,
>  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
>   struct buffer_head *map_bh)
>  {
> + const unsigned i_blkbits = dio->inode->i_blkbits;

NAK.  Please see commit ab73857e354 ("direct-io: don't read
inode->i_blkbits multiple times").

I think you want:

const unsigned i_blkbits = sdio->blkbits + sdio->blkfactor;

as is used elsewhere in the code.

Cheers,
Jeff


Re: [PATCH] virtio_blk: avoid DMA to stack for the sense buffer

2017-01-09 Thread Jens Axboe
On 01/09/2017 06:35 AM, Christoph Hellwig wrote:
> Is someone going to pick the patch up and send it to Linus?  I keep
> running into all kinds of boot failures whenever I forget to cherry
> pick it into my development trees..

I'll add it.

-- 
Jens Axboe



Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jens Axboe
On 01/08/2017 07:47 AM, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Thanks, added for 4.10.

-- 
Jens Axboe



Re: [RFC] blk: increase logical_block_size to unsigned int

2017-01-09 Thread Jerome Marchand


- Original Message -
> From: "Sergey Senozhatsky" 
> To: "Minchan Kim" 
> Cc: "Jens Axboe" , "Hyeoncheol Lee" , 
> linux-block@vger.kernel.org,
> linux-ker...@vger.kernel.org, "Andrew Morton" , 
> "Sergey Senozhatsky"
> , "Robert Jennings" , 
> "Jerome Marchand" 
> Sent: Monday, January 9, 2017 3:33:44 PM
> Subject: Re: [RFC] blk: increase logical_block_size to unsigned int
> 
> On (01/09/17 14:04), Minchan Kim wrote:
> > Mostly, zram is used as swap system on embedded world so it want to do IO
> > as PAGE_SIZE aligned/size IO unit. For that, one of the problem was
> > blk_queue_logical_block_size(zram->disk->queue, PAGE_SIZE) made overflow
> > in *64K page system* so [1] changed it to constant 4096.
> > Since that, partial IO can happen so zram should handle it which makes zram
> > complicated[2].
> > 
> 
> I thought that zram partial IO support is there because some file
> systems cannot cope with a large logical_block_size, like FAT, for
> example. Am I wrong?

Yes indeed. When we discussed the patch that added partial I/O support,
increasing the logical block size was considered. The reason we didn't
take the easy path was that not all block users could handle 64k
blocks. FAT is one of them.

Jerome

> 
>   -ss
> 
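
As an aside, the overflow Minchan describes is easy to see in a small
userspace sketch, assuming the queue limit field has the unsigned
short width that the RFC proposes widening to unsigned int:

    #include <stdio.h>

    int main(void)
    {
            unsigned long page_size = 65536;    /* 64K page system */
            unsigned short logical_block_size;  /* assumed width of the old field */

            /* 65536 does not fit in 16 bits, so the stored value wraps to 0;
             * this is why zram ended up using a constant 4096 instead of
             * PAGE_SIZE and grew the partial IO handling mentioned above. */
            logical_block_size = (unsigned short)page_size;
            printf("stored logical_block_size = %u\n", logical_block_size);
            return 0;
    }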


Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

2017-01-09 Thread Theodore Ts'o
So in the model where the Flash side is tracking the logical to
physical zone mapping, and the host is merely expecting the ZBC
interface, one way it could work is as follows.

1)  The flash signals that a particular zone should be reset soon.

2)  If the host does not honor the request, eventually the flash will
have to do a forced copy of the zone to a new erase block.  (This
is a fail-safe and shouldn't happen under normal circumstances.)

(By the way, this model can be used for any number of things.  For
example, for cloud workloads where tail latency is really
important, it would be really cool if T10/T13 adopted a way that
the host could be notified about the potential need for ATI
remediation in a particular disk region, so the host could
schedule it when it would be least likely to impact high priority,
low latency workloads.  If the host fails to give permission to
the firmware to do the ATI remediation before the "gotta go"
deadline is exceeded, the disk could do the ATI remediation at
that point to assure data integrity as a fail-safe.)

3) The host, since it has better knowledge of which blocks belong to
which inode, and which inodes are more likely to have identical
object lifetimes (for example, all of the .o files in a directory are
likely to be deleted at the same time when the user runs "make
clean"; there was a Usenix or FAST paper over a decade ago that
pointed out that doing heuristics based on file names was likely
to be helpful), can do a better job of distributing the blocks to
different partially filled sequential write preferred / sequential
write required zones.

The idea here is that you might have multiple zones that are
partially filled based on expected object lifetime predictions.
Or the host could move blocks based on the knowledge that a
particular zone already has blocks that will share the same fate
(e.g., belong to the same inode) --- this is knowledge that the FTL
cannot know, so with a sufficiently smart host file system, it ought
to be able to do a better job than the FTL.

4) Since we assumed that the Flash is tracking logical to physical zone
   mappings, and the host is responsible for everything else, if the
   host decides to move blocks to different SMR zones, the host file
   system will be responsible for updating its existing (inode,
   logical block) to physical block (SMR zone plus offset) mapping
   tables.
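
Purely as an illustration of the bookkeeping point 4 implies (not code
from any existing file system; all names are made up), the host-side
table could be as simple as an array or tree of entries like:

    /* Hypothetical host-side mapping entry: the file system looks up
     * (inode, logical block) and gets back a zone plus an offset within
     * that zone.  Migrating a block to another zone only means rewriting
     * this one entry, since the device still owns whatever mapping sits
     * underneath the zone. */
    struct fs_block_mapping {
            unsigned long ino;          /* owning inode */
            unsigned long logical_blk;  /* block offset within the file */
            unsigned int  zone;         /* SMR or flash zone number */
            unsigned int  zone_offset;  /* block offset within that zone */
    };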

The main advantage of this model is that, to the extent that there are
cloud/enterprise customers who are already implementing Host Aware SMR
storage solutions, they might be able to reuse code already written
for SMR HDDs for this model/interface.  Yes, some tweaks would
probably be needed since the design tradeoffs for disks and flash are
very different.  But the point is that the Host Managed and Host Aware
SMR models are ones that are well understood by everyone.

 

There is another model you might consider, and it's one which
Christoph Hellwig suggested at an LSF/MM at least 2-3 years ago, and
this is a model where the flash or the SMR disk could use a division
of labor similar to Object Based Disks (except hopefully with a less
awful interface).  The idea here is that you give up on LBA numbers,
and instead you move the entire responsibility of mapping (inode,
logical block) to (physical location) to the storage device.  The file
system would then be responsible for managing metadata (mod times,
user/group ownership, permission bits/ACL's, etc) and namespace issues
(e.g., directory pathnames to inode lookups).

So this solves the problem you seem to be concerned about in terms of
keeping mapping information at two layers, and it solves it
completely, since the file system no longer has to do a mapping
from inode+logical offset to LBA number, which it would in the
models you've outlined to date.  It also solves the problem of giving
the storage device more information about which blocks belong to which
inode/object, and it would also make it easier for the OS to pass
object lifetime and shared fate hints to the storage device.  This
should hopefully allow the FTL or STL to do a better job, since it now
has access to low-level hardware information (e.g., BER / Soft ECC
failures) as well as higher-level object information when making
storage layout and garbage collection decisions.

 

A fair criticism of all of the models discussed to date (the one based
on ZBC, the object-based storage model, and OCSSD) is that none of them
have mature implementations, either in the open source or the closed
source world.  But since this is true for *all* of them, we should be
using other criteria to decide which model is the best one to choose
for the long term.

The advantage of the ZBC model is that people have had several years
to consider and understand the model, so in terms of mind share ...

Re: [RFC] blk: increase logical_block_size to unsigned int

2017-01-09 Thread Sergey Senozhatsky
On (01/09/17 14:04), Minchan Kim wrote:
> Mostly, zram is used as swap system on embedded world so it want to do IO
> as PAGE_SIZE aligned/size IO unit. For that, one of the problem was
> blk_queue_logical_block_size(zram->disk->queue, PAGE_SIZE) made overflow
> in *64K page system* so [1] changed it to constant 4096.
> Since that, partial IO can happen so zram should handle it which makes zram
> complicated[2].
> 

I thought that zram partial IO support is there because some file
systems cannot cope with a large logical_block_size, like FAT, for
example. Am I wrong?

-ss


Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Christoph Hellwig
On Sun, Jan 08, 2017 at 08:17:10PM +0530, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH] virtio_blk: avoid DMA to stack for the sense buffer

2017-01-09 Thread Christoph Hellwig
Is someone going to pick the patch up and send it to Linus?  I keep
running into all kinds of boot failures whenever I forget to cherry
pick it into my development trees..



LSF/MM 2017: Call for Proposals closes January 15th

2017-01-09 Thread Jeff Layton
We initially sent this pretty early this year, so this is a resend in   
case anyone missed the first posting. The call for topics and attendance
requests is open until January 15th, 2017.

The original message follows:

--8<

The annual Linux Storage, Filesystem and Memory Management (LSF/MM)
Summit for 2017 will be held on March 20th and 21st at the Hyatt
Cambridge, Cambridge, MA. LSF/MM is an invitation-only technical
workshop to map out improvements to the Linux storage, filesystem and
memory management subsystems that will make their way into the mainline
kernel within the coming years.


http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit

Like last year, LSF/MM will be colocated with the Linux Foundation Vault
conference which takes place on March 22nd and 23rd in the same Venue.
For those that do not know, Vault is designed to be an event where open
source storage and filesystem practitioners meet storage implementors
and, as such, it would be of benefit for LSF/MM attendees to attend.

Unlike past years, Vault admission is not free for LSF/MM attendees this
year unless they're giving a talk. There is a discount for LSF/MM
attendees, however we would also like to encourage folks to submit talk
proposals to speak at the Vault conference.

http://events.linuxfoundation.org/events/vault

On behalf of the committee I am issuing a call for agenda proposals that
are suitable for cross-track discussion as well as technical subjects
for the breakout sessions.

If advance notice is required for visa applications then please point
that out in your proposal or request to attend, and submit the topic
as soon as possible.

1) Proposals for agenda topics should be sent before January 15th, 2017
to:

lsf...@lists.linux-foundation.org

and cc the Linux list or lists that are relevant for the topic in
question:

ATA:   linux-...@vger.kernel.org
Block: linux-block@vger.kernel.org
FS:linux-fsde...@vger.kernel.org
MM:linux...@kvack.org
SCSI:  linux-s...@vger.kernel.org
NVMe:  linux-n...@lists.infradead.org

Please tag your proposal with [LSF/MM TOPIC] to make it easier to track.
In addition, please make sure to start a new thread for each topic
rather than following up to an existing one.  Agenda topics and
attendees will be selected by the program committee, but the final
agenda will be formed by consensus of the attendees on the day.

2) Requests to attend the summit for those that are not proposing a
topic should be sent to:

lsf...@lists.linux-foundation.org

Please summarise what expertise you will bring to the meeting, and what
you would like to discuss. Please also tag your email with [LSF/MM
ATTEND] and send it as a new thread so there is less chance of it
getting lost.

We will try to cap attendance at around 25-30 per track to facilitate
discussions although the final numbers will depend on the room sizes at
the venue.

Brief presentations are allowed to guide discussion, but are strongly
discouraged. There will be no recording or audio bridge. However, we
expect that written minutes will be published as we did in previous
years:

2016: https://lwn.net/Articles/lsfmm2016/

2015: https://lwn.net/Articles/lsfmm2015/

2014: http://lwn.net/Articles/LSFMM2014/

2013: http://lwn.net/Articles/548089/

3) If you have feedback on last year's meeting that we can use to
improve this year's, please also send that to:

lsf...@lists.linux-foundation.org

Thank you on behalf of the program committee:

Storage:
James Bottomley
Martin K. Petersen (track chair)
Sagi Grimberg

Filesystems:
Anna Schumaker
Chris Mason
Eric Sandeen
Jan Kara
Jeff Layton (summit chair)
Josef Bacik (track chair)
Trond Myklebust

MM:
Johannes Weiner
Rik van Riel (track chair)
-- 
Jeff Layton 