Re: [xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-14 Thread Eryu Guan
On Wed, Jun 14, 2017 at 07:55:17AM -0400, Jeff Layton wrote:
> On Tue, 2017-06-13 at 16:40 +0800, Eryu Guan wrote:
> > On Mon, Jun 12, 2017 at 08:42:13AM -0400, Jeff Layton wrote:
> > > Make a new btrfs/999 test that works the way Chris Mason suggested:
> > > 
> > > Build a filesystem with 2 devices that stripes the data across
> > > both devices, but mirrors metadata across both. Then, make one
> > > of the devices fail and see how fsync is handled.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  tests/btrfs/999   | 93 
> > > +++
> > 
> > Missing btrfs/999.out file
> > 
> > >  tests/btrfs/group |  1 +
> > >  2 files changed, 94 insertions(+)
> > >  create mode 100755 tests/btrfs/999
> > > 
> > > diff --git a/tests/btrfs/999 b/tests/btrfs/999
> > > new file mode 100755
> > > index ..84031cc0d913
> > > --- /dev/null
> > > +++ b/tests/btrfs/999
> > > @@ -0,0 +1,93 @@
> > > +#! /bin/bash
> > > +# FS QA Test No. 999
> > > +#
> > > +# Open a file several times, write to it, fsync on all fds and make sure 
> > > that
> > > +# they all return 0. Change the device to start throwing errors. Write 
> > > again
> > > +# on all fds and fsync on all fds. Ensure that we get errors on all of 
> > > them.
> > > +# Then fsync on all one last time and verify that all return 0.
> > > +#
> > > +#---
> > > +# Copyright (c) 2017, Jeff Layton 
> > > +#
> > > +# This program is free software; you can redistribute it and/or
> > > +# modify it under the terms of the GNU General Public License as
> > > +# published by the Free Software Foundation.
> > > +#
> > > +# This program is distributed in the hope that it would be useful,
> > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > +# GNU General Public License for more details.
> > > +#
> > > +# You should have received a copy of the GNU General Public License
> > > +# along with this program; if not, write the Free Software Foundation,
> > > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > > +#---
> > > +
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +
> > > +here=`pwd`
> > > +tmp=/tmp/$$
> > > +status=1# failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > +cd /
> > > +rm -rf $tmp.* $testdir
> > > +_dmerror_cleanup
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +. ./common/dmerror
> > > +
> > > +# real QA test starts here
> > > +_supported_os Linux
> > > +_require_dm_target error
> > > +_require_test_program fsync-err
> > > +_require_test_program dmerror
> > > +
> > > +# bring up dmerror device
> > > +_scratch_unmount
> > > +_dmerror_init
> > > +
> > > +# Replace first device with error-test device
> > > +old_SCRATCH_DEV=$SCRATCH_DEV
> > > +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
> > > "s#$SCRATCH_DEV#$DMERROR_DEV#"`
> > > +SCRATCH_DEV=$DMERROR_DEV
> > > +
> > > +_require_scratch
> > > +_require_scratch_dev_pool
> > 
> > Need "_require_scratch_dev_pool_equal_size" too, since test creates
> > raid1 profile for metadata.

Sorry, it's not needed here. I got confused with device replace
operation, only "replace" needs this require rule. Thanks for
confirming!

Eryu


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 09:57 PM, Jens Axboe wrote:
> On 06/14/2017 09:53 PM, Andreas Dilger wrote:
>> On Jun 14, 2017, at 9:26 PM, Jens Axboe  wrote:
>>> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger  wrote:
 On Jun 14, 2017, at 10:04 AM, Martin K. Petersen 
  wrote:
> Christoph,
>
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
>
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).

 Just to clarify, does this mean that Jens' "lifetime" hints are going to
 be independent of separate "stream ID" values in the future (needing a
 similar, but independent, set of fields for stream IDs, or will they
 both be implemented using a common transport in the kernel (i.e. both
 share a single set of fields in the inode/bio/etc?
>>>
>>> There's no reason they can't coexist. Now that bio->bi_stream is gone
>>> and we just have the flags, if we want to expose separate "real" stream
>>> IDs, then we could later re-introduce that. It'd be pretty trivial to
>>> just use the most pertinent information on the driver side.
>>>
>>> From my perspective, all I really care about is the 4 hints. It's a
>>> simple enough interface that applications can understand and use it, and
>>> we don't need any management of actual stream IDs. I think that has the
>>> highest chance of success. Modifying an application to use it is
>>> trivial, even something like RocksDB (if you havehad to make changes
>>> to RocksDB, you'll get this).
>>
>> If this is really to be only flags, rather than a hybrid of flags and IDs
>> (as I had thought), then it probably makes sense to limit the inode usage
>> to a few bits in i_flags using S_LIFETIME_* values rather than a separate
>> 16-bit i_stream field, which can be used for the actual stream IDs later.
> 
> Christoph alluded to that as well. And yes, if we are contemplating
> something else later on, then that does make more sense. I'll make that
> change.

Easy enough to do, see attached incremental patch. Only issue is that we
then have to lock the inode, and use the atomic flag setting. I'm assuming
that's safe to do from that path. We only need to grab the lock if the
hint changes, which is basically never should.


diff --git a/fs/inode.c b/fs/inode.c
index bd8bf44f3f31..db5914783a71 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,7 +149,6 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
inode->i_blocks = 0;
inode->i_bytes = 0;
inode->i_generation = 0;
-   inode->i_write_hint = 0;
inode->i_pipe = NULL;
inode->i_bdev = NULL;
inode->i_cdev = NULL;
diff --git a/fs/read_write.c b/fs/read_write.c
index 9cb2314efca3..70f8764ae117 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -675,6 +675,7 @@ EXPORT_SYMBOL(iov_shorten);
 static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
loff_t *ppos, int type, int flags)
 {
+   struct inode *inode = file_inode(filp);
struct kiocb kiocb;
ssize_t ret;
 
@@ -688,12 +689,19 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
-   if (flags & RWF_WRITE_LIFE_MASK) {
-   struct inode *inode = file_inode(filp);
+   if ((flags & RWF_WRITE_LIFE_MASK) ||
+   (inode->i_flags & S_WRITE_LIFE_MASK)) {
+   unsigned int hint, iflags;
 
-   inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
-   RWF_WRITE_LIFE_SHIFT;
-   kiocb.ki_flags |= inode->i_write_hint << IOCB_WRITE_LIFE_SHIFT;
+   hint = (flags & RWF_WRITE_LIFE_MASK) >> RWF_WRITE_LIFE_SHIFT;
+   iflags = hint << S_WRITE_LIFE_SHIFT;
+
+   if ((inode->i_flags & S_WRITE_LIFE_MASK) != iflags) {
+   inode_lock(inode);
+   inode_set_flags(inode, iflags, iflags);
+   inode_unlock(inode);
+   }
+   kiocb.ki_flags |= hint << IOCB_WRITE_LIFE_SHIFT;
}
kiocb.ki_pos = *ppos;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63798b67fcfe..929f1fc088c6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,10 +270,10 @@ struct writeback_control;
 #define IOCB_WRITE (1 << 6)
 
 /*
- * Steal 4-bits for stream information, this allows 16 valid streams
+ * Steal 3 bits for stream information, this allows 8 valid streams
  */
 #define IOCB_WRITE_LIFE_SHIFT  7
-#define IOCB_WRITE_LIFE_MASK   (BIT(7) | BIT(8) 

Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 10:15 PM, Darrick J. Wong wrote:
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 47c1d4484df9..9cb2314efca3 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
>> struct iov_iter *iter,
>>  struct kiocb kiocb;
>>  ssize_t ret;
>>  
>> -if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
>> +if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
>>  return -EOPNOTSUPP;
>>  
>>  init_sync_kiocb(, filp);
>> @@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
>> struct iov_iter *iter,
>>  kiocb.ki_flags |= IOCB_DSYNC;
>>  if (flags & RWF_SYNC)
>>  kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
>> +if (flags & RWF_WRITE_LIFE_MASK) {
>> +struct inode *inode = file_inode(filp);
>> +
>> +inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
>> +RWF_WRITE_LIFE_SHIFT;
> 
> Hmm, so once set, hints stick around until someone else sets a different
> one.  I suppose it's unlikely that you'd have two programs writing to
> the same inode with different write hints, right?

You'd hope so... There's really no good way to support that with
buffered writes. For the NVMe use case, you'd be no worse off than you
were without hints, however.

But I do think one change should be made above - we only reset the hint
if someone passes a new hint in. But we probably also want to do so for
the case where no hint is passed in, but one is currently set.

> Also, how does userspace query the write hint value once set?

It doesn't. Ideally this hint would be "for this write only", but that's
not really possible with deferred write back.

>> +/*
>> + * Data life time write flags, steal 4 bits for that
>> + */
>> +#define RWF_WRITE_LIFE_SHIFT4
>> +#define RWF_WRITE_LIFE_MASK 0x00f0 /* 4 bits of stream ID */
>> +#define RWF_WRITE_LIFE_SHORT(1 << RWF_WRITE_LIFE_SHIFT)
>> +#define RWF_WRITE_LIFE_MEDIUM   (2 << RWF_WRITE_LIFE_SHIFT)
>> +#define RWF_WRITE_LIFE_LONG (3 << RWF_WRITE_LIFE_SHIFT)
>> +#define RWF_WRITE_LIFE_EXTREME  (4 << RWF_WRITE_LIFE_SHIFT)
> 
> Should O_TMPFILE files ought to be created with i_write_hint =
> RWF_WRITE_LIFE_SHORT by default?

The answer here is "it depends". If we're already using hints on that
device, then yes, O_TMPFILE is a clear candidate for
RWF_WRITE_LIFE_SHORT. However, if we are not, then we should not set it
as it may have implications on how the device manages writes. For that
case we'll potentially only be driving a single stream, short writes,
and that may not be enough to saturate device bandwidth.

I would rather leave that for future experimentation. There are similar
things we can do with short lived writes, like apply them to the log
writes in the file system. But all of that should be built on top of
what we end up agreeing on, not included from the get-go. I'd rather get
the basic usage and model going first before we further complicate
matters.

-- 
Jens Axboe



Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Darrick J. Wong
On Wed, Jun 14, 2017 at 09:45:05PM -0600, Jens Axboe wrote:
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
> 
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
> 
> Define IOCB flags to carry this over this information from the pwritev2
> RWF_WRITE_LIFE_* flags.
> 
> Reviewed-by: Andreas Dilger 
> Signed-off-by: Jens Axboe 
> ---
>  fs/read_write.c |  9 -
>  include/linux/fs.h  | 12 
>  include/uapi/linux/fs.h | 10 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..9cb2314efca3 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
> struct iov_iter *iter,
>   struct kiocb kiocb;
>   ssize_t ret;
>  
> - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> + if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
>   return -EOPNOTSUPP;
>  
>   init_sync_kiocb(, filp);
> @@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
> struct iov_iter *iter,
>   kiocb.ki_flags |= IOCB_DSYNC;
>   if (flags & RWF_SYNC)
>   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + if (flags & RWF_WRITE_LIFE_MASK) {
> + struct inode *inode = file_inode(filp);
> +
> + inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
> + RWF_WRITE_LIFE_SHIFT;

Hmm, so once set, hints stick around until someone else sets a different
one.  I suppose it's unlikely that you'd have two programs writing to
the same inode with different write hints, right?

Just wondering if anyone will be surprised that they thought they were
writing to an _EXTREME hint file but someone else switched it to _SHORT
on them.

Also, how does userspace query the write hint value once set?

> + kiocb.ki_flags |= inode->i_write_hint << IOCB_WRITE_LIFE_SHIFT;
> + }
>   kiocb.ki_pos = *ppos;
>  
>   if (type == READ)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f4f9df8ed059..63798b67fcfe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -269,6 +269,12 @@ struct writeback_control;
>  #define IOCB_SYNC(1 << 5)
>  #define IOCB_WRITE   (1 << 6)
>  
> +/*
> + * Steal 4-bits for stream information, this allows 16 valid streams
> + */
> +#define IOCB_WRITE_LIFE_SHIFT7
> +#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9) | BIT(10))
> +
>  struct kiocb {
>   struct file *ki_filp;
>   loff_t  ki_pos;
> @@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
> struct file *filp)
>   };
>  }
>  
> +static inline int iocb_write_hint(const struct kiocb *iocb)
> +{
> + return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
> + IOCB_WRITE_LIFE_SHIFT;
> +}
> +
>  /*
>   * "descriptor" for what we're up to with a read.
>   * This allows us to use the same read code yet
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 24e61a54feaa..58b7ee06b380 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -361,4 +361,14 @@ struct fscrypt_key {
>  #define RWF_DSYNC0x0002 /* per-IO O_DSYNC */
>  #define RWF_SYNC 0x0004 /* per-IO O_SYNC */
>  
> +/*
> + * Data life time write flags, steal 4 bits for that
> + */
> +#define RWF_WRITE_LIFE_SHIFT 4
> +#define RWF_WRITE_LIFE_MASK  0x00f0 /* 4 bits of stream ID */
> +#define RWF_WRITE_LIFE_SHORT (1 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_MEDIUM(2 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_LONG  (3 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_EXTREME   (4 << RWF_WRITE_LIFE_SHIFT)

Should O_TMPFILE files ought to be created with i_write_hint =
RWF_WRITE_LIFE_SHORT by default?

--D

> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.7.4
> 


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Andreas Dilger
On Jun 14, 2017, at 9:26 PM, Jens Axboe  wrote:
> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger  wrote:
>> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen 
>>  wrote:
>>> Christoph,
>>> 
 I think what Martin wants (or at least what I'd want him to want) is
 to define a few REQ_* bits that mirror the RWF bits, use that to
 transfer the information down the stack, and then only translate it
 to stream ids in the driver.
>>> 
>>> Yup. If we have enough space in the existing flags that's perfect (I
>>> lost count after your op/flag shuffle).
>> 
>> Just to clarify, does this mean that Jens' "lifetime" hints are going to
>> be independent of separate "stream ID" values in the future (needing a
>> similar, but independent, set of fields for stream IDs, or will they
>> both be implemented using a common transport in the kernel (i.e. both
>> share a single set of fields in the inode/bio/etc?
> 
> There's no reason they can't coexist. Now that bio->bi_stream is gone
> and we just have the flags, if we want to expose separate "real" stream
> IDs, then we could later re-introduce that. It'd be pretty trivial to
> just use the most pertinent information on the driver side.
> 
> From my perspective, all I really care about is the 4 hints. It's a
> simple enough interface that applications can understand and use it, and
> we don't need any management of actual stream IDs. I think that has the
> highest chance of success. Modifying an application to use it is
> trivial, even something like RocksDB (if you havehad to make changes
> to RocksDB, you'll get this).

If this is really to be only flags, rather than a hybrid of flags and IDs
(as I had thought), then it probably makes sense to limit the inode usage
to a few bits in i_flags using S_LIFETIME_* values rather than a separate
16-bit i_stream field, which can be used for the actual stream IDs later.

>> I can imagine applications using either the lifetime hints (mapped to
>> stream IDs internally), or explicitly managing stream IDs, but it seems
>> unlikely that both would be in use at the same time, unless the "lifetime"
>> hints are intended to also indirectly map to HSM-like tiered cache/storage
>> hints internally to the storage stack (e.g. whether writes should be put
>> into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?
> 
> Agree, both would be viable solutions, and they can coexist. Explicitly
> managing stream IDs is a lot more advanced, and the 4 hints will get
> most people 95% of the way there. For explicit streams, we'd need
> something similar to my previous versions of this patchset, where a full
> API is provided.
> 
>> Jens' patches were moving in the direction of allowing up to 16 stream IDs,
>> with the default being IDs 1-4 are intended to reflect different data
>> lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
>> if userspace wants to manage this more explicitly.
> 
> I don't think we should mix the two up. As mentioned above, one will
> require a full API and kernel management of the streams. I had a lot of
> push back on that approach previously, which is why I've now gone for
> the simpler version. If we want to expose stream IDs explicitly to user
> space, that's a lot more involved.
> 
> --
> Jens Axboe


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 09:53 PM, Andreas Dilger wrote:
> On Jun 14, 2017, at 9:26 PM, Jens Axboe  wrote:
>> On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger  wrote:
>>> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen 
>>>  wrote:
 Christoph,

> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.

 Yup. If we have enough space in the existing flags that's perfect (I
 lost count after your op/flag shuffle).
>>>
>>> Just to clarify, does this mean that Jens' "lifetime" hints are going to
>>> be independent of separate "stream ID" values in the future (needing a
>>> similar, but independent, set of fields for stream IDs, or will they
>>> both be implemented using a common transport in the kernel (i.e. both
>>> share a single set of fields in the inode/bio/etc?
>>
>> There's no reason they can't coexist. Now that bio->bi_stream is gone
>> and we just have the flags, if we want to expose separate "real" stream
>> IDs, then we could later re-introduce that. It'd be pretty trivial to
>> just use the most pertinent information on the driver side.
>>
>> From my perspective, all I really care about is the 4 hints. It's a
>> simple enough interface that applications can understand and use it, and
>> we don't need any management of actual stream IDs. I think that has the
>> highest chance of success. Modifying an application to use it is
>> trivial, even something like RocksDB (if you havehad to make changes
>> to RocksDB, you'll get this).
> 
> If this is really to be only flags, rather than a hybrid of flags and IDs
> (as I had thought), then it probably makes sense to limit the inode usage
> to a few bits in i_flags using S_LIFETIME_* values rather than a separate
> 16-bit i_stream field, which can be used for the actual stream IDs later.

Christoph alluded to that as well. And yes, if we are contemplating
something else later on, then that does make more sense. I'll make that
change.

-- 
Jens Axboe



[PATCH 09/11] xfs: add support for passing in write hints for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..fe11fe47d235 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
return status;
}
 
+   ioend->io_bio->bi_opf |= 
bio_op_write_hint(inode_write_hint(ioend->io_inode));
submit_bio(ioend->io_bio);
return 0;
 }
@@ -564,6 +565,7 @@ xfs_chain_bio(
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+   ioend->io_bio->bi_opf |= 
bio_op_write_hint(inode_write_hint(ioend->io_inode));
submit_bio(ioend->io_bio);
ioend->io_bio = new;
 }
-- 
2.7.4



[PATCH 08/11] ext4: add support for passing in write hints for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..764bf0ddecd4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -349,6 +349,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
if (bio) {
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
  REQ_SYNC : 0;
+   io_op_flags |= 
bio_op_write_hint(inode_write_hint(io->io_end->inode));
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
submit_bio(io->io_bio);
}
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
+   io->io_bio->bi_opf |= 
bio_op_write_hint(inode_write_hint(inode));
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
-- 
2.7.4



[PATCH 03/11] fs: add support for an inode to carry stream related data

2017-06-14 Thread Jens Axboe
No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes.

Pack the i_write_hint field into a 2-byte hole, so we don't grow
the size of the inode.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/inode.c | 1 +
 include/linux/fs.h | 9 +
 2 files changed, 10 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..bd8bf44f3f31 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
inode->i_blocks = 0;
inode->i_bytes = 0;
inode->i_generation = 0;
+   inode->i_write_hint = 0;
inode->i_pipe = NULL;
inode->i_bdev = NULL;
inode->i_cdev = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..f4f9df8ed059 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -591,6 +591,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_write_hint;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
@@ -655,6 +656,14 @@ struct inode {
void*i_private; /* fs or device private pointer */
 };
 
+static inline unsigned int inode_write_hint(struct inode *inode)
+{
+   if (inode)
+   return inode->i_write_hint;
+
+   return 0;
+}
+
 static inline unsigned int i_blocksize(const struct inode *node)
 {
return (1 << node->i_blkbits);
-- 
2.7.4



[PATCH 10/11] btrfs: add support for passing in write hints for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Chris Mason 
Signed-off-by: Jens Axboe 
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..2bc2dfca87c2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2826,6 +2826,7 @@ static int submit_extent_page(int op, int op_flags, 
struct extent_io_tree *tree,
bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
+   op_flags |= bio_op_write_hint(inode_write_hint(page->mapping->host));
bio_set_op_attrs(bio, op, op_flags);
if (wbc) {
wbc_init_bio(wbc, bio);
-- 
2.7.4



[PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Jens Axboe
Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Define IOCB flags to carry this over this information from the pwritev2
RWF_WRITE_LIFE_* flags.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/read_write.c |  9 -
 include/linux/fs.h  | 12 
 include/uapi/linux/fs.h | 10 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..9cb2314efca3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
return -EOPNOTSUPP;
 
init_sync_kiocb(, filp);
@@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   if (flags & RWF_WRITE_LIFE_MASK) {
+   struct inode *inode = file_inode(filp);
+
+   inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
+   RWF_WRITE_LIFE_SHIFT;
+   kiocb.ki_flags |= inode->i_write_hint << IOCB_WRITE_LIFE_SHIFT;
+   }
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4f9df8ed059..63798b67fcfe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
 
+/*
+ * Steal 4-bits for stream information, this allows 16 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT  7
+#define IOCB_WRITE_LIFE_MASK   (BIT(7) | BIT(8) | BIT(9) | BIT(10))
+
 struct kiocb {
struct file *ki_filp;
loff_t  ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
};
 }
 
+static inline int iocb_write_hint(const struct kiocb *iocb)
+{
+   return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+   IOCB_WRITE_LIFE_SHIFT;
+}
+
 /*
  * "descriptor" for what we're up to with a read.
  * This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..58b7ee06b380 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -361,4 +361,14 @@ struct fscrypt_key {
 #define RWF_DSYNC  0x0002 /* per-IO O_DSYNC */
 #define RWF_SYNC   0x0004 /* per-IO O_SYNC */
 
+/*
+ * Data life time write flags, steal 4 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT   4
+#define RWF_WRITE_LIFE_MASK0x00f0 /* 4 bits of stream ID */
+#define RWF_WRITE_LIFE_SHORT   (1 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM  (2 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG(3 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME (4 << RWF_WRITE_LIFE_SHIFT)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.7.4



[PATCHSET v4] Add support for write life time hints

2017-06-14 Thread Jens Axboe
A new iteration of this patchset, previously known as write streams.
As before, this patchset aims at enabling applications split up
writes into separate streams, based on the perceived life time
of the data written. This is useful for a variety of reasons:

- With NVMe 1.3 compliant devices, the device can expose multiple
  streams. Separating data written into streams based on life time
  can drastically reduce the write amplification. This helps device
  endurance, and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25%
  reduction in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values anymore.
I've previously advocated for just doing a set of hints that makes sense
instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We define 4 flags for the pwritev2
system call:

RWF_WRITE_LIFE_SHORTData written with this flag is expected to have
a high overwrite rate, or life time.

RWF_WRITE_LIFE_MEDIUM   Longer life time than SHORT

RWF_WRITE_LIFE_LONG Longer life time than MEDIUM

RWF_WRITE_LIFE_EXTREME  Longer life time than LONG

The idea is that these are relative values, so an application can
use them as they see fit. The underlying device can then place
data appropriately, or be free to ignore the hint. It's just a hint.

A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.4

Changes since v3:

- Change any naming of stream ID to write hint.
- Various little API changes, suggested by Christoph
- Cleanup the NVMe bits, dump the debug info.
- Change NVMe to lazily allocate the streams.
- Various NVMe error handling improvements and command checking.

Changes since v2:

- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:

- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans it
  up in general.
- Kill the block internal definitions of the stream type, we don't need
  them anymore. See above.

-- 
Jens Axboe



[PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..de4301168710 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
should_dirty = true;
} else {
bio.bi_opf = dio_bio_write_op(iocb);
+   bio.bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
task_io_account_write(ret);
}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio_set_pages_dirty(bio);
} else {
bio->bi_opf = dio_bio_write_op(iocb);
+   bio->bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
task_io_account_write(bio->bi_iter.bi_size);
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..98874478ec8a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
 
+   bio->bi_opf |= bio_op_write_hint(iocb_write_hint(dio->iocb));
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..7e18e760e421 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
 
if (dio->flags & IOMAP_DIO_WRITE) {
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | 
REQ_IDLE);
+   bio->bi_opf |= 
bio_op_write_hint(inode_write_hint(inode));
task_io_account_write(bio->bi_iter.bi_size);
} else {
bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4



[PATCH 07/11] fs: add support for buffered writeback to pass down write hints

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/buffer.c | 14 +-
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..3faf73a71d4b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc);
+unsigned int stream, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc)
+unsigned int write_hint, struct writeback_control *wbc)
 {
struct bio *bio;
 
@@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
op_flags |= REQ_META;
if (buffer_prio(bh))
op_flags |= REQ_PRIO;
+
+   op_flags |= bio_op_write_hint(write_hint);
bio_set_op_attrs(bio, op, op_flags);
 
submit_bio(bio);
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-   return submit_bh_wbc(op, op_flags, bh, NULL);
+   return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..df0635c8a512 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct 
writeback_control *wbc,
goto confused;
 
wbc_init_bio(wbc, bio);
+   bio->bi_opf |= bio_op_write_hint(inode_write_hint(inode));
}
 
/*
-- 
2.7.4



[PATCH 05/11] block: add helpers for setting/checking write hint validity

2017-06-14 Thread Jens Axboe
We map the RWF_WRITE_* life time flags to the internal flags.
Drivers can then, in turn, map those flags to a suitable stream
type.

Signed-off-by: Jens Axboe 
---
 block/bio.c   | 16 
 include/linux/bio.h   |  1 +
 include/linux/blk_types.h |  5 +
 3 files changed, 22 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..8804439945b2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2082,6 +2082,22 @@ void bio_clone_blkcg_association(struct bio *dst, struct 
bio *src)
 
 #endif /* CONFIG_BLK_CGROUP */
 
+static const unsigned int rwf_write_to_opf_flag[] = {
+   0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
+};
+
+/*
+ * 'stream_flags' is one of RWF_WRITE_LIFE_* values
+ */
+unsigned int bio_op_write_hint(unsigned int rwf_flags)
+{
+   if (WARN_ON_ONCE(rwf_flags >= ARRAY_SIZE(rwf_write_to_opf_flag)))
+   return 0;
+
+   return rwf_write_to_opf_flag[rwf_flags];
+}
+EXPORT_SYMBOL_GPL(bio_op_write_hint);
+
 static void __init biovec_init_slabs(void)
 {
int i;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..debd60d641dd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -443,6 +443,7 @@ extern struct bio *bio_copy_kern(struct request_queue *, 
void *, unsigned int,
 gfp_t, int);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
+extern unsigned int bio_op_write_hint(unsigned int rwf_flags);
 
 void generic_start_io_acct(int rw, unsigned long sectors,
   struct hd_struct *part);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 57d1eb530799..23646eb433e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,4 +323,9 @@ struct blk_rq_stat {
u64 batch;
 };
 
+static inline bool op_write_hint_valid(unsigned int opf)
+{
+   return (opf & REQ_WRITE_LIFE_MASK) != 0;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4



[PATCH 01/11] block: add support for carrying stream information in a bio

2017-06-14 Thread Jens Axboe
No functional changes in this patch, we just add four flags
that will be used to denote a stream type, and ensure that we
don't merge across different stream types.

Signed-off-by: Jens Axboe 
---
 block/blk-merge.c | 16 
 include/linux/blk_types.h | 11 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue 
*q,
return NULL;
 
/*
+* Don't allow merge of different streams, or for a stream with
+* non-stream IO.
+*/
+   if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+   return NULL;
+
+   /*
 * If we are allowed to merge, then append bio list
 * from next to rq and release next. merge_requests_fn
 * will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
!blk_write_same_mergeable(rq->bio, bio))
return false;
 
+   /*
+* Don't allow merge of different streams, or for a stream with
+* non-stream IO.
+*/
+   if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+   return false;
+
return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..57d1eb530799 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -201,6 +201,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+   __REQ_WRITE_SHORT,  /* short life time write */
+   __REQ_WRITE_MEDIUM, /* medium life time write */
+   __REQ_WRITE_LONG,   /* long life time write */
+   __REQ_WRITE_EXTREME,/* extremely long life time write */
 
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
@@ -221,6 +225,13 @@ enum req_flag_bits {
 #define REQ_PREFLUSH   (1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT(1ULL << __REQ_WRITE_SHORT)
+#define REQ_WRITE_MEDIUM   (1ULL << __REQ_WRITE_MEDIUM)
+#define REQ_WRITE_LONG (1ULL << __REQ_WRITE_LONG)
+#define REQ_WRITE_EXTREME  (1ULL << __REQ_WRITE_EXTREME)
+
+#define REQ_WRITE_LIFE_MASK(REQ_WRITE_SHORT | REQ_WRITE_MEDIUM | \
+   REQ_WRITE_LONG | REQ_WRITE_EXTREME)
 
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
 
-- 
2.7.4



[PATCH 02/11] blk-mq: expose stream write stats through debugfs

2017-06-14 Thread Jens Axboe
Useful to verify that things are working the way they should.
Reading the file will return number of kb written to each
stream. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->stream_writes[] if they handle a stream.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 block/blk-mq-debugfs.c | 24 
 include/linux/blkdev.h |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..0a37c848961d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct 
blk_rq_stat *stat)
}
 }
 
+static int queue_streams_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_STREAM; i++)
+   seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
+
+   return 0;
+}
+
+static ssize_t queue_streams_store(void *data, const char __user *buf,
+  size_t count, loff_t *ppos)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_STREAM; i++)
+   q->stream_writes[i] = 0;
+
+   return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
{"state", 0600, queue_state_show, queue_state_write},
+   {"streams", 0600, queue_streams_show, queue_streams_store},
{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..88719c6f3edf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,9 @@ struct request_queue {
 
size_t  cmd_size;
void*rq_alloc_data;
+
+#define BLK_MAX_STREAM 5
+   u64 stream_writes[BLK_MAX_STREAM];
 };
 
 #define QUEUE_FLAG_QUEUED  1   /* uses generic tag queueing */
-- 
2.7.4



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On Wed, Jun 14, 2017 at 5:39 PM, Andreas Dilger  wrote:
> On Jun 14, 2017, at 10:04 AM, Martin K. Petersen  
> wrote:
>> Christoph,
>>
>>> I think what Martin wants (or at least what I'd want him to want) is
>>> to define a few REQ_* bits that mirror the RWF bits, use that to
>>> transfer the information down the stack, and then only translate it
>>> to stream ids in the driver.
>>
>> Yup. If we have enough space in the existing flags that's perfect (I
>> lost count after your op/flag shuffle).
>
> Just to clarify, does this mean that Jens' "lifetime" hints are going to
> be independent of separate "stream ID" values in the future (needing a
> similar, but independent, set of fields for stream IDs, or will they
> both be implemented using a common transport in the kernel (i.e. both
> share a single set of fields in the inode/bio/etc?

There's no reason they can't coexist. Now that bio->bi_stream is gone
and we just have the flags, if we want to expose separate "real" stream
IDs, then we could later re-introduce that. It'd be pretty trivial to
just use the most pertinent information on the driver side.

>From my perspective, all I really care about is the 4 hints. It's a
simple enough interface that applications can understand and use it, and
we don't need any management of actual stream IDs. I think that has the
highest chance of success. Modifying an application to use it is
trivial, even something like RocksDB (if you havehad to make changes
to RocksDB, you'll get this).

> I can imagine applications using either the lifetime hints (mapped to
> stream IDs internally), or explicitly managing stream IDs, but it seems
> unlikely that both would be in use at the same time, unless the "lifetime"
> hints are intended to also indirectly map to HSM-like tiered cache/storage
> hints internally to the storage stack (e.g. whether writes should be put
> into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?

Agree, both would be viable solutions, and they can coexist. Explicitly
managing stream IDs is a lot more advanced, and the 4 hints will get
most people 95% of the way there. For explicit streams, we'd need
something similar to my previous versions of this patchset, where a full
API is provided.

> Jens' patches were moving in the direction of allowing up to 16 stream IDs,
> with the default being IDs 1-4 are intended to reflect different data
> lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
> if userspace wants to manage this more explicitly.

I don't think we should mix the two up. As mentioned above, one will
require a full API and kernel management of the streams. I had a lot of
push back on that approach previously, which is why I've now gone for
the simpler version. If we want to expose stream IDs explicitly to user
space, that's a lot more involved.

-- 
Jens Axboe



Re: [PATCH 10/11] btrfs: add support for passing in stream information for buffered writes

2017-06-14 Thread Chris Mason

On Wed, Jun 14, 2017 at 01:05:33PM -0600, Jens Axboe wrote:

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 


Thanks Jens!

Signed-off-by: Chris Mason 


---
fs/btrfs/extent_io.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..b245085e8f10 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2827,6 +2827,7 @@ static int submit_extent_page(int op, int op_flags, 
struct extent_io_tree *tree,
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
bio_set_op_attrs(bio, op, op_flags);
+   bio_set_streamid(bio, inode_streamid(page->mapping->host));
if (wbc) {
wbc_init_bio(wbc, bio);
wbc_account_io(wbc, page, page_size);
--
2.7.4



[RFC PATCH 3/4] ext4: Set the bio REQ_NOENCRYPT flag

2017-06-14 Thread Michael Halcrow
When lower layers such as dm-crypt observe the REQ_NOENCRYPT flag, it
helps the I/O stack avoid redundant encryption, improving performance
and power utilization.

Note that lower layers must be consistent in their observation of this
flag in order to avoid the possibility of data corruption.

Signed-off-by: Michael Halcrow 
---
 fs/crypto/bio.c|  2 +-
 fs/ext4/ext4.h |  3 +++
 fs/ext4/inode.c| 13 -
 fs/ext4/page-io.c  |  5 +
 fs/ext4/readpage.c |  3 ++-
 5 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index a409a84f1bca..9093a715d2be 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -118,7 +118,7 @@ int fscrypt_zeroout_range(const struct inode *inode, 
pgoff_t lblk,
bio->bi_bdev = inode->i_sb->s_bdev;
bio->bi_iter.bi_sector =
pblk << (inode->i_sb->s_blocksize_bits - 9);
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+   bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_NOENCRYPT);
ret = bio_add_page(bio, ciphertext_page,
inode->i_sb->s_blocksize, 0);
if (ret != inode->i_sb->s_blocksize) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8e8046104f4d..48c2bc9f8688 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -206,7 +206,10 @@ typedef struct ext4_io_end {
ssize_t size;   /* size of the extent */
 } ext4_io_end_t;
 
+#define EXT4_IO_ENCRYPTED  1
+
 struct ext4_io_submit {
+   unsigned intio_flags;
struct writeback_control *io_wbc;
struct bio  *io_bio;
ext4_io_end_t   *io_end;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1bd0bfa547f6..25a9b7265692 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1154,10 +1154,11 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
!buffer_unwritten(bh) &&
(block_start < from || block_end > to)) {
-   ll_rw_block(REQ_OP_READ, 0, 1, );
-   *wait_bh++ = bh;
decrypt = ext4_encrypted_inode(inode) &&
S_ISREG(inode->i_mode);
+   ll_rw_block(REQ_OP_READ, (decrypt ? REQ_NOENCRYPT : 0),
+   1, );
+   *wait_bh++ = bh;
}
}
/*
@@ -3863,6 +3864,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
struct inode *inode = mapping->host;
struct buffer_head *bh;
struct page *page;
+   bool decrypt;
int err = 0;
 
page = find_or_create_page(mapping, from >> PAGE_SHIFT,
@@ -3905,13 +3907,14 @@ static int __ext4_block_zero_page_range(handle_t 
*handle,
 
if (!buffer_uptodate(bh)) {
err = -EIO;
-   ll_rw_block(REQ_OP_READ, 0, 1, );
+   decrypt = S_ISREG(inode->i_mode) &&
+   ext4_encrypted_inode(inode);
+   ll_rw_block(REQ_OP_READ, (decrypt ? REQ_NOENCRYPT : 0), 1, );
wait_on_buffer(bh);
/* Uhhuh. Read error. Complain and punt. */
if (!buffer_uptodate(bh))
goto unlock;
-   if (S_ISREG(inode->i_mode) &&
-   ext4_encrypted_inode(inode)) {
+   if (decrypt) {
/* We expect the key to be set. */
BUG_ON(!fscrypt_has_encryption_key(inode));
BUG_ON(blocksize != PAGE_SIZE);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..e25bf6cb216a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -349,6 +349,8 @@ void ext4_io_submit(struct ext4_io_submit *io)
if (bio) {
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
  REQ_SYNC : 0;
+   if (io->io_flags & EXT4_IO_ENCRYPTED)
+   io_op_flags |= REQ_NOENCRYPT;
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
submit_bio(io->io_bio);
}
@@ -358,6 +360,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
 void ext4_io_submit_init(struct ext4_io_submit *io,
 struct writeback_control *wbc)
 {
+   io->io_flags = 0;
io->io_wbc = wbc;
io->io_bio = NULL;
io->io_end = NULL;
@@ -499,6 +502,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
do {
if (!buffer_async_write(bh))
continue;
+   if (data_page)
+   io->io_flags |= EXT4_IO_ENCRYPTED;
ret = io_submit_add_bh(io, inode,
   data_page ? data_page : page, bh);
if (ret) 

[RFC PATCH 1/4] block: Add bio req flag to disable encryption in block

2017-06-14 Thread Michael Halcrow
When both the file system and a lower layer such as dm-crypt encrypt
the same file contents, it impacts performance and power utilization.
Depending on how the operating environment manages the encryption
keys, there is often no significant security benefit to redundantly
encrypting.

File systems that encrypt some of their blocks can set the
REQ_NOENCRYPT flag as a directive to lower layers to not encrypt.

Lower layers may optionally observe the flag, but once thay do, they
must continue to observe it on subsequent I/O on the device.
Otherwise they will decrypt content that they didn't previously
encrypt, resulting in data corruption.

Signed-off-by: Michael Halcrow 
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..89da8f5f7be1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -205,6 +205,7 @@ enum req_flag_bits {
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
 
+   __REQ_NOENCRYPT,/* ok to not encrypt */
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -223,6 +224,7 @@ enum req_flag_bits {
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
+#define REQ_NOENCRYPT  (1ULL << __REQ_NOENCRYPT)
 
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.13.1.508.gb3defc5cc-goog



[RFC PATCH 2/4] dm-crypt: Skip encryption of file system-encrypted blocks

2017-06-14 Thread Michael Halcrow
File systems can encrypt some of their data blocks with their own
encryption keys, and for those blocks another round of encryption at
the dm-crypt layer may be redundant, depending on the keys being used.

This patch enables dm-crypt to observe the REQ_NOENCRYPT flag as an
indicator that a bio request should bypass the dm-crypt encryption
queue.

By default dm-crypt will ignore this request flag from the file
system.  The user must set the allow_encrypt_override option to enable
this functionality.  Once the dm-crypt has been used with the
allow_encrypt_override option for any given block device, it must
continue to be used with the option to avoid the possibility of data
corruption.

Signed-off-by: Michael Halcrow 
---
 drivers/md/dm-crypt.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ebf9e72d479b..14ca8a6de3d9 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -125,7 +125,8 @@ struct iv_tcw_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
+DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD,
+DM_CRYPT_ENCRYPT_OVERRIDE };
 
 enum cipher_flags {
CRYPT_MODE_INTEGRITY_AEAD,  /* Use authenticated mode for cihper */
@@ -2572,6 +2573,8 @@ static int crypt_ctr_optional(struct dm_target *ti, 
unsigned int argc, char **ar
cc->sector_shift = __ffs(cc->sector_size) - 
SECTOR_SHIFT;
} else if (!strcasecmp(opt_string, "iv_large_sectors"))
set_bit(CRYPT_IV_LARGE_SECTORS, >cipher_flags);
+   else if (!strcasecmp(opt_string, "allow_encrypt_override"))
+   set_bit(DM_CRYPT_ENCRYPT_OVERRIDE, >flags);
else {
ti->error = "Invalid feature arguments";
return -EINVAL;
@@ -2770,12 +2773,15 @@ static int crypt_map(struct dm_target *ti, struct bio 
*bio)
struct crypt_config *cc = ti->private;
 
/*
-* If bio is REQ_PREFLUSH or REQ_OP_DISCARD, just bypass crypt queues.
+* If bio is REQ_PREFLUSH, REQ_NOENCRYPT, or REQ_OP_DISCARD,
+* just bypass crypt queues.
 * - for REQ_PREFLUSH device-mapper core ensures that no IO is in-flight
 * - for REQ_OP_DISCARD caller must use flush if IO ordering matters
 */
-   if (unlikely(bio->bi_opf & REQ_PREFLUSH ||
-   bio_op(bio) == REQ_OP_DISCARD)) {
+   if (unlikely(bio->bi_opf & REQ_PREFLUSH) ||
+   (unlikely(bio->bi_opf & REQ_NOENCRYPT) &&
+test_bit(DM_CRYPT_ENCRYPT_OVERRIDE, >flags)) ||
+   bio_op(bio) == REQ_OP_DISCARD) {
bio->bi_bdev = cc->dev->bdev;
if (bio_sectors(bio))
bio->bi_iter.bi_sector = cc->start +
@@ -2862,6 +2868,7 @@ static void crypt_status(struct dm_target *ti, 
status_type_t type,
num_feature_args += test_bit(DM_CRYPT_NO_OFFLOAD, >flags);
num_feature_args += cc->sector_size != (1 << SECTOR_SHIFT);
num_feature_args += test_bit(CRYPT_IV_LARGE_SECTORS, 
>cipher_flags);
+   num_feature_args += test_bit(DM_CRYPT_ENCRYPT_OVERRIDE, 
>flags);
if (cc->on_disk_tag_size)
num_feature_args++;
if (num_feature_args) {
@@ -2878,6 +2885,8 @@ static void crypt_status(struct dm_target *ti, 
status_type_t type,
DMEMIT(" sector_size:%d", cc->sector_size);
if (test_bit(CRYPT_IV_LARGE_SECTORS, >cipher_flags))
DMEMIT(" iv_large_sectors");
+   if (test_bit(DM_CRYPT_ENCRYPT_OVERRIDE, >flags))
+   DMEMIT(" allow_encrypt_override");
}
 
break;
-- 
2.13.1.508.gb3defc5cc-goog



[RFC PATCH 4/4] f2fs: Set the bio REQ_NOENCRYPT flag

2017-06-14 Thread Michael Halcrow
When lower layers such as dm-crypt observe the REQ_NOENCRYPT flag, it
helps the I/O stack avoid redundant encryption, improving performance
and power utilization.

Note that lower layers must be consistent in their observation of this
flag in order to avoid the possibility of data corruption.

Signed-off-by: Michael Halcrow 
---
 fs/f2fs/data.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 7c0f6bdf817d..2a000c0ec7e1 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -359,6 +359,7 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
bio_put(bio);
return -EFAULT;
}
+   fio->op_flags |= fio->encrypted_page ? REQ_NOENCRYPT : 0;
bio_set_op_attrs(bio, fio->op, fio->op_flags);
 
__submit_bio(fio->sbi, bio, fio->type);
@@ -384,6 +385,7 @@ int f2fs_submit_page_mbio(struct f2fs_io_info *fio)
verify_block_addr(sbi, fio->new_blkaddr);
 
bio_page = fio->encrypted_page ? fio->encrypted_page : fio->page;
+   fio->op_flags |= fio->encrypted_page ? REQ_NOENCRYPT : 0;
 
/* set submitted = 1 as a return value */
fio->submitted = 1;
@@ -1242,7 +1244,10 @@ static int f2fs_mpage_readpages(struct address_space 
*mapping,
bio = NULL;
goto set_error_page;
}
-   bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio_set_op_attrs(bio, REQ_OP_READ,
+(f2fs_encrypted_inode(inode) ?
+ REQ_NOENCRYPT :
+ 0));
}
 
if (bio_add_page(bio, page, blocksize, 0) < blocksize)
@@ -1914,7 +1919,8 @@ static int f2fs_write_begin(struct file *file, struct 
address_space *mapping,
err = PTR_ERR(bio);
goto fail;
}
-   bio->bi_opf = REQ_OP_READ;
+   bio->bi_opf = REQ_OP_READ |
+   (f2fs_encrypted_inode(inode) ? REQ_NOENCRYPT : 0);
if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) {
bio_put(bio);
err = -EFAULT;
-- 
2.13.1.508.gb3defc5cc-goog



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Andreas Dilger
On Jun 14, 2017, at 10:04 AM, Martin K. Petersen  
wrote:
> Christoph,
> 
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
> 
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).

Just to clarify, does this mean that Jens' "lifetime" hints are going to
be independent of separate "stream ID" values in the future (needing a
similar, but independent, set of fields for stream IDs, or will they
both be implemented using a common transport in the kernel (i.e. both
share a single set of fields in the inode/bio/etc?

I can imagine applications using either the lifetime hints (mapped to
stream IDs internally), or explicitly managing stream IDs, but it seems
unlikely that both would be in use at the same time, unless the "lifetime"
hints are intended to also indirectly map to HSM-like tiered cache/storage
hints internally to the storage stack (e.g. whether writes should be put
into local cachefs vs remote network storage, or bcache NVRAM vs. HDD)?

Jens' patches were moving in the direction of allowing up to 16 stream IDs,
with the default being IDs 1-4 are intended to reflect different data
lifetimes, but allowing up to 15 unique IDs to be specified "per device"*
if userspace wants to manage this more explicitly.

Martin, how many stream IDs do you think are needed?  If we need too many,
then we wouldn't be able to pack them into the existing pwritev2() flags
and would need another new API to specify the stream IDs.

[*] more work is needed to handle this on multiple filesystems, MDRAID/LVM, etc.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP


Re: [PATCH 01/11] block: add support for carrying stream information in a bio

2017-06-14 Thread Jens Axboe
On 06/14/2017 02:37 PM, Christoph Hellwig wrote:
> Btw, I think these could also easily map to DSM field in the NVMe
> write command, except that these unfortunately mix in read information
> as well.

But that's the problem, they are read/write mixed flags. I'd much
rather keep them separate. If some application finds it useful
to specify read access patterns, we should have separate flags for
those imho.

-- 
Jens Axboe



Re: [PATCH 11/11] nvme: add support for streams and directives

2017-06-14 Thread Jens Axboe
On 06/14/2017 02:32 PM, Christoph Hellwig wrote:
>> +static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
>> +  struct request *req)
>> +{
>> +unsigned int streamid = 0;
>> +
>> +if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
>> +!ns->nr_streams)
>> +return 0;
> 
> Might make more sense to do this check in the caller?

OK, will fix up.

>> +if (req->cmd_flags & REQ_WRITE_SHORT)
>> +streamid = 1;
>> +else if (req->cmd_flags & REQ_WRITE_MEDIUM)
>> +streamid = 2;
>> +else if (req->cmd_flags & REQ_WRITE_LONG)
>> +streamid = 3;
>> +else if (req->cmd_flags & REQ_WRITE_EXTREME)
>> +streamid = 4;
>> +
>> +if (streamid < BLK_MAX_STREAM)
> 
> Can happen per the index above.

True, that's a leftover from the previous version.

>> +req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
>> +
>> +return (streamid % (ns->nr_streams + 1));
> 
> Should we do smarted collapsing?  e.g. short + medium and long + extreme
> for two?  What for three?  Does one extra stream make sense in this
> scheme?

Collapsing is probably saner than round-robin. I'd tend to collapse on
the longer life time end of things, logically that would make the most
sense.

>> +dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
>> +"sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
>> +s.nsso, s.sws, s.sgs, s.nsa, s.nso);
> 
> Way to chatty.

Sure, that's mentioned in the changelog, that stuff will go.

>> +if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
>> +dev_info(ctrl->dev, "supports directives\n");
> 
> Same.  Use nvme-cli for that sort of info.

Ditto

>>  ctrl->npss = id->npss;
>>  prev_apsta = ctrl->apsta;
>>  if (ctrl->quirks & NVME_QUIRK_NO_APST) {
>> @@ -2060,6 +2201,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
>> unsigned nsid)
>>  goto out_free_id;
>>  }
>>  
>> +if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
>> +nvme_config_streams(ns);
> 
> This sets aside four streams on any device that supports them, and
> will probably kill performance on them unless you have a workload
> that actually uses those streams.  I think they need to be allocated
> lazily.

That's a good point, I have been thinking about how best to handle this.
I don't want an API for this, but doing it lazy would be fine. When we
see a write with a life time attached, kick off background setup of
streams. Until that's done, don't use any streams. Once setup, we'll
mark it as we currently do now.

How's that?

-- 
Jens Axboe



Re: [PATCH 01/11] block: add support for carrying stream information in a bio

2017-06-14 Thread Christoph Hellwig
Btw, I think these could also easily map to DSM field in the NVMe
write command, except that these unfortunately mix in read information
as well.

> + __REQ_WRITE_SHORT,  /* short life time write */

-> Frequent writes and infrequent reads to the LBA range indicated.

or

-> Frequent writes and frequent reads to the LBA range indicated.

> + __REQ_WRITE_MEDIUM, /* medium life time write */

-> Typical number of reads and writes expected for this LBA range.

> + __REQ_WRITE_LONG,   /* long life time write */

-> Infrequent writes and infrequent reads to the LBA range indicated.

or

-> Infrequent writes and frequent reads to the LBA range indicated.

> + __REQ_WRITE_EXTREME,/* extremely long life time write */

-> One time write. E.g. command is due to virus scan, backup, file
copy, or archive.


Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 02:26 PM, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 01:05:27PM -0600, Jens Axboe wrote:
>> Add four flags for the pwritev2(2) system call, allowing an application
>> to give the kernel a hint about what on-media life times can be
>> expected from a given write.
>>
>> The intent is for these values to be relative to each other, no
>> absolute meaning should be attached to these flag names.
>>
>> Define IOCB flags to carry this information over, and finally
>> transform them into the block defined stream values.
> 
> Note that we also have the RWF_NOWAIT flag pending, with which this
> would clash.  Maybe move it to the end of the space.

That's a trivial thing to resolve. Whomever goes last can resolve
that. I can put them at the end if you feel strongly about it, I
personally don't really care.

-- 
Jens Axboe



Re: [PATCH 03/11] fs: add support for an inode to carry stream related data

2017-06-14 Thread Jens Axboe
On 06/14/2017 02:25 PM, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 01:05:26PM -0600, Jens Axboe wrote:
>> No functional changes in this patch, just in preparation for
>> allowing applications to pass in hints about data life times
>> for writes.
>>
>> Pack the i_stream field into a 2-byte hole, so we don't grow
>> the size of the inode.
> 
> Can't we find space for 4 flags somewhere else?

Where? I can stuff them in i_flags? But apart from that, stealing
space in a hole seems the sanest.

-- 
Jens Axboe



Re: [PATCH 11/11] nvme: add support for streams and directives

2017-06-14 Thread Christoph Hellwig
> +static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
> +   struct request *req)
> +{
> + unsigned int streamid = 0;
> +
> + if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
> + !ns->nr_streams)
> + return 0;

Might make more sense to do this check in the caller?

> +
> + if (req->cmd_flags & REQ_WRITE_SHORT)
> + streamid = 1;
> + else if (req->cmd_flags & REQ_WRITE_MEDIUM)
> + streamid = 2;
> + else if (req->cmd_flags & REQ_WRITE_LONG)
> + streamid = 3;
> + else if (req->cmd_flags & REQ_WRITE_EXTREME)
> + streamid = 4;
> +
> + if (streamid < BLK_MAX_STREAM)

Can happen per the index above.

> + req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
> +
> + return (streamid % (ns->nr_streams + 1));

Should we do smarted collapsing?  e.g. short + medium and long + extreme
for two?  What for three?  Does one extra stream make sense in this
scheme?

> + dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
> + "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
> + s.nsso, s.sws, s.sgs, s.nsa, s.nso);

Way to chatty.

> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + dev_info(ctrl->dev, "supports directives\n");

Same.  Use nvme-cli for that sort of info.

>   ctrl->npss = id->npss;
>   prev_apsta = ctrl->apsta;
>   if (ctrl->quirks & NVME_QUIRK_NO_APST) {
> @@ -2060,6 +2201,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid)
>   goto out_free_id;
>   }
>  
> + if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
> + nvme_config_streams(ns);

This sets aside four streams on any device that supports them, and
will probably kill performance on them unless you have a workload
that actually uses those streams.  I think they need to be allocated
lazily.


Re: [PATCH 05/11] block: add helpers for setting/checking stream validity

2017-06-14 Thread Christoph Hellwig
> +static const unsigned int rwf_write_to_opf_flag[] = {
> + 0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
> +};



> +
> +/*
> + * 'stream_flags' is one of RWF_WRITE_LIFE_* values
> + */
> +void bio_set_streamid(struct bio *bio, unsigned int rwf_flags)
> +{
> + if (WARN_ON_ONCE(rwf_flags >= ARRAY_SIZE(rwf_write_to_opf_flag)))
> + return;
> +
> + bio->bi_opf |= rwf_write_to_opf_flag[rwf_flags];
> +}
> +EXPORT_SYMBOL_GPL(bio_set_streamid);

I'd move the bio->bi_opf assignment outÅŸ for a call like:

bio->bi_opf |= bio_op_write_bucket(flags);

and preferably move the flags masking / shifting into the helper as
well.

>  };
>  
> +static inline bool blk_stream_valid(unsigned int opf)
> +{
> + return (opf & REQ_WRITE_LIFE_MASK) != 0;
> +}

Replace the stream name here with lifetime or similar as well?


Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Christoph Hellwig
On Wed, Jun 14, 2017 at 01:05:27PM -0600, Jens Axboe wrote:
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
> 
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
> 
> Define IOCB flags to carry this information over, and finally
> transform them into the block defined stream values.

Note that we also have the RWF_NOWAIT flag pending, with which this
would clash.  Maybe move it to the end of the space.


Re: [PATCH 03/11] fs: add support for an inode to carry stream related data

2017-06-14 Thread Christoph Hellwig
On Wed, Jun 14, 2017 at 01:05:26PM -0600, Jens Axboe wrote:
> No functional changes in this patch, just in preparation for
> allowing applications to pass in hints about data life times
> for writes.
> 
> Pack the i_stream field into a 2-byte hole, so we don't grow
> the size of the inode.

Can't we find space for 4 flags somewhere else?


Re: [PATCH v2 1/6] block: Avoid that blk_exit_rl() triggers a use-after-free

2017-06-14 Thread Bart Van Assche
On 06/14/17 12:28, Jens Axboe wrote:
> I added this, but the above is really a horrible changelog. It doesn't
> say how the problem is fixed. I added some verbiage to that effect.

Hello Jens,

Thanks for having fixed up the changelog and for already having picked
up this patch. I was going to repost the patch and write a proper
changelog but apparently you were faster than I :-)

BTW, since last night I can finally receive and send e-mails with a
@wdc.com suffix (thirteen months after the acquisition closed).

Bart.



Re: [PATCH v2 1/6] block: Avoid that blk_exit_rl() triggers a use-after-free

2017-06-14 Thread Jens Axboe
On 06/14/2017 09:19 AM, Bart Van Assche wrote:
> Subject: [PATCH] block: Fix a blk_exit_rl() regression
> 
> Avoid that the following complaint is reported:
> 
>  BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
>  in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
>  1 lock held by rcuop/3/41:
>   #0:  (rcu_callback){..}, at: [] 
> rcu_nocb_kthread+0x282/0x500
>  Call Trace:
>   dump_stack+0x86/0xcf
>   ___might_sleep+0x174/0x260
>   __might_sleep+0x4a/0x80
>   flush_work+0x7e/0x2e0
>   __cancel_work_timer+0x143/0x1c0
>   cancel_work_sync+0x10/0x20
>   blk_throtl_exit+0x25/0x60
>   blkcg_exit_queue+0x35/0x40
>   blk_release_queue+0x42/0x130
>   kobject_put+0xa9/0x190

I added this, but the above is really a horrible changelog. It doesn't
say how the problem is fixed. I added some verbiage to that effect.

-- 
Jens Axboe



[PATCH 11/11] nvme: add support for streams and directives

2017-06-14 Thread Jens Axboe
This adds support for Directives in NVMe, particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data,
so that it the device can do so most effiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and
life time of the device.

We default to allocating 4 streams per name space, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set in a write, flag is as such before sending it to the device.

Some debug stuff in this patch, dumping streams ID params when
we load nvme.

Signed-off-by: Jens Axboe 
---
 drivers/nvme/host/core.c | 146 +++
 drivers/nvme/host/nvme.h |   1 +
 include/linux/nvme.h |  48 
 3 files changed, 195 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..a389d62a528b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if 
quirked off");
 
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per 
NS");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -331,9 +335,34 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
 }
 
+static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
+ struct request *req)
+{
+   unsigned int streamid = 0;
+
+   if (req_op(req) != REQ_OP_WRITE || !blk_stream_valid(req->cmd_flags) ||
+   !ns->nr_streams)
+   return 0;
+
+   if (req->cmd_flags & REQ_WRITE_SHORT)
+   streamid = 1;
+   else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+   streamid = 2;
+   else if (req->cmd_flags & REQ_WRITE_LONG)
+   streamid = 3;
+   else if (req->cmd_flags & REQ_WRITE_EXTREME)
+   streamid = 4;
+
+   if (streamid < BLK_MAX_STREAM)
+   req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
+
+   return (streamid % (ns->nr_streams + 1));
+}
+
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
+   unsigned int stream;
u16 control = 0;
u32 dsmgmt = 0;
 
@@ -351,6 +380,12 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, 
struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
+   stream = nvme_get_write_stream(ns, req);
+   if (stream) {
+   control |= NVME_RW_DTYPE_STREAMS;
+   dsmgmt |= (stream << 16);
+   }
+
if (ns->ms) {
switch (ns->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
@@ -1073,6 +1108,109 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return 0;
 }
 
+static int nvme_enable_streams(struct nvme_ns *ns)
+{
+   struct nvme_command c;
+
+   memset(, 0, sizeof(c));
+
+   c.directive.opcode = nvme_admin_directive_send;
+   c.directive.nsid = cpu_to_le32(ns->ns_id);
+   c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+   c.directive.dtype = NVME_DIR_IDENTIFY;
+   c.directive.tdtype = NVME_DIR_STREAMS;
+   c.directive.endir = NVME_DIR_ENDIR;
+
+   return nvme_submit_sync_cmd(ns->ctrl->admin_q, , NULL, 0);
+}
+
+static int nvme_streams_params(struct nvme_ns *ns)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct streams_directive_params s;
+   struct nvme_command c;
+   int ret;
+
+   memset(, 0, sizeof(c));
+   memset(, 0, sizeof(s));
+
+   c.directive.opcode = nvme_admin_directive_recv;
+   c.directive.nsid = cpu_to_le32(ns->ns_id);
+   c.directive.numd = sizeof(s);
+   c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+   c.directive.dtype = NVME_DIR_STREAMS;
+
+   ret = nvme_submit_sync_cmd(ctrl->admin_q, , , sizeof(s));
+   if (ret)
+   return ret;
+
+   s.msl = le16_to_cpu(s.msl);
+   s.nssa = le16_to_cpu(s.nssa);
+   s.nsso = le16_to_cpu(s.nsso);
+   s.sws = le32_to_cpu(s.sws);
+   s.sgs = le16_to_cpu(s.sgs);
+   s.nsa = le16_to_cpu(s.nsa);
+   s.nso = le16_to_cpu(s.nso);
+
+   dev_info(ctrl->device, "streams: msl=%u, nssa=%u, nsso=%u, sws=%u "
+   "sgs=%u, nsa=%u, nso=%u\n", s.msl, s.nssa,
+   s.nsso, s.sws, s.sgs, s.nsa, s.nso);
+   return 0;

[PATCH 05/11] block: add helpers for setting/checking stream validity

2017-06-14 Thread Jens Axboe
We map the RWF_WRITE_* life time flags to the internal flags.
Drivers can then, in turn, map those flags to a suitable stream
type.

Signed-off-by: Jens Axboe 
---
 block/bio.c   | 16 
 include/linux/bio.h   |  1 +
 include/linux/blk_types.h |  5 +
 3 files changed, 22 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..25ea7c365aac 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2082,6 +2082,22 @@ void bio_clone_blkcg_association(struct bio *dst, struct 
bio *src)
 
 #endif /* CONFIG_BLK_CGROUP */
 
+static const unsigned int rwf_write_to_opf_flag[] = {
+   0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
+};
+
+/*
+ * 'stream_flags' is one of RWF_WRITE_LIFE_* values
+ */
+void bio_set_streamid(struct bio *bio, unsigned int rwf_flags)
+{
+   if (WARN_ON_ONCE(rwf_flags >= ARRAY_SIZE(rwf_write_to_opf_flag)))
+   return;
+
+   bio->bi_opf |= rwf_write_to_opf_flag[rwf_flags];
+}
+EXPORT_SYMBOL_GPL(bio_set_streamid);
+
 static void __init biovec_init_slabs(void)
 {
int i;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..a1b3145020ad 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -443,6 +443,7 @@ extern struct bio *bio_copy_kern(struct request_queue *, 
void *, unsigned int,
 gfp_t, int);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
+extern void bio_set_streamid(struct bio *bio, unsigned int rwf_flags);
 
 void generic_start_io_acct(int rw, unsigned long sectors,
   struct hd_struct *part);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 57d1eb530799..06c8c35f0288 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,4 +323,9 @@ struct blk_rq_stat {
u64 batch;
 };
 
+static inline bool blk_stream_valid(unsigned int opf)
+{
+   return (opf & REQ_WRITE_LIFE_MASK) != 0;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4



[PATCH 08/11] ext4: add support for passing in stream information for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..033b5bfa4e0b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
  REQ_SYNC : 0;
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
+   bio_set_streamid(io->io_bio, inode_streamid(io->io_end->inode));
submit_bio(io->io_bio);
}
io->io_bio = NULL;
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
+   bio_set_streamid(io->io_bio, inode_streamid(inode));
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
-- 
2.7.4



[PATCH 07/11] fs: add support for buffered writeback to pass down stream information

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/buffer.c | 14 +-
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..8324c24751ca 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc);
+unsigned int stream, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_streamid(inode), wbc);
nr_underway++;
}
bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_streamid(inode), wbc);
nr_underway++;
}
bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc)
+unsigned int stream, struct writeback_control *wbc)
 {
struct bio *bio;
 
@@ -3130,6 +3132,8 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
/* Take care of bh's that straddle the end of the device */
guard_bio_eod(op, bio);
 
+   bio_set_streamid(bio, stream);
+
if (buffer_meta(bh))
op_flags |= REQ_META;
if (buffer_prio(bh))
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-   return submit_bh_wbc(op, op_flags, bh, NULL);
+   return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..a9d40c0c053e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct 
writeback_control *wbc,
goto confused;
 
wbc_init_bio(wbc, bio);
+   bio_set_streamid(bio, inode_streamid(inode));
}
 
/*
-- 
2.7.4



[PATCH 06/11] fs: add O_DIRECT support for sending down bio stream information

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..31ba4a8f0a28 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
iov_iter *iter,
should_dirty = true;
} else {
bio.bi_opf = dio_bio_write_op(iocb);
+   bio_set_streamid(, iocb_streamid(iocb));
task_io_account_write(ret);
}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio_set_pages_dirty(bio);
} else {
bio->bi_opf = dio_bio_write_op(iocb);
+   bio_set_streamid(bio, iocb_streamid(iocb));
task_io_account_write(bio->bi_iter.bi_size);
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..a770e82bb9f8 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
 
+   bio_set_streamid(bio, iocb_streamid(dio->iocb));
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..fa7d29632fbc 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
 
if (dio->flags & IOMAP_DIO_WRITE) {
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | 
REQ_IDLE);
+   bio_set_streamid(bio, inode_streamid(inode));
task_io_account_write(bio->bi_iter.bi_size);
} else {
bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4



[PATCH 10/11] btrfs: add support for passing in stream information for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..b245085e8f10 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2827,6 +2827,7 @@ static int submit_extent_page(int op, int op_flags, 
struct extent_io_tree *tree,
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
bio_set_op_attrs(bio, op, op_flags);
+   bio_set_streamid(bio, inode_streamid(page->mapping->host));
if (wbc) {
wbc_init_bio(wbc, bio);
wbc_account_io(wbc, page, page_size);
-- 
2.7.4



[PATCH 09/11] xfs: add support for passing in stream information for buffered writes

2017-06-14 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..9770be0140ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
return status;
}
 
+   bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
submit_bio(ioend->io_bio);
return 0;
 }
@@ -564,6 +565,7 @@ xfs_chain_bio(
bio_chain(ioend->io_bio, new);
bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+   bio_set_streamid(ioend->io_bio, inode_streamid(ioend->io_inode));
submit_bio(ioend->io_bio);
ioend->io_bio = new;
 }
-- 
2.7.4



[PATCHSET v3] Add support for write life time hints

2017-06-14 Thread Jens Axboe
A new iteration of this patchset, previously known as write streams.
As before, this patchset aims at enabling applications split up
writes into separate streams, based on the perceived life time
of the data written. This is useful for a variety of reasons:

- With NVMe 1.3 compliant devices, the device can expose multiple
  streams. Separating data written into streams based on life time
  can drastically reduce the write amplification. This helps device
  endurance, and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25%
  reduction in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values anymore.
I've previously advocated for just doing a set of hints that makes sense
instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We define 4 flags for the pwritev2
system call:

RWF_WRITE_LIFE_SHORTData written with this flag is expected to have
a high overwrite rate, or life time.

RWF_WRITE_LIFE_MEDIUM   Longer life time than SHORT

RWF_WRITE_LIFE_LONG Longer life time than MEDIUM

RWF_WRITE_LIFE_EXTREME  Longer life time than LONG

The idea is that these are relative values, so an application can
use them as they see fit. The underlying device can then place
data appropriately, or be free to ignore the hint. It's just a hint.

A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.3

Changes since v2:

- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:

- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans it
  up in general.
- Kill the block internal definitions of the stream type, we don't need
  them anymore. See above.

-- 
Jens Axboe




[PATCH 02/11] blk-mq: expose stream write stats through debugfs

2017-06-14 Thread Jens Axboe
Useful to verify that things are working the way they should.
Reading the file will return number of kb written to each
stream. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->stream_writes[] if they handle a stream.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 block/blk-mq-debugfs.c | 24 
 include/linux/blkdev.h |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..0a37c848961d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct 
blk_rq_stat *stat)
}
 }
 
+static int queue_streams_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_STREAM; i++)
+   seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
+
+   return 0;
+}
+
+static ssize_t queue_streams_store(void *data, const char __user *buf,
+  size_t count, loff_t *ppos)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_STREAM; i++)
+   q->stream_writes[i] = 0;
+
+   return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
{"state", 0600, queue_state_show, queue_state_write},
+   {"streams", 0600, queue_streams_show, queue_streams_store},
{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..88719c6f3edf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,9 @@ struct request_queue {
 
size_t  cmd_size;
void*rq_alloc_data;
+
+#define BLK_MAX_STREAM 5
+   u64 stream_writes[BLK_MAX_STREAM];
 };
 
 #define QUEUE_FLAG_QUEUED  1   /* uses generic tag queueing */
-- 
2.7.4



[PATCH 03/11] fs: add support for an inode to carry stream related data

2017-06-14 Thread Jens Axboe
No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes.

Pack the i_stream field into a 2-byte hole, so we don't grow
the size of the inode.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/inode.c | 1 +
 include/linux/fs.h | 9 +
 2 files changed, 10 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..5717cb378fda 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -149,6 +149,7 @@ int inode_init_always(struct super_block *sb, struct inode 
*inode)
inode->i_blocks = 0;
inode->i_bytes = 0;
inode->i_generation = 0;
+   inode->i_stream = 0;
inode->i_pipe = NULL;
inode->i_bdev = NULL;
inode->i_cdev = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..771e172d23d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -591,6 +591,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_stream;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
@@ -655,6 +656,14 @@ struct inode {
void*i_private; /* fs or device private pointer */
 };
 
+static inline unsigned int inode_streamid(struct inode *inode)
+{
+   if (inode)
+   return inode->i_stream;
+
+   return 0;
+}
+
 static inline unsigned int i_blocksize(const struct inode *node)
 {
return (1 << node->i_blkbits);
-- 
2.7.4



[PATCH 01/11] block: add support for carrying stream information in a bio

2017-06-14 Thread Jens Axboe
No functional changes in this patch, we just add four flags
that will be used to denote a stream type, and ensure that we
don't merge across different stream types.

Signed-off-by: Jens Axboe 
---
 block/blk-merge.c | 16 
 include/linux/blk_types.h | 11 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue 
*q,
return NULL;
 
/*
+* Don't allow merge of different streams, or for a stream with
+* non-stream IO.
+*/
+   if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+   return NULL;
+
+   /*
 * If we are allowed to merge, then append bio list
 * from next to rq and release next. merge_requests_fn
 * will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
!blk_write_same_mergeable(rq->bio, bio))
return false;
 
+   /*
+* Don't allow merge of different streams, or for a stream with
+* non-stream IO.
+*/
+   if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+   return false;
+
return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..57d1eb530799 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -201,6 +201,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+   __REQ_WRITE_SHORT,  /* short life time write */
+   __REQ_WRITE_MEDIUM, /* medium life time write */
+   __REQ_WRITE_LONG,   /* long life time write */
+   __REQ_WRITE_EXTREME,/* extremely long life time write */
 
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP,  /* do not free blocks when zeroing */
@@ -221,6 +225,13 @@ enum req_flag_bits {
 #define REQ_PREFLUSH   (1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT(1ULL << __REQ_WRITE_SHORT)
+#define REQ_WRITE_MEDIUM   (1ULL << __REQ_WRITE_MEDIUM)
+#define REQ_WRITE_LONG (1ULL << __REQ_WRITE_LONG)
+#define REQ_WRITE_EXTREME  (1ULL << __REQ_WRITE_EXTREME)
+
+#define REQ_WRITE_LIFE_MASK(REQ_WRITE_SHORT | REQ_WRITE_MEDIUM | \
+   REQ_WRITE_LONG | REQ_WRITE_EXTREME)
 
 #define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
 
-- 
2.7.4



[PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Jens Axboe
Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Define IOCB flags to carry this information over, and finally
transform them into the block defined stream values.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/read_write.c |  9 -
 include/linux/fs.h  | 12 
 include/uapi/linux/fs.h | 10 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..a9b0c3125e0f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
return -EOPNOTSUPP;
 
init_sync_kiocb(, filp);
@@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
kiocb.ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   if (flags & RWF_WRITE_LIFE_MASK) {
+   struct inode *inode = file_inode(filp);
+
+   inode->i_stream = (flags & RWF_WRITE_LIFE_MASK) >>
+   RWF_WRITE_LIFE_SHIFT;
+   kiocb.ki_flags |= inode->i_stream << IOCB_WRITE_LIFE_SHIFT;
+   }
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 771e172d23d7..751a1046e87b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
 
+/*
+ * Steal 4-bits for stream information, this allows 16 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT  7
+#define IOCB_WRITE_LIFE_MASK   (BIT(7) | BIT(8) | BIT(9) | BIT(10))
+
 struct kiocb {
struct file *ki_filp;
loff_t  ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
};
 }
 
+static inline int iocb_streamid(const struct kiocb *iocb)
+{
+   return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+   IOCB_WRITE_LIFE_SHIFT;
+}
+
 /*
  * "descriptor" for what we're up to with a read.
  * This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..58b7ee06b380 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -361,4 +361,14 @@ struct fscrypt_key {
 #define RWF_DSYNC  0x0002 /* per-IO O_DSYNC */
 #define RWF_SYNC   0x0004 /* per-IO O_SYNC */
 
+/*
+ * Data life time write flags, steal 4 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT   4
+#define RWF_WRITE_LIFE_MASK0x00f0 /* 4 bits of stream ID */
+#define RWF_WRITE_LIFE_SHORT   (1 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM  (2 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG(3 << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME (4 << RWF_WRITE_LIFE_SHIFT)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.7.4



Re: [PATCH v2 1/6] block: Avoid that blk_exit_rl() triggers a use-after-free

2017-06-14 Thread Ross Zwisler
On Wed, Jun 14, 2017 at 9:19 AM, Bart Van Assche
 wrote:
> On 06/13/17 10:54, Ross Zwisler wrote:
>> This commit is causing the following kernel BUG for me when I shut
>> down my systems:
>>
>>   BUG: sleeping function called from invalid context at 
>> kernel/workqueue.c:2790
>>   in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
>
> Thanks Ross for the testing and for the report. Can you check whether
> the patch below is sufficient to fix this?
>
>
> Subject: [PATCH] block: Fix a blk_exit_rl() regression
>
> Avoid that the following complaint is reported:
>
>  BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
>  in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
>  1 lock held by rcuop/3/41:
>   #0:  (rcu_callback){..}, at: [] 
> rcu_nocb_kthread+0x282/0x500
>  Call Trace:
>   dump_stack+0x86/0xcf
>   ___might_sleep+0x174/0x260
>   __might_sleep+0x4a/0x80
>   flush_work+0x7e/0x2e0
>   __cancel_work_timer+0x143/0x1c0
>   cancel_work_sync+0x10/0x20
>   blk_throtl_exit+0x25/0x60
>   blkcg_exit_queue+0x35/0x40
>   blk_release_queue+0x42/0x130
>   kobject_put+0xa9/0x190
>
> Reported-by: Ross Zwisler 
> Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a 
> use-after-free")
> Signed-off-by: Bart Van Assche 
> Cc: Ross Zwisler 

Yep, this solves the issue for me, thanks!

Tested-by: Ross Zwisler 


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 10:04 AM, Martin K. Petersen wrote:
> 
> Christoph,
> 
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
> 
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).

OK, diff on top of the current stuff, so you can see how that changes
things. If this looks good to folks, I'll update the series to achieve
the same final result.

diff --git a/block/bio.c b/block/bio.c
index 77f4be1f..25ea7c3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -595,7 +595,6 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
-   bio->bi_stream = bio_src->bi_stream;
 
bio_clone_blkcg_association(bio, bio_src);
 }
@@ -679,7 +678,6 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
gfp_mask,
bio->bi_opf = bio_src->bi_opf;
bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
-   bio->bi_stream  = bio_src->bi_stream;
 
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
@@ -2084,6 +2082,22 @@ void bio_clone_blkcg_association(struct bio *dst, struct 
bio *src)
 
 #endif /* CONFIG_BLK_CGROUP */
 
+static const unsigned int rwf_write_to_opf_flag[] = {
+   0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
+};
+
+/*
+ * 'stream_flags' is one of RWF_WRITE_LIFE_* values
+ */
+void bio_set_streamid(struct bio *bio, unsigned int rwf_flags)
+{
+   if (WARN_ON_ONCE(rwf_flags >= ARRAY_SIZE(rwf_write_to_opf_flag)))
+   return;
+
+   bio->bi_opf |= rwf_write_to_opf_flag[rwf_flags];
+}
+EXPORT_SYMBOL_GPL(bio_set_streamid);
+
 static void __init biovec_init_slabs(void)
 {
int i;
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f4a206..a7421b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2057,12 +2057,6 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (bio_op(bio) == REQ_OP_WRITE &&
-   bio_stream(bio) < BLK_MAX_STREAM) {
-   q->stream_writes[bio_stream(bio)] +=
-   bio->bi_iter.bi_size >> 9;
-   }
-
if (likely(blk_queue_enter(q, false) == 0)) {
struct bio_list lower, same;
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 28998ac..7d299df 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -696,7 +696,8 @@ static struct request *attempt_merge(struct request_queue 
*q,
 * Don't allow merge of different streams, or for a stream with
 * non-stream IO.
 */
-   if (req->bio->bi_stream != next->bio->bi_stream)
+   if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (next->cmd_flags & REQ_WRITE_LIFE_MASK))
return NULL;
 
/*
@@ -822,7 +823,8 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 * Don't allow merge of different streams, or for a stream with
 * non-stream IO.
 */
-   if (rq->bio->bi_stream != bio->bi_stream)
+   if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+   (bio->bi_opf & REQ_WRITE_LIFE_MASK))
return false;
 
return true;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d7cbd05..8988133 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -335,6 +335,20 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
return BLK_MQ_RQ_QUEUE_OK;
 }
 
+static inline unsigned int req_to_streamid(struct request *req)
+{
+   if (req->cmd_flags & REQ_WRITE_SHORT)
+   return 1;
+   else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+   return 2;
+   else if (req->cmd_flags & REQ_WRITE_LONG)
+   return 3;
+   else if (req->cmd_flags & REQ_WRITE_EXTREME)
+   return 4;
+
+   return 0;
+}
+
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -355,13 +369,15 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, 
struct request *req,
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
-   if (req_op(req) == REQ_OP_WRITE) {
-   if (bio_stream_valid(req->bio) && ns->nr_streams) {
-   unsigned stream = bio_stream(req->bio) & 0x;
+   if (req_op(req) == REQ_OP_WRITE && blk_stream_valid(req->cmd_flags) &&
+   ns->nr_streams) {
+

Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 10:04 AM, Martin K. Petersen wrote:
> 
> Christoph,
> 
>> I think what Martin wants (or at least what I'd want him to want) is
>> to define a few REQ_* bits that mirror the RWF bits, use that to
>> transfer the information down the stack, and then only translate it
>> to stream ids in the driver.
> 
> Yup. If we have enough space in the existing flags that's perfect (I
> lost count after your op/flag shuffle).

There are enough flags, I just implemented it. I'll run some testing
and rebase everything, then send out a v3...

-- 
Jens Axboe



[PATCH V2 00/12]blktrace: output cgroup info

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Hi,

Currently blktrace isn't cgroup aware. blktrace prints out task name of current
context, but the task of current context isn't always in the cgroup where the
BIO comes from. We can't use task name to find out IO cgroup. For example,
Writeback BIOs always comes from flusher thread but the BIOs are for different
blk cgroups. Request could be requeued and dispatched from completely different
tasks. MD/DM are another examples. This brings challenges if we want to use
blktrace for performance tunning with cgroup enabled.

This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
use a BPF program to filter out blktrace for a specific cgroup.

The first 6 patches adds export operation handlers for kernfs, so userspace can
use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
blktrace output cgroup info.

Note, we export 64-bit inode number and 32-bit generation number for fhandle.
Currently kernfs only supports 32-bit inode number actually because idr only
supports 32-bit allocation. We had plan to support 64-bit inode number soon, as
Tejun has concerns the 32-bit inode/generation could wrap easily. This patchset
hasn't converted inode number to 64-bit yet.

Thanks,
Shaohua

V1 -> V2:
- Fix a bug in cgroup association
- Fix build errors reported by 0day
- Address some issues pointed out by Tejun

Shaohua Li (12):
  kernfs: implement i_generation
  kernfs: use idr instead of ida to manage inode number
  kernfs: add an API to get kernfs node from inode number
  kernfs: don't set dentry->d_fsdata
  kernfs: introduce kernfs_node_id
  kernfs: add exportfs operations
  cgroup: export fhandle info for a cgroup
  blktrace: export cgroup info in trace
  block: always attach cgroup info into bio
  block: call __bio_free in bio_endio
  blktrace: add an option to allow displying cgroup path
  block: use standard blktrace API to output cgroup info for debug notes

 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |   2 +-
 block/bfq-iosched.h  |  13 +-
 block/bio-integrity.c|   1 +
 block/bio.c  |   2 +
 block/blk-throttle.c |  13 +-
 block/cfq-iosched.c  |  15 +-
 fs/kernfs/dir.c  | 101 +---
 fs/kernfs/file.c |  10 +-
 fs/kernfs/inode.c|   9 +-
 fs/kernfs/kernfs-internal.h  |   9 ++
 fs/kernfs/mount.c| 144 +++--
 fs/kernfs/symlink.c  |   6 +-
 fs/sysfs/mount.c |   2 +-
 include/linux/blk-cgroup.h   |  17 +-
 include/linux/blktrace_api.h |  13 +-
 include/linux/cgroup.h   |  16 +-
 include/linux/exportfs.h |  11 ++
 include/linux/kernfs.h   |  36 -
 include/uapi/linux/blktrace_api.h|   3 +
 kernel/cgroup/cgroup.c   |  15 +-
 kernel/trace/blktrace.c  | 259 ++-
 21 files changed, 523 insertions(+), 174 deletions(-)

-- 
2.9.3



[PATCH V2 04/12] kernfs: don't set dentry->d_fsdata

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c | 25 +
 fs/kernfs/file.c|  6 +++---
 fs/kernfs/inode.c   |  6 +++---
 fs/kernfs/kernfs-internal.h |  7 +++
 fs/kernfs/mount.c   |  8 ++--
 fs/kernfs/symlink.c |  6 +++---
 6 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 646b56b..e54fdd9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -566,7 +566,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
if (d_really_is_negative(dentry))
goto out_bad_unlocked;
 
-   kn = dentry->d_fsdata;
+   kn = kernfs_dentry_node(dentry);
mutex_lock(_mutex);
 
/* The kernfs node has been deactivated */
@@ -574,7 +574,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
goto out_bad;
 
/* The kernfs node has been moved? */
-   if (dentry->d_parent->d_fsdata != kn->parent)
+   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
goto out_bad;
 
/* The kernfs node has been renamed */
@@ -594,14 +594,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
return 0;
 }
 
-static void kernfs_dop_release(struct dentry *dentry)
-{
-   kernfs_put(dentry->d_fsdata);
-}
-
 const struct dentry_operations kernfs_dops = {
.d_revalidate   = kernfs_dop_revalidate,
-   .d_release  = kernfs_dop_release,
 };
 
 /**
@@ -617,8 +611,9 @@ const struct dentry_operations kernfs_dops = {
  */
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 {
-   if (dentry->d_sb->s_op == _sops)
-   return dentry->d_fsdata;
+   if (dentry->d_sb->s_op == _sops &&
+   !d_really_is_negative(dentry))
+   return kernfs_dentry_node(dentry);
return NULL;
 }
 
@@ -1046,7 +1041,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
unsigned int flags)
 {
struct dentry *ret;
-   struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+   struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode;
const void *ns = NULL;
@@ -1063,8 +1058,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
ret = NULL;
goto out_unlock;
}
-   kernfs_get(kn);
-   dentry->d_fsdata = kn;
 
/* attach dentry and inode */
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1101,7 +1094,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct 
dentry *dentry,
 
 static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
 {
-   struct kernfs_node *kn  = dentry->d_fsdata;
+   struct kernfs_node *kn  = kernfs_dentry_node(dentry);
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
 
@@ -1121,7 +1114,7 @@ static int kernfs_iop_rename(struct inode *old_dir, 
struct dentry *old_dentry,
 struct inode *new_dir, struct dentry *new_dentry,
 unsigned int flags)
 {
-   struct kernfs_node *kn  = old_dentry->d_fsdata;
+   struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
struct kernfs_node *new_parent = new_dir->i_private;
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
@@ -1634,7 +1627,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void 
*ns,
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
struct dentry *dentry = file->f_path.dentry;
-   struct kernfs_node *parent = dentry->d_fsdata;
+   struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
const void *ns = NULL;
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
 {
-   struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+   struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
const struct kernfs_ops *ops;
struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
 
 static int kernfs_fop_release(struct 

[PATCH V2 07/12] cgroup: export fhandle info for a cgroup

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', there are unrequired info. Sepcifically, cgroup is always
a directory, so we don't need a 'FILEID_KERNFS_WITH_PARENT' type
fhandle, we only need export the inode number and generation number.
Since the first part of 'kernfs_fid' and 'kernfs_node_id' have the same
layout, we can just export a 'kernfs_node_id'.

Signed-off-by: Shaohua Li 
---
 include/linux/cgroup.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30c6877..5ebe89f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,6 +609,10 @@ static inline void cgroup_kthread_ready(void)
current->no_cgroup_migration = 0;
 }
 
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+   return >kn->id;
+}
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -631,6 +635,10 @@ static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_init_kthreadd(void) {}
 static inline void cgroup_kthread_ready(void) {}
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+   return NULL;
+}
 
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
   struct cgroup *ancestor)
-- 
2.9.3



[PATCH V2 03/12] kernfs: add an API to get kernfs node from inode number

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

To make the API lock free, kernfs node is freed in RCU context. And we
depend on kernfs_node count/ino number to filter stale kernfs nodes.

Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c | 53 +
 fs/kernfs/kernfs-internal.h |  2 ++
 fs/kernfs/mount.c   |  4 +++-
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e8545a..646b56b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -508,6 +508,10 @@ void kernfs_put(struct kernfs_node *kn)
struct kernfs_node *parent;
struct kernfs_root *root;
 
+   /*
+* kernfs_node is freed with ->count 0, kernfs_find_and_get_node_by_ino
+* depends on this to filter reused stale node
+*/
if (!kn || !atomic_dec_and_test(>count))
return;
root = kernfs_root(kn);
@@ -643,6 +647,7 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
kn->ino = ret;
kn->generation = atomic_inc_return(>next_generation);
 
+   /* set ino first. Above atomic_inc_return has a barrier */
atomic_set(>count, 1);
atomic_set(>active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(>rb);
@@ -674,6 +679,54 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node 
*parent,
return kn;
 }
 
+/*
+ * kernfs_find_and_get_node_by_ino - get kernfs_node from inode number
+ * @root: the kernfs root
+ * @ino: inode number
+ *
+ * RETURNS:
+ * NULL on failure. Return a kernfs node with reference counter incremented
+ */
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+   unsigned int ino)
+{
+   struct kernfs_node *kn;
+
+   rcu_read_lock();
+   kn = idr_find(>ino_idr, ino);
+   if (!kn)
+   goto out;
+
+   /*
+* Since kernfs_node is freed in RCU, it's possible an old node for ino
+* is freed, but reused before RCU grace period. But a freed node (see
+* kernfs_put) or an incompletedly initialized node (see
+* __kernfs_new_node) should have 'count' 0. We can use this fact to
+* filter out such node.
+*/
+   if (!atomic_inc_not_zero(>count)) {
+   kn = NULL;
+   goto out;
+   }
+
+   /*
+* The node could be a new node or a reused node. If it's a new node,
+* we are ok. If it's reused because of RCU, the __kernfs_new_node
+* always sets its 'ino' before 'count'. So if 'count' is uptodate,
+* 'ino' should be uptodate, hence we can use 'ino' to filter stale
+* node.
+*/
+   if (kn->ino != ino)
+   goto out;
+   rcu_read_unlock();
+
+   return kn;
+out:
+   rcu_read_unlock();
+   kernfs_put(kn);
+   return NULL;
+}
+
 /**
  * kernfs_add_one - add kernfs_node to parent without warning
  * @kn: kernfs_node to be added
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2d5144a..e9c226f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
unsigned flags);
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+   unsigned int ino);
 
 /*
  * file.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d5b149a..343dfeb 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -332,5 +332,7 @@ void __init kernfs_init(void)
 {
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
  sizeof(struct kernfs_node),
- 0, SLAB_PANIC, NULL);
+ 0,
+ SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
+ NULL);
 }
-- 
2.9.3



[PATCH V2 09/12] block: always attach cgroup info into bio

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

blkcg_bio_issue_check() already gets blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no point we don't attach the cgroup info into bio at
blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup
info.

Signed-off-by: Shaohua Li 
---
 block/blk-throttle.c   | 7 +--
 include/linux/blk-cgroup.h | 3 +++
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..a6ebd2b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2104,14 +2104,9 @@ static inline void throtl_update_latency_buckets(struct 
throtl_data *td)
 static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
 {
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
-   int ret;
-
-   ret = bio_associate_current(bio);
-   if (ret == 0 || ret == -EBUSY)
+   if (bio->bi_css)
bio->bi_cg_private = tg;
blk_stat_set_issue(>bi_issue_stat, bio_sectors(bio));
-#else
-   bio_associate_current(bio);
 #endif
 }
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7..fe10b85 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,9 @@ static inline bool blkcg_bio_issue_check(struct 
request_queue *q,
rcu_read_lock();
blkcg = bio_blkcg(bio);
 
+   /* associate blkcg if bio hasn't attached one */
+   bio_associate_blkcg(bio, >css);
+
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
-- 
2.9.3



[PATCH V2 02/12] kernfs: use idr instead of ida to manage inode number

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c| 17 -
 include/linux/kernfs.h |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 09d093e..8e8545a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
 DEFINE_MUTEX(kernfs_mutex);
 static DEFINE_SPINLOCK(kernfs_rename_lock);/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];  /* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock);   /* root->ino_idr */
 
 #define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
 
@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
simple_xattrs_free(>iattr->xattrs);
}
kfree(kn->iattr);
-   ida_simple_remove(>ino_ida, kn->ino);
+   spin_lock(_idr_lock);
+   idr_remove(>ino_idr, kn->ino);
+   spin_unlock(_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);
 
kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
goto repeat;
} else {
/* just released the root kn, free @root too */
-   ida_destroy(>ino_ida);
+   idr_destroy(>ino_idr);
kfree(root);
}
 }
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
if (!kn)
goto err_out1;
 
-   ret = ida_simple_get(>ino_ida, 1, 0, GFP_KERNEL);
+   idr_preload(GFP_KERNEL);
+   spin_lock(_idr_lock);
+   ret = idr_alloc(>ino_idr, kn, 1, 0, GFP_ATOMIC);
+   spin_unlock(_idr_lock);
+   idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
@@ -876,14 +883,14 @@ struct kernfs_root *kernfs_create_root(struct 
kernfs_syscall_ops *scops,
if (!root)
return ERR_PTR(-ENOMEM);
 
-   ida_init(>ino_ida);
+   idr_init(>ino_idr);
INIT_LIST_HEAD(>supers);
atomic_set(>next_generation, 0);
 
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
   KERNFS_DIR);
if (!kn) {
-   ida_destroy(>ino_ida);
+   idr_destroy(>ino_idr);
kfree(root);
return ERR_PTR(-ENOMEM);
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c5f0fa7..61668d1 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -164,7 +164,7 @@ struct kernfs_root {
unsigned intflags;  /* KERNFS_ROOT_* flags */
 
/* private fields, do not use outside kernfs proper */
-   struct ida  ino_ida;
+   struct idr  ino_idr;
struct kernfs_syscall_ops *syscall_ops;
 
/* list of kernfs_super_info of this root, protected by kernfs_mutex */
-- 
2.9.3



[PATCH V2 12/12] block: use standard blktrace API to output cgroup info for debug notes

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
way. Now we have standard blktrace API for this, so convert them to use
it.

Note, this changes the behavior a little bit. cgroup info isn't output
by default, we only do this with 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default too, we only do this with
'blk_cgname' option enabled. Also cgroup info is output in different
position of the note string. I think these behavior changes aren't a big
issue (actually we make trace data shorter which is good), since the
blktrace note is solely for debugging.

Signed-off-by: Shaohua Li 
---
 block/bfq-iosched.h  | 13 -
 block/blk-throttle.c |  6 ++
 block/cfq-iosched.c  | 15 ++-
 include/linux/blk-cgroup.h   | 14 --
 include/linux/blktrace_api.h | 13 +
 kernel/trace/blktrace.c  | 12 ++--
 6 files changed, 35 insertions(+), 38 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 5c3bf98..ea6a6eb 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -916,13 +916,16 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct 
bfq_queue *bfqq);
 struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
 
 #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do {\
-   blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
-   bfq_bfqq_sync((bfqq)) ? 'S' : 'A',  \
-   bfqq_group(bfqq)->blkg_path, ##args);   \
+   blk_add_cgroup_trace_msg((bfqd)->queue, \
+   bfqg_to_blkg(bfqq_group(bfqq))->blkcg,  \
+   "bfq%d%c " fmt, (bfqq)->pid,\
+   bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
 } while (0)
 
-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) \
-   blk_add_trace_msg((bfqd)->queue, "%s " fmt, (bfqg)->blkg_path, ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {\
+   blk_add_cgroup_trace_msg((bfqd)->queue, \
+   bfqg_to_blkg(bfqg)->blkcg, fmt, ##args);\
+} while (0)
 
 #else /* CONFIG_BFQ_GROUP_IOSCHED */
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6ebd2b..6a4c4c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -373,10 +373,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, 
int rw)
if (likely(!blk_trace_note_message_enabled(__td->queue)))   \
break;  \
if ((__tg)) {   \
-   char __pbuf[128];   \
-   \
-   blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf));\
-   blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, 
##args); \
+   blk_add_cgroup_trace_msg(__td->queue,   \
+   tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
} else {\
blk_add_trace_msg(__td->queue, "throtl " fmt, ##args);  \
}   \
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b7e9c7f..6df342f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -656,20 +656,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...) do {\
-   char __pbuf[128];   \
-   \
-   blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));  \
-   blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+   blk_add_cgroup_trace_msg((cfqd)->queue, \
+   cfqg_to_blkg((cfqq)->cfqg)->blkcg,  \
+   "cfq%d%c%c " fmt, (cfqq)->pid,  \
cfq_cfqq_sync((cfqq)) ? 'S' : 'A',  \
cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
- __pbuf, ##args);  \
+ ##args);  \
 } while (0)
 
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do {\
-   char __pbuf[128];   \
-   \
-   blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf));  \
-   blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args);\
+   blk_add_cgroup_trace_msg((cfqd)->queue, 

[PATCH V2 05/12] kernfs: introduce kernfs_node_id

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Please note, I extend inode number to 64 bits. idr actually doesn't
support 64-bit yet, so currently inode number is always 32-bit. Tejun
has concerns 32-bit inode and generation could wrap. We will need to fix
this later. The data structure will be exported out to userspace, so the
goal of the 64-bit data structure is to avoid change the data structure
if we make inode number to 64-bit later.

Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c| 10 +-
 fs/kernfs/file.c   |  4 ++--
 fs/kernfs/inode.c  |  4 ++--
 include/linux/cgroup.h |  2 +-
 include/linux/kernfs.h |  9 +++--
 5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index e54fdd9..7483bea 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -539,7 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
}
kfree(kn->iattr);
spin_lock(_idr_lock);
-   idr_remove(>ino_idr, kn->ino);
+   idr_remove(>ino_idr, kn->id.ino);
spin_unlock(_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);
 
@@ -639,8 +639,8 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
idr_preload_end();
if (ret < 0)
goto err_out2;
-   kn->ino = ret;
-   kn->generation = atomic_inc_return(>next_generation);
+   kn->id.ino = ret;
+   kn->id.generation = atomic_inc_return(>next_generation);
 
/* set ino first. Above atomic_inc_return has a barrier */
atomic_set(>count, 1);
@@ -711,7 +711,7 @@ struct kernfs_node *kernfs_find_and_get_node_by_ino(struct 
kernfs_root *root,
 * 'ino' should be uptodate, hence we can use 'ino' to filter stale
 * node.
 */
-   if (kn->ino != ino)
+   if (kn->id.ino != ino)
goto out;
rcu_read_unlock();
 
@@ -1644,7 +1644,7 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
-   ino_t ino = pos->ino;
+   ino_t ino = pos->id.ino;
 
ctx->pos = pos->hash;
file->private_data = pos;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7f90d4d..7441925 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -895,7 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
 * have the matching @file available.  Look up the inodes
 * and generate the events manually.
 */
-   inode = ilookup(info->sb, kn->ino);
+   inode = ilookup(info->sb, kn->id.ino);
if (!inode)
continue;
 
@@ -903,7 +903,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (parent) {
struct inode *p_inode;
 
-   p_inode = ilookup(info->sb, parent->ino);
+   p_inode = ilookup(info->sb, parent->id.ino);
if (p_inode) {
fsnotify(p_inode, FS_MODIFY | FS_EVENT_ON_CHILD,
 inode, FSNOTIFY_EVENT_INODE, kn->name, 
0);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 4c8b510..a343039 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,7 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, 
struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = _aops;
inode->i_op = _iops;
-   inode->i_generation = kn->generation;
+   inode->i_generation = kn->id.generation;
 
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
@@ -266,7 +266,7 @@ struct inode *kernfs_get_inode(struct super_block *sb, 
struct kernfs_node *kn)
 {
struct inode *inode;
 
-   inode = iget_locked(sb, kn->ino);
+   inode = iget_locked(sb, kn->id.ino);
if (inode && (inode->i_state & I_NEW))
kernfs_init_inode(kn, inode);
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005..30c6877 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,7 +543,7 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
 /* returns ino associated with a cgroup */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-   return cgrp->kn->ino;
+   return cgrp->kn->id.ino;
 }
 
 /* cft/css accessors for cftype->write() operation */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 61668d1..6be2d57 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -95,6 +95,12 @@ struct 

[PATCH V2 01/12] kernfs: implement i_generation

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Set i_generation for kernfs inode. This is required to implement exportfs
operations.

Note, the generation is 32-bit, so it's possible the generation wraps up
and we find stale files. The possiblity is low, since fhandle matches
both inode number and generation. In most fs, the generation is 32-bit.
fhandle only export 32-bit generation for most fs. So unless we have
solid reason, we'd live with the possible conflict.

Signed-off-by: Shaohua Li 
---
 fs/kernfs/dir.c| 2 ++
 fs/kernfs/inode.c  | 1 +
 include/linux/kernfs.h | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..09d093e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -634,6 +634,7 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
if (ret < 0)
goto err_out2;
kn->ino = ret;
+   kn->generation = atomic_inc_return(>next_generation);
 
atomic_set(>count, 1);
atomic_set(>active, KN_DEACTIVATED_BIAS);
@@ -877,6 +878,7 @@ struct kernfs_root *kernfs_create_root(struct 
kernfs_syscall_ops *scops,
 
ida_init(>ino_ida);
INIT_LIST_HEAD(>supers);
+   atomic_set(>next_generation, 0);
 
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
   KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, 
struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = _aops;
inode->i_op = _iops;
+   inode->i_generation = kn->generation;
 
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..c5f0fa7 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
umode_t mode;
unsigned intino;
struct kernfs_iattrs*iattr;
+   u32 generation;
 };
 
 /*
@@ -170,6 +171,7 @@ struct kernfs_root {
struct list_headsupers;
 
wait_queue_head_t   deactivate_waitq;
+   atomic_tnext_generation;
 };
 
 struct kernfs_open_file {
-- 
2.9.3



[PATCH V2 11/12] blktrace: add an option to allow displying cgroup path

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]

Signed-off-by: Shaohua Li 
---
 fs/kernfs/mount.c   | 19 +++
 include/linux/cgroup.h  |  6 ++
 include/linux/kernfs.h  |  2 ++
 kernel/cgroup/cgroup.c  | 12 
 kernel/trace/blktrace.c | 14 +-
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 5af73a8..3bbca9c 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -91,6 +91,25 @@ static int kernfs_encode_fh(struct inode *inode, __u32 *fh, 
int *max_len,
}
 }
 
+/*
+ * Similar like kernfs_fh_get_inode, this one gets kernfs node from inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+   const struct kernfs_node_id *id)
+{
+   struct kernfs_node *kn;
+
+   kn = kernfs_find_and_get_node_by_ino(root, id->ino);
+   if (!kn)
+   return NULL;
+   if (kn->id.generation != id->generation) {
+   kernfs_put(kn);
+   return NULL;
+   }
+   return kn;
+}
+
 static struct inode *kernfs_fh_get_inode(struct super_block *sb,
u64 ino, u32 generation)
 {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5ebe89f..163e5c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -613,6 +613,9 @@ static inline struct kernfs_node_id 
*cgroup_get_node_id(struct cgroup *cgrp)
 {
return >kn->id;
 }
+
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+   char *buf, size_t buflen);
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
@@ -645,6 +648,9 @@ static inline bool task_under_cgroup_hierarchy(struct 
task_struct *task,
 {
return true;
 }
+
+static inline void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+   char *buf, size_t buflen) {}
 #endif /* !CONFIG_CGROUPS */
 
 /*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3b38bdf..e8c685f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -358,6 +358,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, 
const void *ns);
 
 void kernfs_init(void);
 
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+   const struct kernfs_node_id *id);
 #else  /* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 639e27d..ae5ff1c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4606,6 +4606,18 @@ static int __init cgroup_wq_init(void)
 }
 core_initcall(cgroup_wq_init);
 
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+   char *buf, size_t buflen)
+{
+   struct kernfs_node *kn;
+
+   kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+   if (!kn)
+   return;
+   kernfs_path(kn, buf, buflen);
+   kernfs_put(kn);
+}
+
 /*
  * proc_cgroup_show()
  *  - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 02cfcf2..ad5abb9 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp 
DEFINE_SPINLOCK(running_trace_lock);
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC  0x1
 #define TRACE_BLK_OPT_CGROUP   0x2
+#define TRACE_BLK_OPT_CGNAME   0x4
 
 static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
 #ifdef CONFIG_BLK_CGROUP
{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+   { TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
 #endif
{ }
 };
@@ -1213,7 +1215,17 @@ static void blk_log_action(struct trace_iterator *iter, 
const char *act,
if (has_cg) {
const struct kernfs_node_id *id = cgid_start(iter->ent);
 
-   trace_seq_printf(>seq, "%3d,%-3d %llx,%-x %2s %3s ",
+   if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+   char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+   cgroup_path_from_node_id(id, blkcg_name_buf,
+   sizeof(blkcg_name_buf));
+   trace_seq_printf(>seq, "%3d,%-3d %s %2s %3s ",
+MAJOR(t->device), MINOR(t->device),
+blkcg_name_buf, act, rwbs);
+   } else
+   trace_seq_printf(>seq,
+  

[PATCH V2 06/12] kernfs: add exportfs operations

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.

Signed-off-by: Shaohua Li 
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |   2 +-
 fs/kernfs/mount.c| 113 ++-
 fs/sysfs/mount.c |   2 +-
 include/linux/exportfs.h |  11 +++
 include/linux/kernfs.h   |  23 +--
 kernel/cgroup/cgroup.c   |   3 +-
 6 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c 
b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f5af0cc..fee2126 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -854,7 +854,7 @@ static struct dentry *rdt_mount(struct file_system_type 
*fs_type,
}
 
dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
+ RDTGROUP_SUPER_MAGIC, NULL, false);
if (IS_ERR(dentry))
goto out_cdp;
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 462a40c..5af73a8 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kernfs-internal.h"
 
@@ -64,6 +65,107 @@ const struct super_operations kernfs_sops = {
.show_path  = kernfs_sop_show_path,
 };
 
+static int kernfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
+struct inode *parent)
+{
+   struct kernfs_fid *fid = (struct kernfs_fid *)fh;
+
+   if (parent && (*max_len) < KERNFS_FID_WITH_PARENT_LEN) {
+   *max_len = KERNFS_FID_WITH_PARENT_LEN;
+   return FILEID_INVALID;
+   } else if ((*max_len) < KERNFS_FID_WITHOUT_PARENT_LEN) {
+   *max_len = KERNFS_FID_WITHOUT_PARENT_LEN;
+   return FILEID_INVALID;
+   }
+
+   fid->ino = inode->i_ino;
+   fid->gen = inode->i_generation;
+   if (parent) {
+   fid->parent_ino = parent->i_ino;
+   fid->parent_gen = parent->i_generation;
+   *max_len = KERNFS_FID_WITH_PARENT_LEN;
+   return FILEID_KERNFS_WITH_PARENT;
+   } else {
+   *max_len = KERNFS_FID_WITHOUT_PARENT_LEN;
+   return FILEID_KERNFS_WITHOUT_PARENT;
+   }
+}
+
+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
+{
+   struct kernfs_super_info *info = kernfs_info(sb);
+   struct inode *inode;
+   struct kernfs_node *kn;
+
+   if (ino == 0)
+   return ERR_PTR(-ESTALE);
+
+   kn = kernfs_find_and_get_node_by_ino(info->root, ino);
+   if (!kn)
+   return ERR_PTR(-ESTALE);
+   inode = kernfs_get_inode(sb, kn);
+   kernfs_put(kn);
+   if (IS_ERR(inode))
+   return ERR_CAST(inode);
+
+   if (inode->i_generation != generation) {
+   /* we didn't find the right inode.. */
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
+   return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   struct kernfs_fid *kfid = (struct kernfs_fid *)fid;
+   struct inode *inode = NULL;
+
+   if (fh_len < KERNFS_FID_WITHOUT_PARENT_LEN)
+   return NULL;
+
+   switch (fh_type) {
+   case FILEID_KERNFS_WITHOUT_PARENT:
+   case FILEID_KERNFS_WITH_PARENT:
+   inode = kernfs_fh_get_inode(sb, kfid->ino, kfid->gen);
+   break;
+   }
+
+   return d_obtain_alias(inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid 
*fid,
+   int fh_len, int fh_type)
+{
+   struct kernfs_fid *kfid = (struct kernfs_fid *)fid;
+   struct inode *inode = NULL;
+
+   if (fh_len < KERNFS_FID_WITH_PARENT_LEN)
+   return NULL;
+
+   if (fh_type == FILEID_KERNFS_WITH_PARENT)
+   inode = kernfs_fh_get_inode(sb, kfid->parent_ino,
+   kfid->parent_gen);
+
+   return d_obtain_alias(inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+   struct kernfs_node *kn = kernfs_dentry_node(child);
+
+   return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+   .encode_fh  = kernfs_encode_fh,
+   .fh_to_dentry   = kernfs_fh_to_dentry,
+   .fh_to_parent   = kernfs_fh_to_parent,
+   .get_parent = kernfs_get_parent_dentry,
+};
+
 /**
  * kernfs_root_from_sb - 

[PATCH V2 08/12] blktrace: export cgroup info in trace

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

Currently blktrace isn't cgroup aware. blktrace prints out task name of
current context, but the task of current context isn't always in the
cgroup where the BIO comes from. We can't use task name to find out IO
cgroup. For example, Writeback BIOs always comes from flusher thread but
the BIOs are for different blk cgroups. Request could be requeued and
dispatched from completely different tasks. MD/DM are another examples.

This patch tries to fix the gap. We print out cgroup fhandle info in
blktrace. Userspace can use open_by_handle_at() syscall to find the
cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to
find fhandle for a cgroup and use a BPF program to filter out blktrace
for a specific cgroup.

We add a new 'blk_cgroup' trace option for blk tracer. It's default off.
Application which doesn't know the new option isn't affected.  When it's
on, we output fhandle info right after blk_io_trace with an extra bit
set in event action. So from application point of view, blktrace with
the option will output new actions.

I didn't change blk trace event yet, since I'm not sure if changing the
trace event output is an ABI issue. If not, I'll do it later.

Signed-off-by: Shaohua Li 
---
 include/uapi/linux/blktrace_api.h |   3 +
 kernel/trace/blktrace.c   | 231 ++
 2 files changed, 161 insertions(+), 73 deletions(-)

diff --git a/include/uapi/linux/blktrace_api.h 
b/include/uapi/linux/blktrace_api.h
index c590ca6..9cdaede 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -52,6 +52,7 @@ enum blktrace_act {
__BLK_TA_REMAP, /* bio was remapped */
__BLK_TA_ABORT, /* request aborted */
__BLK_TA_DRV_DATA,  /* driver-specific binary data */
+   __BLK_TA_CGROUP = 1 << 8,   /* from a cgroup*/
 };
 
 /*
@@ -61,6 +62,7 @@ enum blktrace_notify {
__BLK_TN_PROCESS = 0,   /* establish pid/name mapping */
__BLK_TN_TIMESTAMP, /* include system clock */
__BLK_TN_MESSAGE,   /* Character string message */
+   __BLK_TN_CGROUP = __BLK_TA_CGROUP, /* from a cgroup */
 };
 
 
@@ -107,6 +109,7 @@ struct blk_io_trace {
__u32 cpu;  /* on what cpu did it happen */
__u16 error;/* completion error */
__u16 pdu_len;  /* length of data after this trace */
+   /* cgroup id will be stored here if exists */
 };
 
 /*
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 193c5f5..02cfcf2 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../../block/blk.h"
 
@@ -46,10 +47,14 @@ static __cacheline_aligned_in_smp 
DEFINE_SPINLOCK(running_trace_lock);
 
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC  0x1
+#define TRACE_BLK_OPT_CGROUP   0x2
 
 static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
+#ifdef CONFIG_BLK_CGROUP
+   { TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+#endif
{ }
 };
 
@@ -68,7 +73,8 @@ static void blk_unregister_tracepoints(void);
  * Send out a notify message.
  */
 static void trace_note(struct blk_trace *bt, pid_t pid, int action,
-  const void *data, size_t len)
+  const void *data, size_t len,
+  struct kernfs_node_id *cgid)
 {
struct blk_io_trace *t;
struct ring_buffer_event *event = NULL;
@@ -76,12 +82,13 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int 
action,
int pc = 0;
int cpu = smp_processor_id();
bool blk_tracer = blk_tracer_enabled;
+   ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
 
if (blk_tracer) {
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
- sizeof(*t) + len,
+ sizeof(*t) + len + cgid_len,
  0, pc);
if (!event)
return;
@@ -92,17 +99,19 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int 
action,
if (!bt->rchan)
return;
 
-   t = relay_reserve(bt->rchan, sizeof(*t) + len);
+   t = relay_reserve(bt->rchan, sizeof(*t) + len + cgid_len);
if (t) {
t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
t->time = ktime_to_ns(ktime_get());
 record_it:
t->device = bt->dev;
-   t->action = action;
+   t->action = action | (cgid ? __BLK_TN_CGROUP : 0);
   

[PATCH V2 10/12] block: call __bio_free in bio_endio

2017-06-14 Thread Shaohua Li
From: Shaohua Li 

bio_free isn't a good place to free cgroup/integrity info. There are a
lot of cases bio is allocated in special way (for example, in stack) and
never gets called by bio_put hence bio_free, we are leaking memory. This
patch moves the free to bio endio, which should be called anyway. The
__bio_free call in bio_free is kept, in case the bio never gets called
bio endio.

This assumes ->bi_end_io() doesn't access cgroup/integrity info, which
seems true in my audit. Otherwise, we probably must add a flag to
distinguish if bio will be called by bio_put.

Signed-off-by: Shaohua Li 
---
 block/bio-integrity.c | 1 +
 block/bio.c   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..869ac7a 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -120,6 +120,7 @@ void bio_integrity_free(struct bio *bio)
}
 
bio->bi_integrity = NULL;
+   bio->bi_opf &= ~REQ_INTEGRITY;
 }
 EXPORT_SYMBOL(bio_integrity_free);
 
diff --git a/block/bio.c b/block/bio.c
index 888e780..9bfd8d4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1823,6 +1823,8 @@ void bio_endio(struct bio *bio)
}
 
blk_throtl_bio_endio(bio);
+   /* release cgroup/integrity info */
+   __bio_free(bio);
if (bio->bi_end_io)
bio->bi_end_io(bio);
 }
-- 
2.9.3



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Christoph,

> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.

Yup. If we have enough space in the existing flags that's perfect (I
lost count after your op/flag shuffle).

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 10:00 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> So how about we just call it "write_hint"? It sounds mostly like a
>> naming issue to me, as you would then map that to some specific stream
>> in your driver. You're free to do that right now. They are all flags,
>> it's just packed as a value to not waste too much space.
> 
> Sure, that's fine with me. But let's call them bi_hints or something. I
> have a couple that I would like to add that are I/O direction agnostic.

Assuming you saw Christoph's reply, if we just make then bio/req flags,
you can translate them into whatever you want. Work for you?

-- 
Jens Axboe



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 10:01 AM, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 09:53:05AM -0600, Jens Axboe wrote:
>> So how about we just call it "write_hint"? It sounds mostly like a
>> naming issue to me, as you would then map that to some specific stream
>> in your driver. You're free to do that right now. They are all flags,
>> it's just packed as a value to not waste too much space.
> 
> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.

Sure, that's fine with me. Let me make that change.

-- 
Jens Axboe



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Christoph Hellwig
On Wed, Jun 14, 2017 at 09:53:05AM -0600, Jens Axboe wrote:
> So how about we just call it "write_hint"? It sounds mostly like a
> naming issue to me, as you would then map that to some specific stream
> in your driver. You're free to do that right now. They are all flags,
> it's just packed as a value to not waste too much space.

I think what Martin wants (or at least what I'd want him to want) is
to define a few REQ_* bits that mirror the RWF bits, use that to
transfer the information down the stack, and then only translate it
to stream ids in the driver.


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Jens,

> So how about we just call it "write_hint"? It sounds mostly like a
> naming issue to me, as you would then map that to some specific stream
> in your driver. You're free to do that right now. They are all flags,
> it's just packed as a value to not waste too much space.

Sure, that's fine with me. But let's call them bi_hints or something. I
have a couple that I would like to add that are I/O direction agnostic.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Jens Axboe
On 06/14/2017 09:45 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> A new iteration of this patchset, previously known as write streams.
>> As before, this patchset aims at enabling applications split up
>> writes into separate streams, based on the perceived life time
>> of the data written. This is useful for a variety of reasons:
>>
>> - With NVMe 1.3 compliant devices, the device can expose multiple
>>   streams. Separating data written into streams based on life time
>>   can drastically reduce the write amplification. This helps device
>>   endurance, and increases performance. Testing just performed
>>   internally at Facebook with these patches showed up to a 25%
>>   reduction in NAND writes in a RocksDB setup.
>>
>> - Software caching solutions can make more intelligent decisions
>>   on how and where to place data.
>>
>> Contrary to previous patches, we're not exposing numeric stream values
>> anymore.  I've previously advocated for just doing a set of hints that
>> makes sense instead. See the coverage from the LSFMM summit this year:
> 
> I am all for having these write hints. But one request I would like to
> make is that we just make them flags and abolish all notions of the term
> "streams" from block for this particular use case (since it is more
> hinty than streamy).
> 
> There are devices coming where a proper stream ID is prerequisite to
> separate data streams for other reasons than bucketing based on data
> lifetime. So my preference would be that we make the lifetime hints be
> flags in block. That does not preclude using Streams Directives to
> implement them in the NVMe NAND flash case. But it does not cause
> conflicts with the use cases that need "proper" stream IDs for QoS or
> colocation avoidance purposes in SCSI.

So how about we just call it "write_hint"? It sounds mostly like a
naming issue to me, as you would then map that to some specific stream
in your driver. You're free to do that right now. They are all flags,
it's just packed as a value to not waste too much space.

-- 
Jens Axboe



Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Jens,

> A new iteration of this patchset, previously known as write streams.
> As before, this patchset aims at enabling applications split up
> writes into separate streams, based on the perceived life time
> of the data written. This is useful for a variety of reasons:
>
> - With NVMe 1.3 compliant devices, the device can expose multiple
>   streams. Separating data written into streams based on life time
>   can drastically reduce the write amplification. This helps device
>   endurance, and increases performance. Testing just performed
>   internally at Facebook with these patches showed up to a 25%
>   reduction in NAND writes in a RocksDB setup.
>
> - Software caching solutions can make more intelligent decisions
>   on how and where to place data.
>
> Contrary to previous patches, we're not exposing numeric stream values
> anymore.  I've previously advocated for just doing a set of hints that
> makes sense instead. See the coverage from the LSFMM summit this year:

I am all for having these write hints. But one request I would like to
make is that we just make them flags and abolish all notions of the term
"streams" from block for this particular use case (since it is more
hinty than streamy).

There are devices coming where a proper stream ID is prerequisite to
separate data streams for other reasons than bucketing based on data
lifetime. So my preference would be that we make the lifetime hints be
flags in block. That does not preclude using Streams Directives to
implement them in the NVMe NAND flash case. But it does not cause
conflicts with the use cases that need "proper" stream IDs for QoS or
colocation avoidance purposes in SCSI.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 1/6] block: Avoid that blk_exit_rl() triggers a use-after-free

2017-06-14 Thread Bart Van Assche
On 06/13/17 10:54, Ross Zwisler wrote:
> This commit is causing the following kernel BUG for me when I shut
> down my systems:
> 
>   BUG: sleeping function called from invalid context at 
> kernel/workqueue.c:2790
>   in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3

Thanks Ross for the testing and for the report. Can you check whether
the patch below is sufficient to fix this?


Subject: [PATCH] block: Fix a blk_exit_rl() regression

Avoid that the following complaint is reported:

 BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
 in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
 1 lock held by rcuop/3/41:
  #0:  (rcu_callback){..}, at: [] 
rcu_nocb_kthread+0x282/0x500
 Call Trace:
  dump_stack+0x86/0xcf
  ___might_sleep+0x174/0x260
  __might_sleep+0x4a/0x80
  flush_work+0x7e/0x2e0
  __cancel_work_timer+0x143/0x1c0
  cancel_work_sync+0x10/0x20
  blk_throtl_exit+0x25/0x60
  blkcg_exit_queue+0x35/0x40
  blk_release_queue+0x42/0x130
  kobject_put+0xa9/0x190

Reported-by: Ross Zwisler 
Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a 
use-after-free")
Signed-off-by: Bart Van Assche 
Cc: Ross Zwisler 
---
 block/blk-sysfs.c  | 34 ++
 include/linux/blkdev.h |  2 ++
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 283da7fbe034..27aceab1cc31 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -777,24 +777,25 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 }
 
 /**
- * blk_release_queue: - release a  request_queue when it is no longer 
needed
- * @kobj:the kobj belonging to the request queue to be released
+ * __blk_release_queue - release a request queue when it is no longer needed
+ * @work: pointer to the release_work member of the request queue to be 
released
  *
  * Description:
- * blk_release_queue is the pair to blk_init_queue() or
- * blk_queue_make_request().  It should be called when a request queue is
- * being released; typically when a block device is being de-registered.
- * Currently, its primary task it to free all the  request
- * structures that were allocated to the queue and the queue itself.
+ * blk_release_queue is the counterpart of blk_init_queue(). It should be
+ * called when a request queue is being released; typically when a block
+ * device is being de-registered. Its primary task it to free the queue
+ * itself.
  *
- * Note:
+ * Notes:
  * The low level driver must have finished any outstanding requests first
  * via blk_cleanup_queue().
- **/
-static void blk_release_queue(struct kobject *kobj)
+ *
+ * Although blk_release_queue() may be called with preemption disabled,
+ * __blk_release_queue() may sleep.
+ */
+static void __blk_release_queue(struct work_struct *work)
 {
-   struct request_queue *q =
-   container_of(kobj, struct request_queue, kobj);
+   struct request_queue *q = container_of(work, typeof(*q), release_work);
 
if (test_bit(QUEUE_FLAG_POLL_STATS, >queue_flags))
blk_stat_remove_callback(q, q->poll_cb);
@@ -834,6 +835,15 @@ static void blk_release_queue(struct kobject *kobj)
call_rcu(>rcu_head, blk_free_queue_rcu);
 }
 
+static void blk_release_queue(struct kobject *kobj)
+{
+   struct request_queue *q =
+   container_of(kobj, struct request_queue, kobj);
+
+   INIT_WORK(>release_work, __blk_release_queue);
+   schedule_work(>release_work);
+}
+
 static const struct sysfs_ops queue_sysfs_ops = {
.show   = queue_attr_show,
.store  = queue_attr_store,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cc75407881e4..07a26414fdfc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -590,6 +590,8 @@ struct request_queue {
 
size_t  cmd_size;
void*rq_alloc_data;
+
+   struct work_struct  release_work;
 };
 
 #define QUEUE_FLAG_QUEUED  1   /* uses generic tag queueing */
-- 
2.12.2



Re: [xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-14 Thread Jeff Layton
On Tue, 2017-06-13 at 16:40 +0800, Eryu Guan wrote:
> On Mon, Jun 12, 2017 at 08:42:13AM -0400, Jeff Layton wrote:
> > Make a new btrfs/999 test that works the way Chris Mason suggested:
> > 
> > Build a filesystem with 2 devices that stripes the data across
> > both devices, but mirrors metadata across both. Then, make one
> > of the devices fail and see how fsync is handled.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  tests/btrfs/999   | 93 
> > +++
> 
> Missing btrfs/999.out file
> 
> >  tests/btrfs/group |  1 +
> >  2 files changed, 94 insertions(+)
> >  create mode 100755 tests/btrfs/999
> > 
> > diff --git a/tests/btrfs/999 b/tests/btrfs/999
> > new file mode 100755
> > index ..84031cc0d913
> > --- /dev/null
> > +++ b/tests/btrfs/999
> > @@ -0,0 +1,93 @@
> > +#! /bin/bash
> > +# FS QA Test No. 999
> > +#
> > +# Open a file several times, write to it, fsync on all fds and make sure 
> > that
> > +# they all return 0. Change the device to start throwing errors. Write 
> > again
> > +# on all fds and fsync on all fds. Ensure that we get errors on all of 
> > them.
> > +# Then fsync on all one last time and verify that all return 0.
> > +#
> > +#---
> > +# Copyright (c) 2017, Jeff Layton 
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +rm -rf $tmp.* $testdir
> > +_dmerror_cleanup
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/dmerror
> > +
> > +# real QA test starts here
> > +_supported_os Linux
> > +_require_dm_target error
> > +_require_test_program fsync-err
> > +_require_test_program dmerror
> > +
> > +# bring up dmerror device
> > +_scratch_unmount
> > +_dmerror_init
> > +
> > +# Replace first device with error-test device
> > +old_SCRATCH_DEV=$SCRATCH_DEV
> > +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
> > "s#$SCRATCH_DEV#$DMERROR_DEV#"`
> > +SCRATCH_DEV=$DMERROR_DEV
> > +
> > +_require_scratch
> > +_require_scratch_dev_pool
> 
> Need "_require_scratch_dev_pool_equal_size" too, since test creates
> raid1 profile for metadata.
> 
> Thanks,
> Eryu
> 

Is this really needed?

I've been running this test on btrfs with devices that are not of equal
size, and it seems to work just fine. The test doesn't write a lot of
data (just a few megs at most), so I don't think we'll run out of space
unless you have some really small devices in there.

> > +
> > +rm -f $seqres.full
> > +
> > +echo "Format and mount"
> > +
> > +_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
> > +_scratch_mount
> > +
> > +# How much do we need to write? We need to hit all of the stripes. btrfs 
> > uses
> > +# a fixed 64k stripesize, so write enough to hit each one
> > +number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
> > +write_kb=$(($number_of_devices * 64))
> > +_require_fs_space $SCRATCH_MNT $write_kb
> > +
> > +testfile=$SCRATCH_MNT/fsync-err-test
> > +
> > +SCRATCH_DEV=$old_SCRATCH_DEV
> > +$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
> > +
> > +# success, all done
> > +_dmerror_load_working_table
> > +
> > +# fs may be corrupt after this -- attempt to repair it
> > +_repair_scratch_fs >> $seqres.full
> > +
> > +# remove dmerror device
> > +_dmerror_cleanup
> > +
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/group b/tests/btrfs/group
> > index 6f19619e877c..8dbdfbfe29fd 100644
> > --- a/tests/btrfs/group
> > +++ b/tests/btrfs/group
> > @@ -145,3 +145,4 @@
> >  141 auto quick
> >  142 auto quick
> >  143 auto quick
> > +999 auto quick
> > -- 
> > 2.13.0
> > 

-- 
Jeff Layton 


Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-14 Thread Christoph Hellwig
On Tue, Jun 13, 2017 at 06:24:32AM -0400, Jeff Layton wrote:
> That's definitely what I want for the endgame here. My plan was to add
> this flag for now, and then eventually reverse it (or drop it) once all
> or most filesystems are converted.
> 
> We can do it that way from the get-go if you like. It'll mean tossing in
>  a patch add this flag to all filesystems that have an fsync operation
> and that use the pagecache, and then gradually remove it from them as we
> convert them.
> 
> Which method do you prefer?

Please do it from the get-go.  Or in fact figure out if we can get
away without it entirely.  Moving the error reporting into ->fsync
should help greatly with that, so what's missing after that?