Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Luis R. Rodriguez
On Thu, Nov 30, 2017 at 06:13:10PM +0100, Jan Kara wrote:
> On Wed 29-11-17 15:23:48, Luis R. Rodriguez wrote:
> > The question of whether or not a superblock is frozen needs to be
> > augmented in the future to account for differences between a user
> > initiated freeze and a freeze initiated automatically by the kernel
> > on its own behalf.
> > 
> > Provide helpers that can be used instead, so that we don't have to
> > expand checks later at these same call sites as we expand the
> > definition of a frozen superblock.
> > 
> > Signed-off-by: Luis R. Rodriguez 
> 
> So helpers are fine but...
> 
> > +/**
> > + * sb_is_frozen_by_user - is superblock frozen by a user call
> > + * @sb: the super to check
> > + *
> > + * Returns true if the super freeze was initiated by userspace, for
> > + * instance, an ioctl call.
> > + */
> > +static inline bool sb_is_frozen_by_user(struct super_block *sb)
> > +{
> > +   return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
> > +}
> 
> ... I dislike the _by_user() suffix as there may be different places that
> call freeze_super() (e.g. device mapper does this during some operations).
> Clearly we need to distinguish "by system suspend" and "the other" cases.
> So please make this clear in the naming.
> 
> In fact, what might be a cleaner solution is to introduce a 'freeze_count'
> for superblock freezing (we already do have this for block devices). Then
> you don't need to differentiate these two cases - but you'd still need to
> properly handle cleanup if freezing of all superblocks fails in the middle.
> So I'm not 100% this works out nicely in the end. But it's certainly worth
> a consideration.

Seems reasonable.
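
For illustration, a minimal sketch of the counter-based approach (the
s_freeze_count field and the helper names below are hypothetical, not
from any posted patch, and some external lock is assumed to protect the
counter):

/* Nested freeze: only the first caller actually freezes. */
static int sb_freeze_get(struct super_block *sb)
{
	int ret = 0;

	if (sb->s_freeze_count++ == 0)
		ret = freeze_super(sb);
	if (ret)
		sb->s_freeze_count--;
	return ret;
}

/* Nested thaw: only the last caller actually thaws. */
static int sb_freeze_put(struct super_block *sb)
{
	if (WARN_ON_ONCE(!sb->s_freeze_count))
		return -EINVAL;
	if (--sb->s_freeze_count > 0)
		return 0;
	return thaw_super(sb);
}

With something like this the suspend path and the ioctl path could share
the same entry points, and only the last thaw would actually unfreeze the
filesystem. The cleanup problem Jan mentions (freezing all superblocks
failing in the middle) would still need handling in the caller.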

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Luis R. Rodriguez
On Sun, Apr 22, 2018 at 01:53:23AM +0200, Jan Kara wrote:
> On Fri 20-04-18 11:49:32, Luis R. Rodriguez wrote:
> > On Tue, Apr 17, 2018 at 05:59:36PM -0700, Luis R. Rodriguez wrote:
> > > On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > I think I owe you a reply here... Sorry that it took so long.
> > > 
> > > Took me just as long :)
> > > 
> > > > On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > > > > 
> > > > > I'll note that it's still not perfectly clear whether the semantics
> > > > > behind freeze_bdev() fully match what I described above. That still
> > > > > needs to be vetted. For instance, does thaw_bdev() keep a superblock
> > > > > frozen if an ioctl-initiated freeze had occurred before? If so then
> > > > > great. Otherwise I think we'll need to distinguish the ioctl
> > > > > interface. The worst possible case is that bdev semantics and
> > > > > in-kernel semantics differ somehow; then that will really create a
> > > > > holy fucking mess.
> > > > 
> > > > I believe nobody really thought about mixing those two interfaces to fs
> > > > freezing and so the behavior is basically defined by the implementation.
> > > > That is:
> > > > 
> > > > freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > > freeze_bdev() on sb frozen by freeze_bdev() -> success
> > > > ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> > > > ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > > 
> > > > thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
> > > 
> > > Phew, so this is what we want for the in-kernel freezing so we're good
> > > and *can* combine these then.
> > 
> > I double checked, and I don't see where you get EINVAL for this case.
> > We *do* keep the sb frozen though, which is good, and the worst fear
> > I had was that we did not. However, we return 0 if there was already
> > a prior freeze_bdev() or ioctl_fsfreeze() other than the context that
> > started the prior freeze (--bdev->bd_fsfreeze_count > 0).
> > 
> > Currently, -EINVAL is only returned if there were no freezers.
> > 
> > int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> > {
> > int error = -EINVAL;
> > 
> > mutex_lock(&bdev->bd_fsfreeze_mutex);
> > if (!bdev->bd_fsfreeze_count)
> > goto out;
> 
> But this is precisely where we'd bail if we freeze sb by ioctl_fsfreeze()
> but try to thaw by thaw_bdev(). ioctl_fsfreeze() does not touch
> bd_fsfreeze_count...

Ah, yes, I see that now, thanks!

  Luis


Re: [PATCH 03/11] fs: add frozen sb state helpers

2018-04-21 Thread Jan Kara
On Fri 20-04-18 11:49:32, Luis R. Rodriguez wrote:
> On Tue, Apr 17, 2018 at 05:59:36PM -0700, Luis R. Rodriguez wrote:
> > On Thu, Dec 21, 2017 at 12:03:29PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I think I owe you a reply here... Sorry that it took so long.
> > 
> > Took me just as long :)
> > 
> > > On Fri 01-12-17 22:13:27, Luis R. Rodriguez wrote:
> > > > 
> > > > I'll note that it's still not perfectly clear whether the semantics
> > > > behind freeze_bdev() fully match what I described above. That still
> > > > needs to be vetted. For instance, does thaw_bdev() keep a superblock
> > > > frozen if an ioctl-initiated freeze had occurred before? If so then
> > > > great. Otherwise I think we'll need to distinguish the ioctl
> > > > interface. The worst possible case is that bdev semantics and
> > > > in-kernel semantics differ somehow; then that will really create a
> > > > holy fucking mess.
> > > 
> > > I believe nobody really thought about mixing those two interfaces to fs
> > > freezing and so the behavior is basically defined by the implementation.
> > > That is:
> > > 
> > > freeze_bdev() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > freeze_bdev() on sb frozen by freeze_bdev() -> success
> > > ioctl_fsfreeze() on sb frozen by freeze_bdev() -> EBUSY
> > > ioctl_fsfreeze() on sb frozen by ioctl_fsfreeze() -> EBUSY
> > > 
> > > thaw_bdev() on sb frozen by ioctl_fsfreeze() -> EINVAL
> > 
> > Phew, so this is what we want for the in-kernel freezing so we're good
> > and *can* combine these then.
> 
> I double checked, and I don't see where you get EINVAL for this case.
> We *do* keep the sb frozen though, which is good, and the worst fear
> I had was that we did not. However, we return 0 if there was already
> a prior freeze_bdev() or ioctl_fsfreeze() other than the context that
> started the prior freeze (--bdev->bd_fsfreeze_count > 0).
> 
> Currently, -EINVAL is only returned if there were no freezers.
> 
> int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> {
>   int error = -EINVAL;
> 
>   mutex_lock(&bdev->bd_fsfreeze_mutex);
>   if (!bdev->bd_fsfreeze_count)
>   goto out;

But this is precisely where we'd bail if we freeze sb by ioctl_fsfreeze()
but try to thaw by thaw_bdev(). ioctl_fsfreeze() does not touch
bd_fsfreeze_count...
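
To make the asymmetry concrete, a heavily simplified sketch (not verbatim
kernel code) of the two freeze entry points being compared:

/* ioctl path (FIFREEZE): freezes the superblock directly and never
 * touches bdev->bd_fsfreeze_count. */
static int ioctl_fsfreeze_sketch(struct super_block *sb)
{
	return freeze_super(sb);
}

/* in-kernel path: counts freezes per block device, so thaw_bdev() only
 * unwinds what freeze_bdev() itself set up. */
static int freeze_bdev_sketch(struct block_device *bdev,
			      struct super_block *sb)
{
	int error = 0;

	mutex_lock(&bdev->bd_fsfreeze_mutex);
	if (++bdev->bd_fsfreeze_count == 1)
		error = freeze_super(sb);
	mutex_unlock(&bdev->bd_fsfreeze_mutex);
	return error;
}

Since the ioctl path leaves bd_fsfreeze_count at zero, the
!bdev->bd_fsfreeze_count check quoted above is exactly where a subsequent
thaw_bdev() bails out before it ever reaches thaw_super().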

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/4] iomap: iomap_dio_rw() handles all sync writes

2018-04-21 Thread Jan Kara
On Wed 18-04-18 14:08:26, Dave Chinner wrote:
> From: Dave Chinner 
> 
> Currently iomap_dio_rw() only handles (data)sync write completions
> for AIO. This means we can't optimise non-AIO IO to minimise device
> flushes as we can't tell the caller whether a flush is required or
> not.
> 
> To solve this problem and enable further optimisations, make
> iomap_dio_rw responsible for data sync behaviour for all IO, not
> just AIO.
> 
> In doing so, the sync operation is now accounted as part of the DIO
> IO by inode_dio_end(), hence post-IO data stability updates will no
> longer race against operations that serialise via inode_dio_wait()
> such as truncate or hole punch.
> 
> Signed-Off-By: Dave Chinner 
> Reviewed-by: Christoph Hellwig 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/iomap.c| 22 +++---
>  fs/xfs/xfs_file.c |  5 -
>  2 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..1f59c2d9ade6 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -685,6 +685,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_data);
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_NEED_SYNC  (1 << 29)
>  #define IOMAP_DIO_WRITE  (1 << 30)
>  #define IOMAP_DIO_DIRTY  (1 << 31)
>  
> @@ -759,6 +760,13 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   dio_warn_stale_pagecache(iocb->ki_filp);
>   }
>  
> + /*
> +  * If this is a DSYNC write, make sure we push it to stable storage now
> +  * that we've written data.
> +  */
> + if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> + ret = generic_write_sync(iocb, ret);
> +
>   inode_dio_end(file_inode(iocb->ki_filp));
>   kfree(dio);
>  
> @@ -768,14 +776,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  static void iomap_dio_complete_work(struct work_struct *work)
>  {
>   struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> - struct kiocb *iocb = dio->iocb;
> - bool is_write = (dio->flags & IOMAP_DIO_WRITE);
> - ssize_t ret;
>  
> - ret = iomap_dio_complete(dio);
> - if (is_write && ret > 0)
> - ret = generic_write_sync(iocb, ret);
> - iocb->ki_complete(iocb, ret, 0);
> + dio->iocb->ki_complete(dio->iocb, iomap_dio_complete(dio), 0);
>  }
>  
>  /*
> @@ -961,6 +963,10 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>   return copied;
>  }
>  
> +/*
> + * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether 
> the IO
> + * is being issued as AIO or not.
> + */
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, iomap_dio_end_io_t end_io)
> @@ -1006,6 +1012,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->flags |= IOMAP_DIO_DIRTY;
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
> + if (iocb->ki_flags & IOCB_DSYNC)
> + dio->flags |= IOMAP_DIO_NEED_SYNC;
>   flags |= IOMAP_WRITE;
>   }
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6f15027661b6..0c4b8313d544 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -570,11 +570,6 @@ xfs_file_dio_aio_write(
>* complete fully or fail.
>*/
>   ASSERT(ret < 0 || ret == count);
> -
> - if (ret > 0) {
> - /* Handle various SYNC-type writes */
> - ret = generic_write_sync(iocb, ret);
> - }
>   return ret;
>  }
>  
> -- 
> 2.16.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/4] iomap: Use FUA for pure data O_DSYNC DIO writes

2018-04-21 Thread Jan Kara
On Wed 18-04-18 14:08:28, Dave Chinner wrote:
> @@ -1012,8 +1035,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   dio->flags |= IOMAP_DIO_DIRTY;
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
> - if (iocb->ki_flags & IOCB_DSYNC)
> + if (iocb->ki_flags & IOCB_DSYNC) {
>   dio->flags |= IOMAP_DIO_NEED_SYNC;
> + /*
> +  * We optimistically try using FUA for this IO.  Any
> +  * non-FUA write that occurs will clear this flag, hence
> +  * we know before completion whether a cache flush is
> +  * necessary.
> +  */
> + dio->flags |= IOMAP_DIO_WRITE_FUA;
> + }

So I don't think this is quite correct. IOCB_DSYNC gets set also for O_SYNC
writes (in that case we also set IOCB_SYNC). And for those we cannot use
the FUA optimization AFAICT (definitely IOMAP_F_DIRTY isn't a safe
indicator of a need of full fsync for O_SYNC). Other than that the patch
looks good to me.
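
For illustration, the kind of guard being described could look like this
(a sketch against the quoted hunk, not necessarily what the final version
of the patch does):

	if (iocb->ki_flags & IOCB_DSYNC) {
		dio->flags |= IOMAP_DIO_NEED_SYNC;
		/*
		 * Only attempt the FUA optimisation for pure O_DSYNC
		 * writes: O_SYNC additionally sets IOCB_SYNC and needs a
		 * full sync that FUA alone cannot provide.
		 */
		if (!(iocb->ki_flags & IOCB_SYNC))
			dio->flags |= IOMAP_DIO_WRITE_FUA;
	}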

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: question about request merge

2018-04-21 Thread Jens Axboe
On 4/21/18 8:07 AM, Zhengyuan Liu wrote:
> 2018-04-20 22:34 GMT+08:00 Jens Axboe :
>> On 4/19/18 9:51 PM, Zhengyuan Liu wrote:
>>> Hi, Shaohua
>>>
>>> I found that it indeed doesn't do a front merge when two threads flush
>>> their plug lists concurrently. To reproduce this, I prepared two IO
>>> threads, named a0.io and a1.io.
>>> Thread a1.io uses libaio to write 5 requests:
>>> sectors: 16 + 8, 40 + 8, 64 + 8, 88 + 8, 112 + 8
>>> Thread a0.io uses libaio to write another 5 requests:
>>> sectors: 8 + 8, 32 + 8, 56 + 8, 80 + 8, 104 + 8
>>
>> I'm cutting some of the below.
>>
>> Thanks for the detailed email. It's mostly on purpose that we don't
>> spend cycles and memory on maintaining a separate front merge hash,
>> since it's generally not something that happens very often. If you have
>> a thread pool doing IO and split sequential IO such that you would
>> benefit a lot from front merging, then I would generally claim that
>> you're not writing your app in the most optimal manner.
>>
> 
> Thanks for the explanation. I was only considering the problem from the
> code's perspective and ignoring how real applications behave.

That's quite by design and not accidental.

>> So I'm curious, what's the big interest in front merging?
> 
> If it's not something that happens very often, I agree it's not worth
> supporting front merging either.
> 
> By the way, I have another question: why doesn't blktrace trace the back
> merging of requests while flushing plugged requests to the queue? If it
> did, we might get a clearer view of IO merging.

Not sure I follow, exactly where is a back merge trace missing?

-- 
Jens Axboe



Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Jens Axboe
On 4/21/18 9:05 AM, Bart Van Assche wrote:
> On Sat, 2018-04-21 at 22:55 +0800, jianchao.wang wrote:
>> So we have a __deadline as below:
>>  
>>        deadline       gen   state
>>  |________________|_______|______|
>>   \_______ _______/
>>           V
>>      granularity
>>
>> Even though we usually have a 30~60s timeout,
>> we should support at least a 1s granularity.
>> How do we decide the granularity?
> 
> Your diagram shows that the uppermost bits are used for the deadline. What I
> proposed is to use the lower 32 bits for the deadline such that we again have
> a resolution of 1/HZ.

It honestly doesn't matter how we do it, granularity could be exactly
the same. But yeah, storing it at the lower bits is perfectly fine,
either way works. Only difference is in the encoding, functionally
they can be identical.
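
For illustration, packing along the lines Bart suggests could look like
the sketch below; the names and field widths are made up here, a 64-bit
unsigned long is assumed, and the existing 2-bit state is left out:

#define RQ_DEADLINE_BITS	32
#define RQ_DEADLINE_MASK	((1UL << RQ_DEADLINE_BITS) - 1)

/* Low 32 bits: deadline in jiffies (1/HZ resolution); high bits: gen. */
static inline unsigned long rq_pack_deadline(unsigned long deadline,
					     unsigned long gen)
{
	return (gen << RQ_DEADLINE_BITS) | (deadline & RQ_DEADLINE_MASK);
}

static inline unsigned long rq_deadline(unsigned long packed)
{
	return packed & RQ_DEADLINE_MASK;
}

static inline unsigned long rq_generation(unsigned long packed)
{
	return packed >> RQ_DEADLINE_BITS;
}

The truncation to 32 bits is what keeps the 1/HZ resolution Bart refers
to; the generation in the upper bits is what changes when a request is
reused, which is the part the timeout code actually needs to compare.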

-- 
Jens Axboe



Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Bart Van Assche
On Sat, 2018-04-21 at 22:55 +0800, jianchao.wang wrote:
> So we have a __deadline as below:
>  
>        deadline       gen   state
>  |________________|_______|______|
>   \_______ _______/
>           V
>      granularity
> 
> Even though we usually have a 30~60s timeout,
> we should support at least a 1s granularity.
> How do we decide the granularity?

Your diagram shows that the uppermost bits are used for the deadline. What I
proposed is to use the lower 32 bits for the deadline such that we again have
a resolution of 1/HZ.

Bart.




Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang


On 04/21/2018 10:10 PM, Jens Axboe wrote:
> On 4/21/18 7:34 AM, jianchao.wang wrote:
>> Hi Bart
>>
>> Thanks for your kind response.
>>
>> On 04/20/2018 10:11 PM, Bart Van Assche wrote:
>>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>>>> Hi Bart
>>>>
>>>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>>>> Use the deadline instead of the request generation to detect whether
>>>>>   or not a request timer fired after reinitialization of a request
>>>>
>>>> Maybe using the deadline to do this is not suitable.
>>>>
>>>> Let's think of the following scenario.
>>>>
>>>> T1/T2: times in nanoseconds
>>>> J1/J2: times in jiffies
>>>>
>>>> rq is started at T1,J1
>>>> blk_mq_timeout_work
>>>>   -> blk_mq_check_expired
>>>>     -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>>>                              rq is completed and freed
>>>>                              rq is allocated and started again on T2
>>>>                              rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
>>>>   -> synchronize srcu/rcu
>>>>   -> blk_mq_terminate_expired
>>>>     -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>>>
>>>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>>>    if J2 - J1 < 4 jiffies, we will get the same deadline value.
>>>>
>>>> 2. even if we do some bit shift when we save and get the deadline,
>>>>    if T2 - T1 < the time of 1 jiffy, we still get the same deadline.
>>>
>>> Hello Jianchao,
>>>
>>> How about using the upper 16 or 32 bits of rq->__deadline to store a
>>> generation number? I don't know any block driver for which the product
>>> of (deadline in seconds) and HZ exceeds (1 << 32).
>>>
>> Yes, we don't need such a long timeout value.
>> However, req->__deadline is an absolute time, not a relative one,
>> and its type is unsigned long, the same as jiffies.
>> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for a
>> generation number, how do we handle that when calling mod_timer()?
> 
> We don't need to support timeouts to the full granularity of jiffies.
> Most normal commands are in the 30-60s range, and some lower level
> commands may take minutes. We're well within the range of chopping
> off some bits and not having to worry about that at all.
> 
Yes.
So we have a __deadline as below:
 
        deadline       gen   state
  |________________|_______|______|
   \_______ _______/
           V
      granularity

Even though we usually have a 30~60s timeout,
we should support at least a 1s granularity.
How do we decide the granularity?

Thanks
Jianchao


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Jens Axboe
On 4/21/18 7:34 AM, jianchao.wang wrote:
> Hi Bart
> 
> Thanks for your kind response.
> 
> On 04/20/2018 10:11 PM, Bart Van Assche wrote:
>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>>> Hi Bart
>>>
>>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>>> Use the deadline instead of the request generation to detect whether
>>>>   or not a request timer fired after reinitialization of a request
>>>
>>> Maybe using the deadline to do this is not suitable.
>>>
>>> Let's think of the following scenario.
>>>
>>> T1/T2: times in nanoseconds
>>> J1/J2: times in jiffies
>>>
>>> rq is started at T1,J1
>>> blk_mq_timeout_work
>>>   -> blk_mq_check_expired
>>>     -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>>                              rq is completed and freed
>>>                              rq is allocated and started again on T2
>>>                              rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
>>>   -> synchronize srcu/rcu
>>>   -> blk_mq_terminate_expired
>>>     -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>>
>>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>>    if J2 - J1 < 4 jiffies, we will get the same deadline value.
>>>
>>> 2. even if we do some bit shift when we save and get the deadline,
>>>    if T2 - T1 < the time of 1 jiffy, we still get the same deadline.
>>
>> Hello Jianchao,
>>
>> How about using the upper 16 or 32 bits of rq->__deadline to store a
>> generation number? I don't know any block driver for which the product
>> of (deadline in seconds) and HZ exceeds (1 << 32).
>>
> Yes, we don't need such a long timeout value.
> However, req->__deadline is an absolute time, not a relative one,
> and its type is unsigned long, the same as jiffies.
> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for a
> generation number, how do we handle that when calling mod_timer()?

We don't need to support timeouts to the full granularity of jiffies.
Most normal commands are in the 30-60s range, and some lower level
commands may take minutes. We're well within the range of chopping
off some bits and not having to worry about that at all.

-- 
Jens Axboe



Re: question about request merge

2018-04-21 Thread Zhengyuan Liu
2018-04-20 22:34 GMT+08:00 Jens Axboe :
> On 4/19/18 9:51 PM, Zhengyuan Liu wrote:
>> Hi, Shaohua
>>
>> I found that it indeed doesn't do a front merge when two threads flush
>> their plug lists concurrently. To reproduce this, I prepared two IO
>> threads, named a0.io and a1.io.
>> Thread a1.io uses libaio to write 5 requests:
>> sectors: 16 + 8, 40 + 8, 64 + 8, 88 + 8, 112 + 8
>> Thread a0.io uses libaio to write another 5 requests:
>> sectors: 8 + 8, 32 + 8, 56 + 8, 80 + 8, 104 + 8
>
> I'm cutting some of the below.
>
> Thanks for the detailed email. It's mostly on purpose that we don't
> spend cycles and memory on maintaining a separate front merge hash,
> since it's generally not something that happens very often. If you have
> a thread pool doing IO and split sequential IO such that you would
> benefit a lot from front merging, then I would generally claim that
> you're not writing your app in the most optimal manner.
>

Thanks for the explanation. I was only considering the problem from the
code's perspective and ignoring how real applications behave.

> So I'm curious, what's the big interest in front merging?

If it's not something that happens very often, I agree it's not worth
supporting front merging either.

By the way, I have another question: why doesn't blktrace trace the back
merging of requests while flushing plugged requests to the queue? If it
did, we might get a clearer view of IO merging.

>
> --
> Jens Axboe
>

Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang
Hi Bart

Thanks for your kind response.

On 04/20/2018 10:11 PM, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>> Use the deadline instead of the request generation to detect whether
>>>   or not a request timer fired after reinitialization of a request
>>
>> Maybe using the deadline to do this is not suitable.
>>
>> Let's think of the following scenario.
>>
>> T1/T2: times in nanoseconds
>> J1/J2: times in jiffies
>>
>> rq is started at T1,J1
>> blk_mq_timeout_work
>>   -> blk_mq_check_expired
>>     -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>                              rq is completed and freed
>>                              rq is allocated and started again on T2
>>                              rq->__deadline = J2 + MQ_RQ_IN_FLIGHT
>>   -> synchronize srcu/rcu
>>   -> blk_mq_terminate_expired
>>     -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>
>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>    if J2 - J1 < 4 jiffies, we will get the same deadline value.
>>
>> 2. even if we do some bit shift when we save and get the deadline,
>>    if T2 - T1 < the time of 1 jiffy, we still get the same deadline.
> 
> Hello Jianchao,
> 
> How about using the upper 16 or 32 bits of rq->__deadline to store a
> generation number? I don't know any block driver for which the product
> of (deadline in seconds) and HZ exceeds (1 << 32).
> 
Yes, we don't need such a long timeout value.
However, req->__deadline is an absolute time, not a relative one,
and its type is unsigned long, the same as jiffies.
If we reserve some bits (not just 1 or 2 bits) of req->__deadline for a
generation number, how do we handle that when calling mod_timer()?

Thanks
Jianchao
 


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-21 Thread Paolo Valente


> On 20 Apr 2018, at 22:23, Kees Cook  wrote:
> 
> On Thu, Apr 19, 2018 at 2:32 AM, Paolo Valente  
> wrote:
>> I'm missing something here.  When the request gets completed in the
>> first place, the hook bfq_finish_requeue_request gets called, and that
>> hook clears both ->elv.priv elements (as the request has a non-null
>> elv.icq).  So, when bfq gets the same request again, those elements
>> must be NULL.  What am I getting wrong?
>> 
>> I have some more concern on this point, but I'll stick to this for the
>> moment, to not create more confusion.
> 
> I don't know the "how", I only found the "what". :)

Got it, although I think you did much more than that ;)

Anyway, my reply was exactly to Jens' detailed description of the
how.  And my concern is that there seems to be an inconsistency in
that description.  In addition, Jens is proposing a patch based on
that description.  But if this inconsistency is not resolved, that
patch may eliminate the symptom at hand while not fixing the real
cause, or may even contribute to burying it deeper.

> If you want, grab
> the reproducer VM linked to earlier in this thread; it'll hit the
> problem within about 30 seconds of running the reproducer.
> 

Yep.  Actually, I've been investigating this kind of failure, in
different incarnations, for months now.  Other examples in this
respect are the srp-test failures reported by Bart, e.g., here [1].
According to my analysis, the cause of the problem is somewhere in
blk-mq, outside bfq.  Unfortunately, I haven't managed to find exactly
where it is, mainly because of my limited expertise on blk-mq
internals.  So I have asked Jens, Mike and other knowledgeable people
for help and suggestions.  Probably those requests somehow got lost in
those threads, but your results, Kees, and the analysis that followed
from Jens now seem to be carrying us to the solution of this
not-so-recent issue.  Time will tell.


Thanks,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg22760.html


> -Kees
> 
> -- 
> Kees Cook
> Pixel Security