Re: [PATCH] block, bfq: keep peak_rate estimation within range 1..2^32-1

2018-03-19 Thread Paolo Valente


> On 5 Mar 2018, at 04:48, Konstantin Khlebnikov wrote:
> 
> Rate should never overflow or become zero because it is used as a divider.
> This patch accumulates it with saturation.
> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
> block/bfq-iosched.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index aeca22d91101..a236c8d541b5 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -2546,7 +2546,8 @@ static void bfq_reset_rate_computation(struct bfq_data 
> *bfqd,
> 
> static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
> {
> - u32 rate, weight, divisor;
> + u32 weight, divisor;
> + u64 rate;
> 
>   /*
>* For the convergence property to hold (see comments on
> @@ -2634,9 +2635,10 @@ static void bfq_update_rate_reset(struct bfq_data 
> *bfqd, struct request *rq)
>*/
>   bfqd->peak_rate *= divisor-1;
>   bfqd->peak_rate /= divisor;
> - rate /= divisor; /* smoothing constant alpha = 1/divisor */
> + do_div(rate, divisor);  /* smoothing constant alpha = 1/divisor */
> 
> - bfqd->peak_rate += rate;
> + /* rate should never overflow or become zero */

It is bfqd->peak_rate that is used as a divider, and bfqd->peak_rate doesn't 
risk becoming zero even if the variable 'rate' is zero here.

So I guess the reason why you consider the possibility that bfqd->peak_rate 
becomes zero is an overflow when summing 'rate'. But, according to my 
calculations, this should be impossible for devices with sensible speeds.

These are the reasons why I decided that a 32-bit variable was enough, 
without any additional clamping. Did I make any mistake in my evaluation?

Anyway, even if I did make some mistake about the maximum possible value of the 
device rate, and the latter turned out to be too high for bfqd->peak_rate to 
contain, then I guess the right solution would not be to clamp the actual rate 
to U32_MAX, but to move bfqd->peak_rate to 64 bits. Or am I missing something else?

Thanks,
Paolo


> + bfqd->peak_rate = clamp_t(u64, rate + bfqd->peak_rate, 1, U32_MAX);
>   update_thr_responsiveness_params(bfqd);
> 
> reset_computation:
> 



Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-24 Thread Paolo Valente


> On 24 Feb 2018, at 13:05, Ming Lei <ming@redhat.com> wrote:
> 
> On Sat, Feb 24, 2018 at 08:54:31AM +0100, Paolo Valente wrote:
>> 
>> 
>>> On 23 Feb 2018, at 17:17, Ming Lei <ming@redhat.com> wrote:
>>> 
>>> Hi Paolo,
>>> 
>>> On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote:
>>>> 
>>>> 
>>>>> On 23 Feb 2018, at 16:07, Ming Lei <ming@redhat.com> wrote:
>>>>> 
>>>>> Hi Paolo,
>>>>> 
>>>>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
>>>>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>>>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>>>>> be re-inserted into the active I/O scheduler for that device. As a
>>>>> 
>>>>> No, this behaviour isn't related with commit a6a252e64914, and
>>>>> it has been there since blk_mq_requeue_request() is introduced.
>>>>> 
>>>> 
>>>> Hi Ming,
>>>> actually, we wrote the above statement after simply following the call
>>>> chain that led to the failure.  In this respect, the change in commit
>>>> a6a252e64914:
>>>> 
>>>> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>>>> +  bool has_sched,
>>>>  struct request *rq)
>>>> {
>>>> -   if (rq->tag == -1) {
>>>> +   /* dispatch flush rq directly */
>>>> +   if (rq->rq_flags & RQF_FLUSH_SEQ) {
>>>> +   spin_lock(&hctx->lock);
>>>> +   list_add(&rq->queuelist, &hctx->dispatch);
>>>> +   spin_unlock(&hctx->lock);
>>>> +   return true;
>>>> +   }
>>>> +
>>>> +   if (has_sched) {
>>>>   rq->rq_flags |= RQF_SORTED;
>>>> -   return false;
>>>> +   WARN_ON(rq->tag != -1);
>>>>   }
>>>> 
>>>> -   /*
>>>> -* If we already have a real request tag, send directly to
>>>> -* the dispatch list.
>>>> -*/
>>>> -   spin_lock(&hctx->lock);
>>>> -   list_add(&rq->queuelist, &hctx->dispatch);
>>>> -   spin_unlock(&hctx->lock);
>>>> -   return true;
>>>> +   return false;
>>>> }
>>>> 
>>>> makes blk_mq_sched_bypass_insert return false for all non-flush
>>>> requests.  From that, the anomaly described in our commit follows, for
>>>> bfq, or any stateful scheduler that waits for the completion of requests
>>>> that passed through it.  I'm elaborating again a little bit on this in
>>>> my replies to your next points below.
>>> 
>>> Before a6a252e64914, follows blk_mq_sched_bypass_insert()
>>> 
>>>  if (rq->tag == -1) {
>>>  rq->rq_flags |= RQF_SORTED;
>>>  return false;
>>>}
>>> 
>>>  /*
>>>   * If we already have a real request tag, send directly to
>>>   * the dispatch list.
>>>   */
>>>  spin_lock(&hctx->lock);
>>>  list_add(&rq->queuelist, &hctx->dispatch);
>>>  spin_unlock(&hctx->lock);
>>>  return true;
>>> 
>>> This function still returns false for all non-flush requests, so nothing
>>> changes wrt. this kind of handling.
>>> 
>> 
>> Yep Ming.  I don't have the expertise to tell you why, but the failure
>> in the USB case was caused by an rq that is not flush, but for which
>> rq->tag != -1.  So, the previous version of blk_mq_sched_bypass_insert
>> returned true, and there was no failure, while after commit
>> a6a252e64914 the function returns false and the failure occurs if bfq
>> does not exploit the requeue hook.
>> 
>> You have actually shown it yourself, several months ago, through the
>> simple code instrumentation you made and used to show that bfq was
>> stuck.  And I guess it can still be reproduced very easily, unless
>> something else has changed in blk-mq.
>> 
>> BTW, if you can shed light on this fact, that would be a great
>> occasion to learn for me.
> 
> The difference should be made by commit 923218f6166a84 (blk-mq: don't
> allocate driver tag upfront for

Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Paolo Valente


> On 23 Feb 2018, at 17:17, Ming Lei <ming@redhat.com> wrote:
> 
> Hi Paolo,
> 
> On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote:
>> 
>> 
>>> On 23 Feb 2018, at 16:07, Ming Lei <ming@redhat.com> wrote:
>>> 
>>> Hi Paolo,
>>> 
>>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
>>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>>>> be re-inserted into the active I/O scheduler for that device. As a
>>> 
>>> No, this behaviour isn't related with commit a6a252e64914, and
>>> it has been there since blk_mq_requeue_request() is introduced.
>>> 
>> 
>> Hi Ming,
>> actually, we wrote the above statement after simply following the call
>> chain that led to the failure.  In this respect, the change in commit
>> a6a252e64914:
>> 
>> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>> +  bool has_sched,
>>   struct request *rq)
>> {
>> -   if (rq->tag == -1) {
>> +   /* dispatch flush rq directly */
>> +   if (rq->rq_flags & RQF_FLUSH_SEQ) {
>> +   spin_lock(&hctx->lock);
>> +   list_add(&rq->queuelist, &hctx->dispatch);
>> +   spin_unlock(&hctx->lock);
>> +   return true;
>> +   }
>> +
>> +   if (has_sched) {
>>rq->rq_flags |= RQF_SORTED;
>> -   return false;
>> +   WARN_ON(rq->tag != -1);
>>}
>> 
>> -   /*
>> -* If we already have a real request tag, send directly to
>> -* the dispatch list.
>> -*/
>> -   spin_lock(&hctx->lock);
>> -   list_add(&rq->queuelist, &hctx->dispatch);
>> -   spin_unlock(&hctx->lock);
>> -   return true;
>> +   return false;
>> }
>> 
>> makes blk_mq_sched_bypass_insert return false for all non-flush
>> requests.  From that, the anomaly described in our commit follows, for
>> bfq, or any stateful scheduler that waits for the completion of requests
>> that passed through it.  I'm elaborating again a little bit on this in
>> my replies to your next points below.
> 
> Before a6a252e64914, follows blk_mq_sched_bypass_insert()
> 
>   if (rq->tag == -1) {
>   rq->rq_flags |= RQF_SORTED;
>   return false;
>  }
> 
>   /*
>* If we already have a real request tag, send directly to
>* the dispatch list.
>*/
>   spin_lock(&hctx->lock);
>   list_add(&rq->queuelist, &hctx->dispatch);
>   spin_unlock(&hctx->lock);
>   return true;
> 
> This function still returns false for all non-flush requests, so nothing
> changes wrt. this kind of handling.
> 

Yep Ming.  I don't have the expertise to tell you why, but the failure
in the USB case was caused by an rq that is not flush, but for which
rq->tag != -1.  So, the previous version of blk_mq_sched_bypass_insert
returned true, and there was no failure, while after commit
a6a252e64914 the function returns false and the failure occurs if bfq
does not exploit the requeue hook.

You have actually shown it yourself, several months ago, through the
simple code instrumentation you made and used to show that bfq was
stuck.  And I guess it can still be reproduced very easily, unless
something else has changed in blk-mq.

BTW, if you can shed light on this fact, that would be a great
occasion to learn for me.

>> 
>> I don't mean that this change is an error; it simply puts a stateful
>> scheduler in an inconsistent state, unless the scheduler properly
>> handles the requeue that precedes the re-insertion into the
>> scheduler.
>> 
>> If this clarifies the situation but there is still some misleading
>> statement in the commit, just help me understand better, and I'll be
>> glad to rectify it somehow, if possible.
>> 
>> 
>>> And you can see blk_mq_requeue_request() is called by lots of drivers,
>>> especially it is often used in error handler, see SCSI's example.
>>> 
>>>> consequence, I/O schedulers may get the same request inserted again,
>>>> even several times, without a finish_request invoked on that request
>>>> before each re-insertion.
>>>> 
>> 
>> ...
>> 
>>>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
>>>>    .ops.mq = {


Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-23 Thread Paolo Valente


> On 23 Feb 2018, at 16:07, Ming Lei <ming@redhat.com> wrote:
> 
> Hi Paolo,
> 
> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
> 
> No, this behaviour isn't related with commit a6a252e64914, and
> it has been there since blk_mq_requeue_request() is introduced.
> 

Hi Ming,
actually, we wrote the above statement after simply following the call
chain that led to the failure.  In this respect, the change in commit
a6a252e64914:

 static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
+  bool has_sched,
   struct request *rq)
 {
-   if (rq->tag == -1) {
+   /* dispatch flush rq directly */
+   if (rq->rq_flags & RQF_FLUSH_SEQ) {
+   spin_lock(&hctx->lock);
+   list_add(&rq->queuelist, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+   return true;
+   }
+
+   if (has_sched) {
rq->rq_flags |= RQF_SORTED;
-   return false;
+   WARN_ON(rq->tag != -1);
}
 
-   /*
-* If we already have a real request tag, send directly to
-* the dispatch list.
-*/
-   spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
-   return true;
+   return false;
 }

makes blk_mq_sched_bypass_insert return false for all non-flush
requests.  From that, the anomaly described in our commit follows, for
bfq, or any stateful scheduler that waits for the completion of requests
that passed through it.  I'm elaborating again a little bit on this in
my replies to your next points below.

I don't mean that this change is an error; it simply puts a stateful
scheduler in an inconsistent state, unless the scheduler properly
handles the requeue that precedes the re-insertion into the
scheduler.

If this clarifies the situation but there is still some misleading
statement in the commit, just help me understand better, and I'll be
glad to rectify it somehow, if possible.


> And you can see blk_mq_requeue_request() is called by lots of drivers,
> especially it is often used in error handler, see SCSI's example.
> 
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 

...

>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
>>  .ops.mq = {
>>  .limit_depth= bfq_limit_depth,
>>  .prepare_request= bfq_prepare_request,
>> -.finish_request = bfq_finish_request,
>> +.requeue_request= bfq_finish_requeue_request,
>> +.finish_request = bfq_finish_requeue_request,
>>  .exit_icq   = bfq_exit_icq,
>>  .insert_requests= bfq_insert_requests,
>>  .dispatch_request   = bfq_dispatch_request,
> 
> This way may not be correct since blk_mq_sched_requeue_request() can be
> called for one request which won't enter io scheduler.
> 

Exactly, there are two cases: requeues that lead to subsequent
re-insertions, and requeues that don't.  The function
bfq_finish_requeue_request handles both, and both need to be handled,
to inform bfq that it no longer has to wait for the completion of rq.

One special case is when bfq_finish_requeue_request gets invoked even
if rq has nothing to do with any scheduler.  In that case,
bfq_finish_requeue_request exits immediately.
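To illustrate the structure I have in mind, here is a minimal user-space
sketch of such a combined hook; all types and helpers are mocked up for the
example (this is not the actual BFQ code), but it shows the immediate exit
for requests that never entered the scheduler, and the bookkeeping that both
other cases need:

#include <stddef.h>
#include <stdio.h>

/* Mocked-up per-queue scheduler state. */
struct mock_bfq_queue {
	int dispatched; /* dispatches not yet balanced by a completion/requeue */
};

/* Mocked-up request: bfqq is NULL if rq never entered the scheduler. */
struct mock_request {
	struct mock_bfq_queue *bfqq;
};

/* Invoked both when rq completes and when rq is requeued. */
static void finish_requeue_request(struct mock_request *rq)
{
	struct mock_bfq_queue *bfqq = rq->bfqq;

	/* rq has nothing to do with the scheduler: exit immediately. */
	if (!bfqq)
		return;

	/*
	 * rq was dispatched and is now being completed or requeued: either
	 * way, stop waiting for it, otherwise other queues may starve.
	 */
	bfqq->dispatched--;
	rq->bfqq = NULL; /* forget the association until a new prepare */
}

int main(void)
{
	struct mock_bfq_queue q = { .dispatched = 1 };
	struct mock_request rq = { .bfqq = &q };

	finish_requeue_request(&rq); /* e.g., a requeue: state balanced again */
	printf("dispatched = %d\n", q.dispatched);
	return 0;
}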


> __blk_mq_requeue_request() is called for two cases:
> 
>   - one is that the requeued request is added to hctx->dispatch, such
>   as blk_mq_dispatch_rq_list()

yes

>   - another case is that the request is requeued to io scheduler, such as
>   blk_mq_requeue_request().
> 

yes

> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called
> since it is nothing to do with scheduler,

No, if that rq has been inserted into and then extracted from the scheduler
through a dispatch_request, then it does have to do with the scheduler.  The
scheduler is stateful, and keeps state for rq, because it must do so until a
completion or a requeue arrives.  In particular, bfq may decide that no other
bfq_queues must be served before rq has been completed, because this is
necessary to preserve its target service guarantees.  If bfq is not informed,
either through its completion hook or its requeue hook, then it will wait
forever.

> seems we only need to do that
> for 2nd case.
> 

> So

Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-11 Thread Paolo Valente


> On 10 Feb 2018, at 09:29, Oleksandr Natalenko wrote:
> 
> Hi.
> 
> On Friday, 9 February 2018, 18:29:39 CET, Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
> In addition to this, I think it would be worth considering CC'ing Greg
> to pull this fix into the 4.15 stable tree.
>> 
>> This isn't one he can cherry-pick, some munging required, in which case
>> he usually wants a properly tested backport.
>> 
>>  -Mike
> 
> Maybe, this could be a good opportunity to push all the pending BFQ patches 
> into the stable 4.15 branch? Because IIUC currently BFQ in 4.15 is just 
> unusable.
> 
> Paolo?
> 

Of course it's ok with me, and thanks, Oleksandr, for proposing this.  These
commits should apply cleanly on 4.15, and FWIW they have been tested, by me
and by BFQ users, on 4.15 too over these months.

Thanks,
Paolo

> ---
> 
> block, bfq: add requeue-request hook
> bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED
> block, bfq: release oom-queue ref to root group on exit
> block, bfq: put async queues for root bfq groups too
> block, bfq: limit sectors served with interactive weight raising
> block, bfq: limit tags for writes and async I/O
> block, bfq: increase threshold to deem I/O as random
> block, bfq: remove superfluous check in queue-merging setup
> block, bfq: let a queue be merged only shortly after starting I/O
> block, bfq: check low_latency flag in bfq_bfqq_save_state()
> block, bfq: add missing rq_pos_tree update on rq removal
> block, bfq: fix occurrences of request finish method's old name
> block, bfq: consider also past I/O in soft real-time detection
> block, bfq: remove batches of confusing ifdefs
> 
> 



Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> On 7 Feb 2018, at 23:18, Jens Axboe <ax...@kernel.dk> wrote:
> 
> On 2/7/18 2:19 PM, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 
>> This fact is the cause of the failure reported in [1]. For an I/O
>> scheduler, every re-insertion of the same re-prepared request is
>> equivalent to the insertion of a new request. For schedulers like
>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>> request, until the finish_request hook is invoked on the request. In
>> particular, BFQ may get stuck, waiting forever for the number of
>> request dispatches, of the same request, to be balanced by an equal
>> number of request completions (while there will be one completion for
>> that request). In this state, BFQ may refuse to serve I/O requests
>> from other bfq_queues. The hang reported in [1] then follows.
>> 
>> However, the above re-prepared requests undergo a requeue, thus the
>> requeue_request hook of the active elevator is invoked for these
>> requests, if set. This commit then addresses the above issue by
>> properly implementing the hook requeue_request in BFQ.
> 
> Thanks, applied.
> 

Hi Jens,
I forgot to add
Tested-by: Oleksandr Natalenko <oleksa...@natalenko.name>
in the patch.

Is it still possible to add it?

Thanks,
Paolo

> -- 
> Jens Axboe




[PATCH BUGFIX V3] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.

[1] https://marc.info/?l=linux-block=15127608676

Reported-by: Ivan Kozik <i...@ludios.org>
Reported-by: Alban Browaeys <alban.browa...@gmail.com>
Tested-by: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.ser...@gmail.com>
---
V2: contains fix to bug reported in [2]
V3: implements the improvement suggested in [3]

[2] https://lkml.org/lkml/2018/2/5/599
[3] https://lkml.org/lkml/2018/2/7/532

 block/bfq-iosched.c | 107 
 1 file changed, 82 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..aeca22d91101 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}

/*
-* We exploit the bfq_finish_request hook to decrement
-* rq_in_driver, but bfq_finish_request will not be
-* invoked on this request. So, to avoid unbalance,
-* just start this request, without incrementing
-* rq_in_driver. As a negative consequence,
-* rq_in_driver is deceptively lower than it should be
-* while this request is in service. This may cause
-* bfq_schedule_dispatch to be invoked uselessly.
+* We exploit the bfq_finish_requeue_request hook to
+* decrement rq_in_driver, but
+* bfq_finish_requeue_request will not be invoked on
+* this request. So, to avoid unbalance, just start
+* this request, without incrementing rq_in_driver. As
+* a negative consequence, rq_in_driver is deceptively
+* lower than it should be while this request is in
+* service. This may cause bfq_schedule_dispatch to be
+* invoked uselessly.
 *
 * As for implementing an exact solution, the
-* bfq_finish_request hook, if defined, is probably
-* invoked also on this request. So, by exploiting
-* this hook, we could 1) increment rq_in_driver here,
-* and 2) decrement it in bfq_finish_request. Such a
-* solution would let the value of the counter be
-* always accurate, but it would entail using an extra
-* interface function. This cost seems higher than the
-* benefit, being the frequency of non-elevator-private
+* bfq_finish_requeue_request hook, if defined, is
+* probably invoked also on this request. So, by
+* exploiting this hook, we could 1) increment
+* rq_in_driver here, and 2) decrement it in
+* bfq_finish_requeue_request. Such a solution would
+* let the value of the counter be always accurate,
+* but it would entail using an extra interface
+* function. This cost seems higher than the benefit,
+* being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
request_queue *q,
   unsigned int cmd_flags) {}
 #endif

+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
    bool at_head)
 {
@@ -4541,6 +4545,18 @@ static void bfq_insert_requ

Re: [PATCH BUGFIX V2 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> On 7 Feb 2018, at 19:06, Jens Axboe <ax...@kernel.dk> wrote:
> 
> On 2/7/18 11:00 AM, Paolo Valente wrote:
>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
>> be re-inserted into the active I/O scheduler for that device. As a
>> consequence, I/O schedulers may get the same request inserted again,
>> even several times, without a finish_request invoked on that request
>> before each re-insertion.
>> 
>> This fact is the cause of the failure reported in [1]. For an I/O
>> scheduler, every re-insertion of the same re-prepared request is
>> equivalent to the insertion of a new request. For schedulers like
>> mq-deadline or kyber, this fact causes no harm. In contrast, it
>> confuses a stateful scheduler like BFQ, which keeps state for an I/O
>> request, until the finish_request hook is invoked on the request. In
>> particular, BFQ may get stuck, waiting forever for the number of
>> request dispatches, of the same request, to be balanced by an equal
>> number of request completions (while there will be one completion for
>> that request). In this state, BFQ may refuse to serve I/O requests
>> from other bfq_queues. The hang reported in [1] then follows.
>> 
>> However, the above re-prepared requests undergo a requeue, thus the
>> requeue_request hook of the active elevator is invoked for these
>> requests, if set. This commit then addresses the above issue by
>> properly implementing the hook requeue_request in BFQ.
>> 
>> [1] https://marc.info/?l=linux-block=15127608676
>> 
>> Reported-by: Ivan Kozik <i...@ludios.org>
>> Reported-by: Alban Browaeys <alban.browa...@gmail.com>
>> Tested-by: Mike Galbraith <efa...@gmx.de>
>> Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
>> Signed-off-by: Serena Ziviani <ziviani.ser...@gmail.com>
>> ---
>> block/bfq-iosched.c | 109 
>> 
>> 1 file changed, 84 insertions(+), 25 deletions(-)
>> 
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 47e6ec7427c4..21e6b9e45638 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
>> blk_mq_hw_ctx *hctx)
>>  }
>> 
>>  /*
>> - * We exploit the bfq_finish_request hook to decrement
>> - * rq_in_driver, but bfq_finish_request will not be
>> - * invoked on this request. So, to avoid unbalance,
>> - * just start this request, without incrementing
>> - * rq_in_driver. As a negative consequence,
>> - * rq_in_driver is deceptively lower than it should be
>> - * while this request is in service. This may cause
>> - * bfq_schedule_dispatch to be invoked uselessly.
>> + * We exploit the bfq_finish_requeue_request hook to
>> + * decrement rq_in_driver, but
>> + * bfq_finish_requeue_request will not be invoked on
>> + * this request. So, to avoid unbalance, just start
>> + * this request, without incrementing rq_in_driver. As
>> + * a negative consequence, rq_in_driver is deceptively
>> + * lower than it should be while this request is in
>> + * service. This may cause bfq_schedule_dispatch to be
>> + * invoked uselessly.
>>   *
>>   * As for implementing an exact solution, the
>> - * bfq_finish_request hook, if defined, is probably
>> - * invoked also on this request. So, by exploiting
>> - * this hook, we could 1) increment rq_in_driver here,
>> - * and 2) decrement it in bfq_finish_request. Such a
>> - * solution would let the value of the counter be
>> - * always accurate, but it would entail using an extra
>> - * interface function. This cost seems higher than the
>> - * benefit, being the frequency of non-elevator-private
>> + * bfq_finish_requeue_request hook, if defined, is
>> + * probably invoked also on this request. So, by
>> + * exploiting this hook, we could 1) increment
>> + * rq_in_driver here, and 2) decrement it in
>> + * bfq_finish_requeue_request. Such a solution would
>> + * let the value of the counter be always accurate,
>> + * but it would entail using an extra interface
>> + * function. This cost seems higher than the ben

[PATCH BUGFIX V2 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.

[1] https://marc.info/?l=linux-block=15127608676

Reported-by: Ivan Kozik <i...@ludios.org>
Reported-by: Alban Browaeys <alban.browa...@gmail.com>
Tested-by: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.ser...@gmail.com>
---
 block/bfq-iosched.c | 109 
 1 file changed, 84 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..21e6b9e45638 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}
 
/*
-* We exploit the bfq_finish_request hook to decrement
-* rq_in_driver, but bfq_finish_request will not be
-* invoked on this request. So, to avoid unbalance,
-* just start this request, without incrementing
-* rq_in_driver. As a negative consequence,
-* rq_in_driver is deceptively lower than it should be
-* while this request is in service. This may cause
-* bfq_schedule_dispatch to be invoked uselessly.
+* We exploit the bfq_finish_requeue_request hook to
+* decrement rq_in_driver, but
+* bfq_finish_requeue_request will not be invoked on
+* this request. So, to avoid unbalance, just start
+* this request, without incrementing rq_in_driver. As
+* a negative consequence, rq_in_driver is deceptively
+* lower than it should be while this request is in
+* service. This may cause bfq_schedule_dispatch to be
+* invoked uselessly.
 *
 * As for implementing an exact solution, the
-* bfq_finish_request hook, if defined, is probably
-* invoked also on this request. So, by exploiting
-* this hook, we could 1) increment rq_in_driver here,
-* and 2) decrement it in bfq_finish_request. Such a
-* solution would let the value of the counter be
-* always accurate, but it would entail using an extra
-* interface function. This cost seems higher than the
-* benefit, being the frequency of non-elevator-private
+* bfq_finish_requeue_request hook, if defined, is
+* probably invoked also on this request. So, by
+* exploiting this hook, we could 1) increment
+* rq_in_driver here, and 2) decrement it in
+* bfq_finish_requeue_request. Such a solution would
+* let the value of the counter be always accurate,
+* but it would entail using an extra interface
+* function. This cost seems higher than the benefit,
+* being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
request_queue *q,
   unsigned int cmd_flags) {}
 #endif
 
+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
   bool at_head)
 {
@@ -4541,6 +4545,20 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		else
 			list_add_tail(&rq->queuelist, &bfqd->dispatch);
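
In short, the change wires BFQ's per-request teardown to the requeue
path as well as to the completion path. A minimal sketch of that
wiring follows, using the hook and function names discussed in this
thread; the ops table and the function body are an illustration of
the idea, not the verbatim patch:

static void bfq_prepare_request(struct request *rq, struct bio *bio);

static void bfq_finish_requeue_request(struct request *rq)
{
	/*
	 * Release/readjust the per-request state set up by
	 * bfq_prepare_request() and by dispatch, whether rq is
	 * completing or being requeued (body elided).
	 */
}

static struct elevator_type iosched_bfq_mq = {
	.ops.mq = {
		.prepare_request	= bfq_prepare_request,
		.requeue_request	= bfq_finish_requeue_request,
		.finish_request		= bfq_finish_requeue_request,
		/* remaining hooks elided */
	},
	.elevator_name		= "bfq",
};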

[PATCH BUGFIX V2 0/1] block, bfq: handle requeues of I/O requests

2018-02-07 Thread Paolo Valente
Hi,
fixed version, after bug reports and fixes in [1].

Thanks,
Paolo

[1] https://lkml.org/lkml/2018/2/5/599


Paolo Valente (1):
  block, bfq: add requeue-request hook

 block/bfq-iosched.c | 109 
 1 file changed, 84 insertions(+), 25 deletions(-)

--
2.15.1


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 16:50, Oleksandr Natalenko 
> <oleksa...@natalenko.name> ha scritto:
> 
> Hi.
> 
> 07.02.2018 12:27, Paolo Valente wrote:
>> Hi Oleksandr, Holger,
>> before I prepare a V2 candidate patch, could you please test my
>> instrumentation patch too, with the above change made.  For your
>> convenience, I have attached a compressed archive with both the
>> instrumentation patch and a patch making the above change.
>> Crossing my fingers,
>> Paolo
> 
> With all three patches applied I cannot reproduce neither cfdisk hang nor 
> some kind of boom.
> 
> Thank you.
> 

Thank you, Oleksandr. I'm now ready to prepare a new version of this patch.

Paolo

> Oleksandr



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 12:03, Mike Galbraith <efa...@gmx.de> ha 
> scritto:
> 
> On Wed, 2018-02-07 at 11:27 +0100, Paolo Valente wrote:
>> 
>> 2. Could you please turn that BUG_ON into:
>> if (!(rq->rq_flags & RQF_ELVPRIV))
>>  return;
>> and see what happens?
> 
> That seems to make it forgets how to make boom.
> 

Great! This deserves to be celebrated with a packet of Ikea chips.

Just to be certain, before submitting a new patch: you changed *only*
the BUG_ON at line 4742, on top of my instrumentation patch.

Thanks,
Paolo


>   -Mike



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 07 feb 2018, alle ore 11:15, Mike Galbraith <efa...@gmx.de> ha 
> scritto:
> 
> On Wed, 2018-02-07 at 10:45 +0100, Paolo Valente wrote:
>> 
>>> Il giorno 07 feb 2018, alle ore 10:23, Mike Galbraith <efa...@gmx.de> ha 
>>> scritto:
>>> 
>>> On Wed, 2018-02-07 at 10:08 +0100, Paolo Valente wrote:
>>>> 
>>>> The first piece of information I need is whether this failure happens
>>>> even without "BFQ hierarchical scheduling support".
>>> 
>>> I presume you mean BFQ_GROUP_IOSCHED, which I do not have enabled.
>>> 
>> 
>> Great (so to speak), this saves us one step.
>> 
>> So, here's my next request for help: please apply the attached patch
>> (compressed to preserve it from my email client) and retry. It adds
>> several anomaly checks. I hope I have not added any false-positive
>> check.
> 
> kernel BUG at block/bfq-iosched.c:4742!
> 
> 4742 BUG_ON(!(rq->rq_flags & RQF_ELVPRIV));

Oh my, this is as crazy as it is, fortunately, easy to fix.  The problem is
that this is easy to fix in bfq, but it increases the doubts I have
expressed in my cover letter: is it ok that, in blk-mq, the functions
of an elevator may get invoked, without control, on requests that do
not belong to that elevator?

Anyway, two requests, Mike, if you haven't had enough already:

1. Could you paste a stack trace for this OOPS, just to understand how we
get there?

2. Could you please turn that BUG_ON into:
if (!(rq->rq_flags & RQF_ELVPRIV))
return;
and see what happens?

Thanks a lot,
Paolo
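
A minimal sketch of the guard suggested in point 2, as it would sit at
the top of the instrumented hook; only the check itself comes from
this thread, the surrounding function is elided:

static void bfq_finish_requeue_request(struct request *rq)
{
	/*
	 * Was: BUG_ON(!(rq->rq_flags & RQF_ELVPRIV));
	 * A request without elevator-private data never went through
	 * bfq_prepare_request(), so there is nothing for bfq to undo.
	 */
	if (!(rq->rq_flags & RQF_ELVPRIV))
		return;

	/* ... normal finish/requeue handling follows ... */
}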

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-07 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 19:35, Oleksandr Natalenko 
> <oleksa...@natalenko.name> ha scritto:
> 
> Hi.
> 
> 06.02.2018 15:50, Paolo Valente wrote:
>> Could you please do a
>> gdb /block/bfq-iosched.o # or vmlinux.o if bfq is builtin
>> list *(bfq_finish_requeue_request+0x54)
>> list *(bfq_put_queue+0x10b)
>> for me?
> 
> Fresh crashes and gdb output are given below. A side note: it is harder to 
> trigger things on a slower machine, so clearly some timing-bounded race 
> condition there.
> 
> [  134.276548] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [  134.283699] IP: blk_flush_complete_seq+0x20a/0x300
> [  134.288163] PGD 0 P4D 0
> [  134.291284] Oops: 0002 [#1] PREEMPT SMP PTI
> [  134.293842] Modules linked in: bochs_drm ttm nls_iso8859_1 kvm_intel 
> nls_cp437 vfat fat drm_kms_helper kvm drm irqbypass psmouse iTCO_wdt ppdev 
> iTCO_vendor_support input_leds led_class i2c_i801 parport_pc joydev intel_agp 
> parport intel_gtt mousedev lpc_ich rtc_cmos syscopyarea evdev sysfillrect 
> agpgart qemu_fw_cfg mac_hid sysimgblt fb_sys_fops sch_fq_codel ip_tables 
> x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c 
> crc32c_generic dm_crypt algif_skcipher af_alg dm_mod hid_generic usbhid 
> raid10 hid md_mod sr_mod sd_mod cdrom uhci_hcd ehci_pci serio_raw 
> crct10dif_pclmul crc32_pclmul atkbd crc32c_intel libps2 ghash_clmulni_intel 
> pcbc xhci_pci xhci_hcd ehci_hcd aesni_intel aes_x86_64 crypto_simd 
> glue_helper cryptd ahci libahci libata usbcore usb_common i8042 serio 
> virtio_scsi
> [  134.340606]  scsi_mod virtio_blk virtio_net virtio_pci virtio_ring virtio
> [  134.345803] CPU: 0 PID: 178 Comm: kworker/0:1H Not tainted 4.15.0-pf2 #1
> [  134.350309] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [  134.355106] Workqueue: kblockd blk_mq_run_work_fn
> [  134.359034] RIP: 0010:blk_flush_complete_seq+0x20a/0x300
> [  134.367647] RSP: :88000f803ce8 EFLAGS: 00010082
> [  134.371632] RAX: 88000d9755c0 RBX: 88000d9755a0 RCX: 
> 88000c9b39a8
> [  134.375675] RDX:  RSI: 88000d9755d0 RDI: 
> 88000c9b3900
> [  134.381068] RBP: 88000d21a990 R08: 88000d9755b0 R09: 
> 
> [  134.386302] R10: 8800058ff100 R11: 0002000b R12: 
> 
> [  134.396915] R13: 88000d9755f0 R14: 0046 R15: 
> 88000d9755a0
> [  134.401140] FS:  () GS:88000f80() 
> knlGS:
> [  134.407361] CS:  0010 DS:  ES:  CR0: 80050033
> [  134.412384] CR2:  CR3: 04008006 CR4: 
> 001606f0
> [  134.416913] Call Trace:
> [  134.420251]  
> [  134.427731]  mq_flush_data_end_io+0xb3/0xf0
> [  134.431848]  scsi_end_request+0x90/0x1e0 [scsi_mod]
> [  134.436424]  scsi_io_completion+0x237/0x650 [scsi_mod]
> [  134.440109]  __blk_mq_complete_request+0xc4/0x150
> [  134.444517]  ? scsi_mq_get_budget+0x110/0x110 [scsi_mod]
> [  134.449603]  ata_scsi_qc_complete+0x8d/0x430 [libata]
> [  134.458487]  ata_qc_complete_multiple+0x8d/0xe0 [libata]
> [  134.461726]  ahci_handle_port_interrupt+0xc9/0x5b0 [libahci]
> [  134.466420]  ahci_handle_port_intr+0x54/0xb0 [libahci]
> [  134.470128]  ahci_single_level_irq_intr+0x3b/0x60 [libahci]
> [  134.473327]  __handle_irq_event_percpu+0x44/0x1e0
> [  134.476700]  handle_irq_event_percpu+0x30/0x70
> [  134.480227]  handle_irq_event+0x37/0x60
> [  134.490341]  handle_edge_irq+0x107/0x1c0
> [  134.492876]  handle_irq+0x1f/0x30
> [  134.495497]  do_IRQ+0x4d/0xe0
> [  134.497963]  common_interrupt+0xa2/0xa2
> [  134.500877]  
> [  134.503129] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x40
> [  134.506782] RSP: :c9307d30 EFLAGS: 0293 ORIG_RAX: 
> ffdb
> [  134.511845] RAX: 0001 RBX: 88000db04000 RCX: 
> 0008
> [  134.523019] RDX: 0100 RSI: 0293 RDI: 
> 0293
> [  134.527968] RBP: 0293 R08:  R09: 
> 0040
> [  134.532289] R10: 008e66bf R11: 0002000b R12: 
> 
> [  134.536376] R13: 88000d26a000 R14: 88000b99ac48 R15: 
> 88000d26a000
> [  134.541046]  ata_scsi_queuecmd+0xa0/0x210 [libata]
> [  134.544363]  scsi_dispatch_cmd+0xe8/0x260 [scsi_mod]
> [  134.552883]  scsi_queue_rq+0x4cf/0x560 [scsi_mod]
> [  134.556811]  blk_mq_dispatch_rq_list+0x8f/0x4c0
> [  134.559741]  blk_mq_sched_dispatch_requests+0x105/0x190
> [  134.563253]  __blk_mq_run_hw_queue+0x80/0x90
> [  134.565540]  process_one_work+0x1df/0x420
> [  134.568041]  worker_thread+0x2b/0x3d0
> [  1

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 14:40, Holger Hoffstätte 
>  ha scritto:
> 
> 
> The plot thickens!
> 

Yep, the culprit seems clearer, though ...

> Just as I was about to post that I didn't have any problems - because
> I didn't have any - I decided to do a second test, activated bfq on my
> workstation, on a hunch typed "sync" and .. the machine locked up, hard.
> 
> Rebooted, activated bfq, typed sync..sync hangs. Luckily this time
> a second terminal was still alive, so I could capture a trace for
> your enjoyment:
> 
> Feb  6 14:28:17 ragnarok kernel: io scheduler bfq registered
> Feb  6 14:28:20 ragnarok kernel: BUG: unable to handle kernel NULL pointer 
> dereference at 0030
> Feb  6 14:28:20 ragnarok kernel: IP: bfq_put_queue+0x10b/0x130 [bfq]
> Feb  6 14:28:20 ragnarok kernel: PGD 0 P4D 0 
> Feb  6 14:28:20 ragnarok kernel: Oops:  [#1] SMP PTI
> Feb  6 14:28:20 ragnarok kernel: Modules linked in: bfq lz4 lz4_compress 
> lz4_decompress nfs lockd grace sunrpc autofs4 sch_fq_codel it87 hwmon_vid 
> x86_pkg_temp_thermal snd_hda_codec_realtek coretemp radeon crc32_pclmul 
> snd_hda_codec_generic crc32c_intel pcbc snd_hda_codec_hdmi i2c_algo_bit 
> aesni_intel drm_kms_helper aes_x86_64 uvcvideo syscopyarea crypto_simd 
> snd_hda_intel sysfillrect cryptd snd_usb_audio sysimgblt videobuf2_vmalloc 
> glue_helper fb_sys_fops snd_hda_codec snd_hwdep videobuf2_memops ttm 
> videobuf2_v4l2 snd_usbmidi_lib videobuf2_core snd_rawmidi snd_hda_core drm 
> snd_seq_device videodev snd_pcm i2c_i801 usbhid snd_timer i2c_core snd 
> backlight soundcore r8169 parport_pc mii parport
> Feb  6 14:28:20 ragnarok kernel: CPU: 0 PID: 4 Comm: kworker/0:0H Not tainted 
> 4.14.18 #1
> Feb  6 14:28:20 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. 
> P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
> Feb  6 14:28:20 ragnarok kernel: Workqueue: kblockd blk_mq_requeue_work
> Feb  6 14:28:20 ragnarok kernel: task: 88060395a1c0 task.stack: 
> c9044000
> Feb  6 14:28:20 ragnarok kernel: RIP: 0010:bfq_put_queue+0x10b/0x130 [bfq]
> Feb  6 14:28:20 ragnarok kernel: RSP: 0018:c9047ca0 EFLAGS: 00010286
> Feb  6 14:28:20 ragnarok kernel: RAX: 0008 RBX: 8806023db690 
> RCX: 
> Feb  6 14:28:20 ragnarok kernel: RDX:  RSI: 880601bb39b0 
> RDI: 880601a56400
> Feb  6 14:28:20 ragnarok kernel: RBP: 01bb3980 R08: 0053 
> R09: 8806023db690
> Feb  6 14:28:20 ragnarok kernel: R10: 1dd0f11e R11: 080a011b 
> R12: 880601a56400
> Feb  6 14:28:20 ragnarok kernel: R13: 8806023dbed0 R14: 0053 
> R15: 
> Feb  6 14:28:20 ragnarok kernel: FS:  () 
> GS:88061f40() knlGS:
> Feb  6 14:28:20 ragnarok kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Feb  6 14:28:20 ragnarok kernel: CR2: 0030 CR3: 0200a002 
> CR4: 000606f0
> Feb  6 14:28:20 ragnarok kernel: Call Trace:
> Feb  6 14:28:20 ragnarok kernel:  bfq_finish_requeue_request+0x4b/0x370 [bfq]
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_requeue_request+0x57/0x130
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_dispatch_rq_list+0x1b3/0x510
> Feb  6 14:28:20 ragnarok kernel:  ? __bfq_bfqd_reset_in_service+0x20/0x70 
> [bfq]
> Feb  6 14:28:20 ragnarok kernel:  ? bfq_bfqq_expire+0x212/0x740 [bfq]
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_sched_dispatch_requests+0xf0/0x170
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_run_hw_queue+0x4e/0x90
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_delay_run_hw_queue+0x73/0x80
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queue+0x53/0x150
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queues+0x3a/0x50
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_requeue_work+0x104/0x110
> Feb  6 14:28:20 ragnarok kernel:  process_one_work+0x1d4/0x3d0
> Feb  6 14:28:20 ragnarok kernel:  worker_thread+0x2b/0x3c0
> Feb  6 14:28:20 ragnarok kernel:  ? process_one_work+0x3d0/0x3d0
> Feb  6 14:28:20 ragnarok kernel:  kthread+0x117/0x130
> Feb  6 14:28:20 ragnarok kernel:  ? kthread_create_on_node+0x40/0x40
> Feb  6 14:28:20 ragnarok kernel:  ret_from_fork+0x1f/0x30
> Feb  6 14:28:20 ragnarok kernel: Code: c1 e8 06 83 e0 01 48 83 f8 01 45 19 f6 
> e8 ce 3a 00 00 41 83 e6 ee 48 89 c7 41 83 c6 53 e8 9e 3a 00 00 49 89 d9 45 89 
> f0 44 89 f9 <48> 8b 70 28 48 c7 c2 d8 00 25 a0 55 4c 89 ef e8 11 ba ea e0 8b 
> Feb  6 14:28:20 ragnarok kernel: RIP: bfq_put_queue+0x10b/0x130 [bfq] RSP: 
> c9047ca0
> Feb  6 14:28:20 ragnarok kernel: CR2: 0030
> Feb  6 14:28:20 ragnarok kernel: ---[ end trace 8b782ace30a4e7d8 ]---
> 

Same request: please
gdb /block/bfq-iosched.o
list *(bfq_finish_requeue_request+0x4b)
list *(bfq_put_queue+0x10b)

Thanks,
Paolo


> Yes, this is 4.14.x but with most-of block/4.16, rock-solid otherwise (in 
> daily use).
> Looks like there is something wrong with this patch after all..
> 
> cheers
> Holger
> 



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 15:07, Oleksandr Natalenko 
>  ha scritto:
> 
> Hi.
> 
> 06.02.2018 14:46, Mike Galbraith wrote:
>>> Sorry for the noise, but just to make it clear, are we talking about
>>> "deadline" or "mq-deadline" now?
>> mq-deadline.
> 
> Okay, I've spent a little bit more time on stressing the VM with BFQ + this 
> patch enabled, and managed to get it crashed relatively easy just by 
> traversing through all the files in the /usr and reading them. 3 stacktraces 
> from different boots are below:
> 
> ===
> [   33.147909] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [   33.156874] IP: blk_flush_complete_seq+0x20a/0x300
> [   33.160400] PGD 0 P4D 0
> [   33.162682] Oops: 0002 [#1] PREEMPT SMP PTI
> [   33.166091] Modules linked in: crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc nls_iso8859_1 nls_cp437 aesni_intel vfat aes_x86_64 
> crypto_simd fat glue_helper cryptd bochs_drm ttm iTCO_wdt drm_kms_helper 
> iTCO_vendor_support ppdev psmouse drm input_leds led_class syscopyarea 
> intel_agp parport_pc joydev parport evdev rtc_cmos mousedev intel_gtt mac_hid 
> i2c_i801 sysfillrect lpc_ich sysimgblt agpgart fb_sys_fops qemu_fw_cfg 
> sch_fq_codel ip_tables x_tables hid_generic usbhid hid xfs libcrc32c 
> crc32c_generic dm_mod raid10 md_mod sr_mod cdrom sd_mod uhci_hcd ahci libahci 
> xhci_pci serio_raw atkbd libata ehci_pci virtio_net libps2 xhci_hcd ehci_hcd 
> crc32c_intel scsi_mod virtio_pci usbcore virtio_ring usb_common virtio i8042 
> serio
> [   33.226238] CPU: 0 PID: 0 Comm: BFS/0 Not tainted 4.15.0-pf2 #1
> [   33.232823] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [   33.245062] RIP: 0010:blk_flush_complete_seq+0x20a/0x300
> [   33.247781] RSP: 0018:88001f803ed0 EFLAGS: 00010086
> [   33.250184] RAX: 8800199b5e00 RBX: 8800199b5de0 RCX: 
> 8800197f72a8
> [   33.253220] RDX:  RSI: 8800199b5e10 RDI: 
> 8800197f7200
> [   33.257429] RBP: 8800194090a0 R08: 8800199b5df0 R09: 
> 
> [   33.260756] R10: 88001df2b600 R11: 88001f803ce8 R12: 
> 
> [   33.268840] R13: 8800199b5e30 R14: 0046 R15: 
> 8800199b5de0
> [   33.272722] FS:  () GS:88001f80() 
> knlGS:
> [   33.276798] CS:  0010 DS:  ES:  CR0: 80050033
> [   33.278981] CR2:  CR3: 04008003 CR4: 
> 003606f0
> [   33.282078] DR0:  DR1:  DR2: 
> 
> [   33.285566] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   33.289322] Call Trace:
> [   33.291094]  
> [   33.294208]  mq_flush_data_end_io+0xb3/0xf0
> [   33.301226]  scsi_end_request+0x90/0x1e0 [scsi_mod]
> [   33.303660]  scsi_io_completion+0x237/0x650 [scsi_mod]
> [   33.306833]  flush_smp_call_function_queue+0x7c/0xf0
> [   33.311369]  smp_call_function_single_interrupt+0x32/0xf0
> [   33.314245]  call_function_single_interrupt+0xa2/0xb0
> [   33.316755]  
> [   33.318365] RIP: 0010:native_safe_halt+0x2/0x10
> [   33.321988] RSP: 0018:82003ea0 EFLAGS: 0202 ORIG_RAX: 
> ff04
> [   33.324995] RAX:  RBX:  RCX: 
> 
> [   33.334532] RDX:  RSI: 81e43850 RDI: 
> 81e43a9d
> [   33.336643] RBP: 82128080 R08: 000e7441b6ce R09: 
> 88001a4c2c00
> [   33.339282] R10:  R11: 88001bceb7f0 R12: 
> 81ea5c92
> [   33.344597] R13:  R14:  R15: 
> 1e07cd69
> [   33.346526]  default_idle+0x18/0x130
> [   33.347929]  do_idle+0x167/0x1d0
> [   33.349201]  cpu_startup_entry+0x6f/0x80
> [   33.350553]  start_kernel+0x4c2/0x4e2
> [   33.351822]  secondary_startup_64+0xa5/0xb0
> [   33.354470] Code: 39 d0 0f 84 8f 00 00 00 48 8b 97 b0 00 00 00 49 c1 e0 04 
> 45 31 e4 48 8b b7 a8 00 00 00 49 01 d8 48 8d 8f a8 00 00 00 48 89 56 08 <48> 
> 89 32 49 8b 50 18 49 89 48 18 48 89 87 a8 00 00 00 48 89 97
> [   33.364997] RIP: blk_flush_complete_seq+0x20a/0x300 RSP: 88001f803ed0
> [   33.366759] CR2: 
> [   33.367921] ---[ end trace 5179c1a9cbc4eaa2 ]---
> [   33.369413] Kernel panic - not syncing: Fatal exception in interrupt
> [   33.372089] Kernel Offset: disabled
> [   33.374413] ---[ end Kernel panic - not syncing: Fatal exception in 
> interrupt
> ===
> 
> ===
> [   30.937747] BUG: unable to handle kernel NULL pointer dereference at 
> 0028
> [   30.940984] IP: bfq_put_queue+0x10b/0x130
> [   30.941609] PGD 0 P4D 0
> [   30.941976] Oops:  [#1] PREEMPT SMP PTI
> [   30.942560] Modules linked in: crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel nls_iso8859_1 bochs_drm pcbc nls_cp437 vfat ttm 
> aesni_intel fat aes_x86_64 crypto_simd glue_helper cryptd drm_kms_helper 
> iTCO_wdt drm ppdev iTCO_vendor_support 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 13:26, Paolo Valente 
> <paolo.vale...@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 06 feb 2018, alle ore 12:57, Mike Galbraith <efa...@gmx.de> ha 
>> scritto:
>> 
>> On Tue, 2018-02-06 at 10:38 +0100, Paolo Valente wrote:
>>> 
>>> Hi Mike,
>>> as you can imagine, I didn't get any failure in my pre-submission
>>> tests on this patch.  In addition, it is not that easy to link this
>>> patch, which just adds some internal bfq housekeeping in case of a
>>> requeue, with a corruption of external lists for general I/O
>>> management.
>>> 
>>> In this respect, as Oleksandr comments point out, by switching from
>>> cfq to bfq, you switch between much more than two schedulers.  Anyway,
>>> who knows ...
>> 
>> Not me.  Box seems to be fairly sure that it is bfq.
> 
> Yeah, sorry for the too short comment: what I meant is that cfq (and
> deadline) are in legacy blk, while bfq is in blk-mq.  So, to use bfq,
> you must also switch from legacy-blk I/O stack to blk-mq I/O stack.
> 
> 
>> Twice again box
>> went belly up on me in fairly short order with bfq, but seemed fine
>> with deadline.  I'm currently running deadline again, and box again
>> seems solid, thought I won't say _is_ solid until it's been happily
>> trundling along with deadline for a quite a bit longer.
>> 
> 
> As Oleksadr asked too, is it deadline or mq-deadline?
> 
>> I was ssh'd in during the last episode, got this out.  I should be
>> getting crash dumps, but seems kdump is only working intermittently
>> atm.  I did get one earlier, but 3 of 4 times not.  Hohum.
>> 
>> [  484.179292] BUG: unable to handle kernel paging request at 
>> a0817000
>> [  484.179436] IP: __trace_note_message+0x1f/0xd0
>> [  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
>> [  484.179719] Oops:  [#1] SMP PTI
>> [  484.179861] Dumping ftrace buffer:
>> [  484.180011](ftrace buffer empty)
>> [  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
>> af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
>> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
>> iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
>> nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) 
>> ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) 
>> ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) 
>> x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hda_codec_hdmi(E) 
>> coretemp(E) kvm_intel(E) snd_hda_codec_realtek(E) kvm(E) 
>> snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) sr_mod(E) 
>> snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) 
>> irqbypass(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) 
>> r8169(E)
>> [  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
>> iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
>> shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
>> mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) 
>> thermal(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
>> hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) 
>> hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) 
>> syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) 
>> xhci_pci(E) ehci_pci(E) libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) 
>> drm(E) usbcore(E) video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) 
>> virtio_mmio(E) virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) 
>> mbcache(E) jbd2(E) loop(E) sg(E) dm_multipath(E)
>> [  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
>> scsi_mod(E) efivarfs(E) autofs4(E)
>> [  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
>> 4.15.0.ge237f98-master #609
>> [  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
>> 09/23/2013
>> [  484.181910] Workqueue: kblockd blk_mq_requeue_work
>> [  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
>> [  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
>> [  484.182436] RAX:  RBX: a0817000 RCX: 
>> 8803
>> [  484.182622] RDX: 81bf514d RSI:  RDI: 
>> a0817000
>> [  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 
>> 8803f69cc5d0
>> [  484.182998] R10: 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 12:57, Mike Galbraith <efa...@gmx.de> ha 
> scritto:
> 
> On Tue, 2018-02-06 at 10:38 +0100, Paolo Valente wrote:
>> 
>> Hi Mike,
>> as you can imagine, I didn't get any failure in my pre-submission
>> tests on this patch.  In addition, it is not that easy to link this
>> patch, which just adds some internal bfq housekeeping in case of a
>> requeue, with a corruption of external lists for general I/O
>> management.
>> 
>> In this respect, as Oleksandr comments point out, by switching from
>> cfq to bfq, you switch between much more than two schedulers.  Anyway,
>> who knows ...
> 
> Not me.  Box seems to be fairly sure that it is bfq.

Yeah, sorry for the too short comment: what I meant is that cfq (and
deadline) are in legacy blk, while bfq is in blk-mq.  So, to use bfq,
you must also switch from the legacy-blk I/O stack to the blk-mq I/O stack.


>  Twice again box
> went belly up on me in fairly short order with bfq, but seemed fine
> with deadline.  I'm currently running deadline again, and box again
> seems solid, thought I won't say _is_ solid until it's been happily
> trundling along with deadline for a quite a bit longer.
> 

As Oleksandr asked too, is it deadline or mq-deadline?

> I was ssh'd in during the last episode, got this out.  I should be
> getting crash dumps, but seems kdump is only working intermittently
> atm.  I did get one earlier, but 3 of 4 times not.  Hohum.
> 
> [  484.179292] BUG: unable to handle kernel paging request at a0817000
> [  484.179436] IP: __trace_note_message+0x1f/0xd0
> [  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
> [  484.179719] Oops:  [#1] SMP PTI
> [  484.179861] Dumping ftrace buffer:
> [  484.180011](ftrace buffer empty)
> [  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
> af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
> iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
> nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) 
> xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) 
> nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) x86_pkg_temp_thermal(E) 
> intel_powerclamp(E) snd_hda_codec_hdmi(E) coretemp(E) kvm_intel(E) 
> snd_hda_codec_realtek(E) kvm(E) snd_hda_codec_generic(E) snd_hda_intel(E) 
> snd_hda_codec(E) sr_mod(E) snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) 
> snd_pcm(E) snd_timer(E) irqbypass(E) snd(E) crct10dif_pclmul(E) 
> crc32_pclmul(E) crc32c_intel(E) r8169(E)
> [  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
> iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
> shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
> mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) 
> thermal(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
> hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) 
> usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) 
> sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ehci_pci(E) 
> libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) 
> video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) 
> virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) mbcache(E) jbd2(E) 
> loop(E) sg(E) dm_multipath(E)
> [  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
> scsi_mod(E) efivarfs(E) autofs4(E)
> [  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
> 4.15.0.ge237f98-master #609
> [  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
> 09/23/2013
> [  484.181910] Workqueue: kblockd blk_mq_requeue_work
> [  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
> [  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
> [  484.182436] RAX:  RBX: a0817000 RCX: 
> 8803
> [  484.182622] RDX: 81bf514d RSI:  RDI: 
> a0817000
> [  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 
> 8803f69cc5d0
> [  484.182998] R10: 8803f80b47d0 R11: 1000 R12: 
> 8803f45e8000
> [  484.183185] R13: 000d R14:  R15: 
> 8803fba112c0
> [  484.183372] FS:  () GS:88041ecc() 
> knlGS:
> [  484.183561] CS:  0010 DS:  ES:  CR0: 80050033
> [  484.183747] CR2: a0817000 CR3: 01e0a006 CR4: 
> 001606e0
> [  484.183934] Call Trace:
> [  484.184122]  bfq_put_queue+0xd3/0xe0

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 08:56, Mike Galbraith  ha 
> scritto:
> 
> On Tue, 2018-02-06 at 08:44 +0100, Oleksandr Natalenko wrote:
>> Hi, Paolo.
>> 
>> I can confirm that this patch fixes cfdisk hang for me. I've also tried 
>> to trigger the issue Mike has encountered, but with no luck (maybe, I 
>> wasn't insistent enough, just was doing dd on usb-storage device in the 
>> VM).
> 
> I was doing kbuilds, and it blew up on me twice.  Switching back to cfq
> seemed to confirm it was indeed the patch causing trouble, but that's
> by no means a certainty.
> 

Hi Mike,
as you can imagine, I didn't get any failure in my pre-submission
tests on this patch.  In addition, it is not that easy to link this
patch, which just adds some internal bfq housekeeping in case of a
requeue, with a corruption of external lists for general I/O
management.

In this respect, as Oleksandr's comments point out, by switching from
cfq to bfq you are switching much more than just two schedulers.  Anyway,
who knows ...

Maybe this report will ring a bell with some block-layer expert, or,
hopefully, next feedback will help puzzle this out.

Thanks,
Paolo


>   -Mike



Re: v4.15 and I/O hang with BFQ

2018-02-05 Thread Paolo Valente


> Il giorno 30 gen 2018, alle ore 16:40, Paolo Valente 
> <paolo.vale...@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 30 gen 2018, alle ore 15:40, Ming Lei <ming@redhat.com> ha 
>> scritto:
>> 
>> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>>> Hi.
>>> 
>> ...
>>>  systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert
>>> rq->0
>>>  systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched:
>>> not get rq, 1
>>> cfdisk-408   [000] 13.484220: bfq_insert_requests: insert
>>> rq->1
>>>   kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched:
>>> not get rq, 1
>>> ===
>>> 
>>> Looks the same, right?
>> 
>> Yeah, same with before.
>> 
> 
> Hi guys,
> sorry for the delay with this fix.  We are proceeding very slowly on
> this, because I'm super busy.  Anyway, now I can at least explain in
> more detail the cause that leads to this hang.  Commit 'a6a252e64914
> ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")'
> makes all non-flush re-prepared requests be re-inserted into the I/O
> scheduler.  With this change, I/O schedulers may get the same request
> inserted again, even several times, without a finish_request invoked
> on the request before each re-insertion.
> 
> For the I/O scheduler, every such re-prepared request is equivalent
> to the insertion of a new request. For schedulers like mq-deadline
> or kyber this fact causes no problems. In contrast, it confuses a stateful
> scheduler like BFQ, which preserves states for an I/O request until
> finish_request is invoked on it. In particular, BFQ has no way
> to know that the above re-insertions concern the same, already dispatched
> request. So it may get stuck waiting for the completion of these
> re-inserted requests forever, thus preventing any other queue of
> requests from being served.
> 
> We are trying to address this issue by adding the hook requeue_request
> to bfq interface.
> 
> Unfortunately, with our current implementation of requeue_request in
> place, bfq eventually gets to an incoherent state.  This is apparently
> caused by a requeue of an I/O request, immediately followed by a
> completion of the same request.  This seems rather absurd, and drives
> bfq crazy.  But this is something for which we don't have definite
> results yet.
> 
> We're working on it, sorry again for the delay.
> 

Ok, patch arriving ... Please test it.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Ming
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "bfq-iosched" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to bfq-iosched+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
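
To make the failure mode described in the quoted explanation concrete, here is a
minimal user-space sketch (an illustration only, not BFQ code; all names are made
up). A scheduler that counts one expected completion per insertion can never drain
its accounting if the same request is re-inserted without a finish in between,
which is exactly the hang reported above. The requeue-request hook proposed later
in this thread gives the scheduler a chance to undo that per-insertion state
before the request comes back.

#include <stdio.h>

/*
 * Toy model (user space, not BFQ code): a stateful scheduler that
 * expects one completion per insertion.  If the block layer re-inserts
 * the same re-prepared request without an intervening finish, the
 * counter never returns to zero and the scheduler keeps waiting.
 */
struct toy_sched {
	int in_flight;	/* insertions not yet matched by a completion */
};

static void toy_insert(struct toy_sched *s)   { s->in_flight++; }
static void toy_complete(struct toy_sched *s) { s->in_flight--; }

int main(void)
{
	struct toy_sched s = { 0 };

	toy_insert(&s);		/* request prepared and inserted */
	toy_insert(&s);		/* same request re-prepared and re-inserted */
	toy_insert(&s);		/* ... possibly several times */
	toy_complete(&s);	/* but the device completes it only once */

	printf("in_flight = %d (scheduler stuck if != 0)\n", s.in_flight);
	return 0;
}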




[PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-05 Thread Paolo Valente
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.

[1] https://marc.info/?l=linux-block=15127608676

Reported-by: Ivan Kozik <i...@ludios.org>
Reported-by: Alban Browaeys <alban.browa...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.ser...@gmail.com>
---
 block/bfq-iosched.c | 101 +++-
 1 file changed, 76 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..343452488515 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}
 
/*
-* We exploit the bfq_finish_request hook to decrement
-* rq_in_driver, but bfq_finish_request will not be
-* invoked on this request. So, to avoid unbalance,
-* just start this request, without incrementing
-* rq_in_driver. As a negative consequence,
-* rq_in_driver is deceptively lower than it should be
-* while this request is in service. This may cause
-* bfq_schedule_dispatch to be invoked uselessly.
+* We exploit the bfq_finish_requeue_request hook to
+* decrement rq_in_driver, but
+* bfq_finish_requeue_request will not be invoked on
+* this request. So, to avoid unbalance, just start
+* this request, without incrementing rq_in_driver. As
+* a negative consequence, rq_in_driver is deceptively
+* lower than it should be while this request is in
+* service. This may cause bfq_schedule_dispatch to be
+* invoked uselessly.
 *
 * As for implementing an exact solution, the
-* bfq_finish_request hook, if defined, is probably
-* invoked also on this request. So, by exploiting
-* this hook, we could 1) increment rq_in_driver here,
-* and 2) decrement it in bfq_finish_request. Such a
-* solution would let the value of the counter be
-* always accurate, but it would entail using an extra
-* interface function. This cost seems higher than the
-* benefit, being the frequency of non-elevator-private
+* bfq_finish_requeue_request hook, if defined, is
+* probably invoked also on this request. So, by
+* exploiting this hook, we could 1) increment
+* rq_in_driver here, and 2) decrement it in
+* bfq_finish_requeue_request. Such a solution would
+* let the value of the counter be always accurate,
+* but it would entail using an extra interface
+* function. This cost seems higher than the benefit,
+* being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
request_queue *q,
   unsigned int cmd_flags) {}
 #endif
 
+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
   bool at_head)
 {
@@ -4541,6 +4545,20 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
*hctx, struct

[PATCH BUGFIX 0/1] block, bfq: handle requeues of I/O requests

2018-02-05 Thread Paolo Valente
Hi,
just a note: the most difficult part in the implementation of this
patch has been how to handle the fact that the requeue and finish
hooks of the active elevator get invoked even for requests that are
no longer referenced by that elevator. You can find details in the
comments introduced by the patch. This note is just to point out this
unexpected fact (for me), in case it is not intentional.

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: add requeue-request hook

 block/bfq-iosched.c | 101 +++-
 1 file changed, 76 insertions(+), 25 deletions(-)

--
2.15.1
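
To illustrate the unexpected fact mentioned in this cover letter, the sketch below
shows the kind of defensive guard a finish/requeue hook ends up needing when it
can be invoked for requests the elevator no longer owns. It is a stand-alone,
purely hypothetical example with made-up names, not the actual BFQ implementation.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-request scheduler state; names are illustrative only. */
struct sched_rq_state {
	void *owner_queue;	/* scheduler queue the request belongs to, if any */
};

/*
 * A finish/requeue hook has to tolerate requests that do not (or no
 * longer) carry scheduler-private state, e.g. requests that bypassed
 * the elevator, so it checks before undoing any accounting.
 */
static void example_requeue_or_finish(struct sched_rq_state *st)
{
	if (st == NULL || st->owner_queue == NULL)
		return;			/* nothing of ours to undo */

	/* ... undo per-request accounting here ... */
	st->owner_queue = NULL;		/* state released exactly once */
}

int main(void)
{
	struct sched_rq_state st = { .owner_queue = NULL };

	example_requeue_or_finish(&st);	/* safe even with no private state */
	printf("hook tolerated a request it does not own\n");
	return 0;
}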


Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Paolo Valente


> Il giorno 30 gen 2018, alle ore 15:40, Ming Lei  ha 
> scritto:
> 
> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>> Hi.
>> 
> ...
>>   systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert
>> rq->0
>>   systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched:
>> not get rq, 1
>>  cfdisk-408   [000] 13.484220: bfq_insert_requests: insert
>> rq->1
>>kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched:
>> not get rq, 1
>> ===
>> 
>> Looks the same, right?
> 
> Yeah, same with before.
> 

Hi guys,
sorry for the delay with this fix.  We are proceeding very slowly on
this, because I'm super busy.  Anyway, now I can at least explain in
more detail the cause that leads to this hang.  Commit 'a6a252e64914
("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")'
makes all non-flush re-prepared requests be re-inserted into the I/O
scheduler.  With this change, I/O schedulers may get the same request
inserted again, even several times, without a finish_request invoked
on the request before each re-insertion.

For the I/O scheduler, every such re-prepared request is equivalent
to the insertion of a new request. For schedulers like mq-deadline
or kyber this fact causes no problems. In contrast, it confuses a stateful
scheduler like BFQ, which preserves states for an I/O request until
finish_request is invoked on it. In particular, BFQ has no way
to know that the above re-insertions concern the same, already dispatched
request. So it may get stuck waiting for the completion of these
re-inserted requests forever, thus preventing any other queue of
requests from being served.

We are trying to address this issue by adding the hook requeue_request
to bfq interface.

Unfortunately, with our current implementation of requeue_request in
place, bfq eventually gets to an incoherent state.  This is apparently
caused by a requeue of an I/O request, immediately followed by a
completion of the same request.  This seems rather absurd, and drives
bfq crazy.  But this is something for which we don't have definite
results yet.

We're working on it, sorry again for the delay.

Thanks,
Paolo

> -- 
> Ming




Re: [PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches

2018-01-18 Thread Paolo Valente


> Il giorno 13 gen 2018, alle ore 12:05, Paolo Valente 
> <paolo.vale...@linaro.org> ha scritto:
> 
> Hi Jens,
> here are again the two pending patches you asked me to resend [1]. One
> of them, fixing read-starvation problems, was accompanied by a cover
> letter. I'm pasting the content of that cover letter below.
> 
> The patch addresses (serious) starvation problems caused by
> request-tag exhaustion, as explained in more detail in the commit
> message. I started from the solution in the function
> kyber_limit_depth, but then I had to define more articulate limits, to
> counter starvation also in cases not covered in kyber_limit_depth.
> If this solution proves to be effective, I'm willing to port it
> somehow to the other schedulers.
> 

Hi Jens,
have you had time to check these patches?  Sorry for pushing, but I
guess 4.16 is getting closer, and these patches are performance
critical; especially the first, which solves a starvation problem.

Thanks,
Paolo

> Thanks,
> Paolo
> 
> [1] https://www.spinics.net/lists/linux-block/msg21586.html
> 
> Paolo Valente (2):
>  block, bfq: limit tags for writes and async I/O
>  block, bfq: limit sectors served with interactive weight raising
> 
> block/bfq-iosched.c | 158 +---
> block/bfq-iosched.h |  17 ++
> block/bfq-wf2q.c|   3 +
> 3 files changed, 169 insertions(+), 9 deletions(-)
> 
> --
> 2.15.1




[PATCH BUGFIX/IMPROVEMENT 1/2] block, bfq: limit tags for writes and async I/O

2018-01-13 Thread Paolo Valente
Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In their turn, these two problems may also cause
BFQ to lose control over latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.

This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.

According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].

[1] https://github.com/Algodev-github/S

Tested-by: Oleksandr Natalenko <oleksa...@natalenko.name>
Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 77 +
 block/bfq-iosched.h | 12 +
 2 files changed, 89 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1caeecad7af1..527bd2ccda51 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -417,6 +417,82 @@ static struct request *bfq_choose_req(struct bfq_data 
*bfqd,
}
 }
 
+/*
+ * See the comments on bfq_limit_depth for the purpose of
+ * the depths set in the function.
+ */
+static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+{
+   bfqd->sb_shift = bt->sb.shift;
+
+   /*
+* In-word depths if no bfq_queue is being weight-raised:
+* leaving 25% of tags only for sync reads.
+*
+* In next formulas, right-shift the value
+* (1U<<bfqd->sb_shift), instead of computing directly
+* (1U<<(bfqd->sb_shift - something)), to be robust against
+* any possible value of bfqd->sb_shift, without having to
+* limit 'something'.
+*/
+   /* no more than 50% of tags for async I/O */
+   bfqd->word_depths[0][0] = max((1U<<bfqd->sb_shift)>>1, 1U);
+   /*
+* no more than 75% of tags for sync writes (25% extra tags
+* w.r.t. async I/O, to prevent async I/O from starving sync
+* writes)
+*/
+   bfqd->word_depths[0][1] = max(((1U<<bfqd->sb_shift) * 3)>>2, 1U);
+
+   /*
+* In-word depths in case some bfq_queue is being weight-
+* raised: leaving ~63% of tags for sync reads. This is the
+* highest percentage for which, in our tests, application
+* start-up times didn't suffer from any regression due to tag
+* shortage.
+*/
+   /* no more than ~18% of tags for async I/O */
+   bfqd->word_depths[1][0] = max(((1U<<bfqd->sb_shift) * 3)>>4, 1U);
+   /* no more than ~37% of tags for sync writes (~20% extra tags) */
+   bfqd->word_depths[1][1] = max(((1U<<bfqd->sb_shift) * 6)>>4, 1U);
+}
+
+/*
+ * Async I/O can easily starve sync I/O (both sync reads and sync
+ * writes), by consuming all tags. Similarly, storms of sync writes,
+ * such as those that sync(2) may trigger, can starve sync reads.
+ * Limit depths of async I/O and sync writes so as to counter both
+ * problems.
+ */
+static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+{
+   struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
+   struct bfq_data *bfqd = data->q->elevator->elevator_data;
+   struct sbitmap_queue *bt;
+
+   if (op_is_sync(op) && !op_is_write(op))
+   return;
+
+   if (data->flags & BLK_MQ_REQ_RESERVED) {
+   if (unlikely(!tags->nr_reserved_tags)) {
+   WARN_ON_ONCE(1);
+   return;
+   }
+   bt = &tags->breserved_tags;
+   } else
+   bt = &tags->bitmap_tags;
+
+   if (unlikely(bfqd->sb_shift != bt->sb.shift))
+   bfq_update_depths(bfqd, bt);
+
+   data->shallow_depth =
+   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+
+   bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
+   __func__, bfqd->wr_busy_queues, op_is_sync(op),
+   data->shallow_depth);
+}
+
 static struct bfq_queue *
 bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
 sector_t sector
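
For reference, the word_depths formulas in the patch above can be evaluated
outside the kernel. The stand-alone program below (an illustration only; the
shift value is a hypothetical example) applies the same arithmetic to a sbitmap
word of 64 tags, i.e. shift = 6.

#include <stdio.h>

/* Same arithmetic as bfq_update_depths() in the patch above, applied to
 * a hypothetical sbitmap word shift of 6, i.e. 64 tags per word. */
static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned int shift = 6;		/* hypothetical: 64 tags per word */
	unsigned int depths[2][2];

	/* no bfq_queue weight-raised: 50% async, 75% sync writes */
	depths[0][0] = max_u((1U << shift) >> 1, 1U);
	depths[0][1] = max_u(((1U << shift) * 3) >> 2, 1U);
	/* some bfq_queue weight-raised: ~18% async, ~37% sync writes */
	depths[1][0] = max_u(((1U << shift) * 3) >> 4, 1U);
	depths[1][1] = max_u(((1U << shift) * 6) >> 4, 1U);

	printf("async, no wr:      %u of %u tags\n", depths[0][0], 1U << shift);
	printf("sync write, no wr: %u of %u tags\n", depths[0][1], 1U << shift);
	printf("async, wr:         %u of %u tags\n", depths[1][0], 1U << shift);
	printf("sync write, wr:    %u of %u tags\n", depths[1][1], 1U << shift);
	return 0;
}

It prints 32, 48, 12 and 24: the 50%, 75%, ~18% and ~37% limits described in the
comments. Sync reads are never limited.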

[PATCH BUGFIX/IMPROVEMENT 0/2] block, bfq: two pending patches

2018-01-13 Thread Paolo Valente
Hi Jens,
here are again the two pending patches you asked me to resend [1]. One
of them, fixing read-starvation problems, was accompanied by a cover
letter. I'm pasting the content of that cover letter below.

The patch addresses (serious) starvation problems caused by
request-tag exhaustion, as explained in more detail in the commit
message. I started from the solution in the function
kyber_limit_depth, but then I had to define more articulate limits, to
counter starvation also in cases not covered in kyber_limit_depth.
If this solution proves to be effective, I'm willing to port it
somehow to the other schedulers.

Thanks,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg21586.html

Paolo Valente (2):
  block, bfq: limit tags for writes and async I/O
  block, bfq: limit sectors served with interactive weight raising

 block/bfq-iosched.c | 158 +---
 block/bfq-iosched.h |  17 ++
 block/bfq-wf2q.c|   3 +
 3 files changed, 169 insertions(+), 9 deletions(-)

--
2.15.1


[PATCH BUGFIX/IMPROVEMENT 2/2] block, bfq: limit sectors served with interactive weight raising

2018-01-13 Thread Paolo Valente
To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.

This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput from applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.

This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once transferred 110K sectors or more.

Based on this consideration, this commit ends weight-raising early
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, reach this threshold well before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.

Tested-by: Oleksandr Natalenko <oleksa...@natalenko.name>
Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 81 +++--
 block/bfq-iosched.h |  5 
 block/bfq-wf2q.c|  3 ++
 3 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 527bd2ccda51..93a97a7fe519 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -209,15 +209,17 @@ static struct kmem_cache *bfq_pool;
  * interactive applications automatically, using the following formula:
  * duration = (R / r) * T, where r is the peak rate of the device, and
  * R and T are two reference parameters.
- * In particular, R is the peak rate of the reference device (see below),
- * and T is a reference time: given the systems that are likely to be
- * installed on the reference device according to its speed class, T is
- * about the maximum time needed, under BFQ and while reading two files in
- * parallel, to load typical large applications on these systems.
- * In practice, the slower/faster the device at hand is, the more/less it
- * takes to load applications with respect to the reference device.
- * Accordingly, the longer/shorter BFQ grants weight raising to interactive
- * applications.
+ * In particular, R is the peak rate of the reference device (see
+ * below), and T is a reference time: given the systems that are
+ * likely to be installed on the reference device according to its
+ * speed class, T is about the maximum time needed, under BFQ and
+ * while reading two files in parallel, to load typical large
+ * applications on these systems (see the comments on
+ * max_service_from_wr below, for more details on how T is obtained).
+ * In practice, the slower/faster the device at hand is, the more/less
+ * it takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to
+ * interactive applications.
  *
  * BFQ uses four different reference pairs (R, T), depending on:
  * . whether the device is rotational or non-rotational;
@@ -254,6 +256,60 @@ static int T_slow[2];
 static int T_fast[2];
 static int device_speed_thresh[2];
 
+/*
+ * BFQ uses the above-detailed, time-based weight-raising mechanism to
+ * privilege interactive tasks. This mechanism is vulnerable to the
+ * following false positives: I/O-bound applications that will go on
+ * doing I/O for much longer than the duration of weight
+ * raising. These applications have basically no benefit from being
+ * weight-raised at the beginning of their I/O. On the opposite end,
+ * while being weight-raised, these applications
+ * a) unjustly steal throughput to applications that may actually need
+ * low latency;
+ * b) make BFQ uselessly perform device idling; device idling results
+ * in loss of device throughput with most flash-based storage, and may
+ * increase l
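
As a rough, self-contained illustration of the duration = (R / r) * T rule and of
the service-based cap introduced by this patch: all numeric values below are
hypothetical placeholders, except the ~110K-sector figure taken from the commit
message.

#include <stdio.h>

/*
 * Illustration only: the reference values below (R, T) are made-up
 * placeholders, not the constants BFQ actually uses.
 */
#define REF_RATE	20000ULL	/* R: reference peak rate, sectors/s (hypothetical) */
#define REF_TIME_MS	 6000ULL	/* T: reference start-up time, ms (hypothetical) */
#define MAX_WR_SERVICE	(110ULL * 1024)	/* ~110K sectors, as in the commit message */

int main(void)
{
	unsigned long long r = 60000ULL;	/* measured device peak rate, sectors/s */
	unsigned long long service_from_wr = 120ULL * 1024; /* sectors served while raised */

	/* duration = (R / r) * T: slower devices get longer weight raising */
	unsigned long long wr_duration_ms = REF_RATE * REF_TIME_MS / r;

	printf("time-based weight-raising duration: %llu ms\n", wr_duration_ms);

	/* the new, service-based end condition added by this patch */
	if (service_from_wr > MAX_WR_SERVICE)
		printf("weight raising ended early: %llu sectors already served\n",
		       service_from_wr);
	return 0;
}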

Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2018-01-12 Thread Paolo Valente


> Il giorno 12 gen 2018, alle ore 11:15, Holger Hoffstätte 
> <hol...@applied-asynchrony.com> ha scritto:
> 
> On 01/12/18 06:58, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 28 dic 2017, alle ore 15:00, Holger Hoffstätte 
>>> <hol...@applied-asynchrony.com> ha scritto:
>>> 
>>> 
>>> On 12/28/17 12:19, Paolo Valente wrote:
>>> (snip half a tech report ;)
>>> 
>>> So either this or the previous patch ("limit tags for writes and async I/O"
>>> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't
>>> log into the affected system anymore I couldn't get any stack traces, blk-mq
>>> debug output etc. but there was nothing in dmesg/on the console, so it
>>> wasn't a BUG/OOPS.
>>> 
>>> -h
>> 
>> Hi Holger,
>> if, as I guess, this problem hasn't gone away for you, I have two
>> requests:
>> 1) could you share your exact test
>> 2) if nothing happens in my systems with your test, would you be
>> willing to retry with the dev version of bfq?  It should be able to
>> tell us what takes to your hang.  If you are willing to do this test,
>> I'll prepare a branch with everything already configured for you.
> 
> Hi,
> 
> thanks for following up but there's no need for any of that; it turned out
> to be something else since I got the same hang without those patches at
> least once (during a btrfs balance, even though it didn't look like btrfs'
> fault directly; more like block/mm/helpers.
> 
> So on January 7 I posted to linux-block et.al. where I said
> "So this turned out to be something else, sorry for the false alarm."
> but apparently that didn't make it through since it's not in the
> archives either. Sorry.
> 
> Long story short, the good news is that I've been running with both patches
> since then without any issue. :)
> 

Wow, what a relief! :)

So, Jens, now that the only reported issue is gone, can you please consider
queueing this patch and the other pending one [1]?  They are both
critical for bfq performance.

Thanks,
Paolo

[1] https://www.spinics.net/lists/kernel/msg2684463.html

> cheers
> Holger




Re: [PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2018-01-11 Thread Paolo Valente


> Il giorno 28 dic 2017, alle ore 15:00, Holger Hoffstätte 
> <hol...@applied-asynchrony.com> ha scritto:
> 
> 
> On 12/28/17 12:19, Paolo Valente wrote:
> (snip half a tech report ;)
> 
> So either this or the previous patch ("limit tags for writes and async I/O"
> can lead to a hard, unrecoverable hang with heavy writes. Since I couldn't
> log into the affected system anymore I couldn't get any stack traces, blk-mq
> debug output etc. but there was nothing in dmesg/on the console, so it
> wasn't a BUG/OOPS.
> 
> -h

Hi Holger,
if, as I guess, this problem hasn't gone away for you, I have two
requests:
1) could you share your exact test
2) if nothing happens in my systems with your test, would you be
willing to retry with the dev version of bfq?  It should be able to
tell us what leads to your hang.  If you are willing to do this test,
I'll prepare a branch with everything already configured for you.

Thanks,
Paolo

[PATCH BUGFIX 0/1] block, bfq: fix of a bug introduced by my last patch

2018-01-10 Thread Paolo Valente
Hi Jens,
unfortunately, the patch of mine that you applied yesterday ("block,
bfq: release oom-queue ref to root group on exit"), does not compile
if CONFIG_BFQ_GROUP_IOSCHED is not defined. I forgot to test that patch
with that option disabled. Honestly, I'm more and more uncertain
about how useful (disabling) that option is ...

In any case, one question about the present patch: I didn't add any
hash for the commit that this patch fixes, because, if I'm not
mistaken, the hash that that commit will have in mainline is not yet
defined. I hope this is the right way to go.

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: compile group put for oom queue only if BFQ_GROUP_IOSCHED
is set

 block/bfq-iosched.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--
2.15.1


[PATCH BUGFIX 1/1] block, bfq: compile group put for oom queue only if BFQ_GROUP_IOSCHED is set

2018-01-10 Thread Paolo Valente
Commit ("block, bfq: release oom-queue ref to root group on exit")
added a missing put of the root bfq group for the oom queue. That put
has to be, and can be, performed only if CONFIG_BFQ_GROUP_IOSCHED is
defined: the function doing the put is not even defined at all if
CONFIG_BFQ_GROUP_IOSCHED is not defined. But that commit makes that
put be invoked regardless of whether CONFIG_BFQ_GROUP_IOSCHED is
defined. This commit fixes this mistake, by making that invocation be
compiled only if CONFIG_BFQ_GROUP_IOSCHED is actually defined.

Fixes ("block, bfq: release oom-queue ref to root group on exit")
Reported-by: Jan Alexander Steffens <jan.steff...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c56a495af2e8..93a97a7fe519 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5015,10 +5015,9 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
 hrtimer_cancel(&bfqd->idle_slice_timer);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
/* release oom-queue reference to root group */
bfqg_and_blkg_put(bfqd->root_group);
-
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
 blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
 #else
 spin_lock_irq(&bfqd->lock);
-- 
2.15.1



Re: [PATCH BUGFIX V2 1/2] block, bfq: put async queues for root bfq groups too

2018-01-09 Thread Paolo Valente


> Il giorno 10 gen 2018, alle ore 02:41, Guoqing Jiang <gqji...@suse.com> ha 
> scritto:
> 
> 
> 
> On 01/09/2018 05:27 PM, Paolo Valente wrote:
>> For each pair [device for which bfq is selected as I/O scheduler,
>> group in blkio/io], bfq maintains a corresponding bfq group. Each such
>> bfq group contains a set of async queues, with each async queue
>> created on demand, i.e., when some I/O request arrives for it.  On
>> creation, an async queue gets an extra reference, to make sure that
>> the queue is not freed as long as its bfq group exists.  Accordingly,
>> to allow the queue to be freed after the group exited, this extra
>> reference must be released on group exit.
>> 
>> The above holds also for a bfq root group, i.e., for the bfq group
>> corresponding to the root blkio/io root for a given device. Yet, by
>> mistake, the references to the existing async queues of a root group
>> are not released when the latter exits. This causes a memory leak when
>> the instance of bfq for a given device exits. In a similar vein,
>> bfqg_stats_xfer_dead is not executed for a root group.
>> 
>> This commit fixes bfq_pd_offline so that the latter executes the above
>> missing operations for a root group too.
>> 
>> Reported-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
>> Reported-by: Guoqing Jiang <gqji...@suse.com>
>> Signed-off-by: Davide Ferrari <davideferra...@gmail.com>
>> Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
>> ---
>>  block/bfq-cgroup.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
>> index da1525ec4c87..d819dc77fe65 100644
>> --- a/block/bfq-cgroup.c
>> +++ b/block/bfq-cgroup.c
>> @@ -775,10 +775,11 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
>>  unsigned long flags;
>>  int i;
>>  +   spin_lock_irqsave(&bfqd->lock, flags);
>> +
>>  if (!entity) /* root group */
>> -return;
>> +goto put_async_queues;
>>  -   spin_lock_irqsave(&bfqd->lock, flags);
>>  /*
>>   * Empty all service_trees belonging to this group before
>>   * deactivating the group itself.
>> @@ -809,6 +810,8 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
>>  }
>>  __bfq_deactivate_entity(entity, false);
>> +
>> +put_async_queues:
>>  bfq_put_async_queues(bfqd, bfqg);
>>  spin_unlock_irqrestore(&bfqd->lock, flags);
> 
> With this change, bfqg_stats_xfer_dead will be called even if entity does not
> exist,

Actually, the entity associated with the bfq group being offlined
exists even if the local variable entity is NULL here.  Simply, that
variable gets NULL if the bfq group is the bfq root group for a
device.

> is it necessary?

No, I opted for this solution to not shake the code too much, and
considering that
- bfqg_stats_xfer_dead simply does nothing if applied
to a root group
- who knows, in the future that function may need
to be invoked for a root group too.

Yet, if you guys think that it would be cleaner to not invoke
bfqg_stats_xfer_dead at all for a root group, I'll change the code
accordingly (this would introduce a little asymmetry with
cfq_pd_offline, which invokes cfqg_stats_xfer_dead unconditionally).

Thanks,
Paolo

> Thanks.
> 
> Regards,
> Guoqing



Re: unify the interface of the proportional-share policy in blkio/io

2018-01-09 Thread Paolo Valente


> Il giorno 09 gen 2018, alle ore 20:53, Jens Axboe <ax...@kernel.dk> ha 
> scritto:
> 
> On 1/9/18 12:52 PM, Tejun Heo wrote:
>> Hello, Paolo.
>> 
>> On Thu, Jan 04, 2018 at 08:00:02PM +0100, Paolo Valente wrote:
>>> The solution for the second type of parameters may prove useful to
>>> unify also the computation of statistics for the throttling policy.
>>> 
>>> Does this proposal sound reasonable?
>> 
>> So, the above should work too but I wonder whether we could do this
>> simpler.  Frankly, I wouldn't mind if cfq and bfq can't be mixed on a
>> system - e.g. they can be built together but you can't enable bfq on
>> some devides and cfq on others.  If we do that, all we need to do is
>> just removing / adding cftypes when either gets activated which cgroup
>> already does.
> 
> Not sure that would fly, since cfq is legacy and bfq is mq. You don't
> always have a free choice of which one to use...
> 

Yep.  So, do you guys think that our proposal may be ok?  We are
waiting just for the green light to start implementing it.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



[PATCH BUGFIX V2 1/2] block, bfq: put async queues for root bfq groups too

2018-01-09 Thread Paolo Valente
For each pair [device for which bfq is selected as I/O scheduler,
group in blkio/io], bfq maintains a corresponding bfq group. Each such
bfq group contains a set of async queues, with each async queue
created on demand, i.e., when some I/O request arrives for it.  On
creation, an async queue gets an extra reference, to make sure that
the queue is not freed as long as its bfq group exists.  Accordingly,
to allow the queue to be freed after the group exited, this extra
reference must be released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group
corresponding to the root blkio/io root for a given device. Yet, by
mistake, the references to the existing async queues of a root group
are not released when the latter exits. This causes a memory leak when
the instance of bfq for a given device exits. In a similar vein,
bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above
missing operations for a root group too.
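
As a rough illustration of the reference pattern at stake, here is a minimal,
self-contained sketch in plain user-space C (all names invented for this
sketch, not the actual bfq code): the creation path takes an extra reference
on behalf of the group, and the group-exit path must drop it, otherwise the
queue can never be freed.

#include <stdio.h>
#include <stdlib.h>

/* Toy async queue: freed only when its reference count drops to zero. */
struct toy_queue {
	int ref;
};

static struct toy_queue *queue_create(void)
{
	struct toy_queue *q = calloc(1, sizeof(*q));

	q->ref = 1;	/* reference held by the I/O that created it */
	q->ref++;	/* extra reference held on behalf of the group */
	return q;
}

static void queue_put(struct toy_queue *q)
{
	if (--q->ref == 0) {
		printf("queue freed\n");
		free(q);
	}
}

int main(void)
{
	struct toy_queue *q = queue_create();

	queue_put(q);	/* last I/O completes: drops its reference */
	queue_put(q);	/* group exit: without this put, q leaks forever */
	return 0;
}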

Reported-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
Reported-by: Guoqing Jiang <gqji...@suse.com>
Signed-off-by: Davide Ferrari <davideferra...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-cgroup.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index da1525ec4c87..d819dc77fe65 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -775,10 +775,11 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
unsigned long flags;
int i;
 
+   spin_lock_irqsave(&bfqd->lock, flags);
+
if (!entity) /* root group */
-   return;
+   goto put_async_queues;
 
-   spin_lock_irqsave(&bfqd->lock, flags);
/*
 * Empty all service_trees belonging to this group before
 * deactivating the group itself.
@@ -809,6 +810,8 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
}
 
__bfq_deactivate_entity(entity, false);
+
+put_async_queues:
bfq_put_async_queues(bfqd, bfqg);
 
spin_unlock_irqrestore(&bfqd->lock, flags);
-- 
2.15.1



[PATCH BUGFIX V2 0/2] block, bfq: fix two memory leaks related to cgroups

2018-01-09 Thread Paolo Valente
[There was a mistake in the subject of the second patch, sorry]

Hi Jens,
these two patches fix two related memory leaks, the first reported in
[1], and the second found by ourselves while fixing the first.

Thanks,
Paolo

[1] https://www.mail-archive.com/linux-block@vger.kernel.org/msg16258.html

Paolo Valente (2):
  block, bfq: put async queues for root bfq groups too
  block, bfq: release oom-queue ref to root group on exit

 block/bfq-cgroup.c  | 7 +--
 block/bfq-iosched.c | 3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

--
2.15.1


[PATCH BUGFIX V2 2/2] block, bfq: release oom-queue ref to root group on exit

2018-01-09 Thread Paolo Valente
On scheduler init, a reference to the root group, and a reference to
its corresponding blkg are taken for the oom queue. Yet these
references are not released on scheduler exit, which prevents these
objects from be freed. This commit adds the missing reference
releases.

Reported-by: Davide Ferrari <davideferra...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ea48b5c8f088..c56a495af2e8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5015,6 +5015,9 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
hrtimer_cancel(&bfqd->idle_slice_timer);
 
+   /* release oom-queue reference to root group */
+   bfqg_and_blkg_put(bfqd->root_group);
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
 #else
-- 
2.15.1



[PATCH BUGFIX 1/2] block, bfq: put async queues for root bfq groups too

2018-01-09 Thread Paolo Valente
For each pair [device for which bfq is selected as I/O scheduler,
group in blkio/io], bfq maintains a corresponding bfq group. Each such
bfq group contains a set of async queues, with each async queue
created on demand, i.e., when some I/O request arrives for it.  On
creation, an async queue gets an extra reference, to make sure that
the queue is not freed as long as its bfq group exists.  Accordingly,
to allow the queue to be freed after the group exited, this extra
reference must be released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group
corresponding to the root blkio/io root for a given device. Yet, by
mistake, the references to the existing async queues of a root group
are not released when the latter exits. This causes a memory leak when
the instance of bfq for a given device exits. In a similar vein,
bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above
missing operations for a root group too.

Reported-by: Holger Hoffstätte <hol...@applied-asynchrony.com>
Reported-by: Guoqing Jiang <gqji...@suse.com>
Signed-off-by: Davide Ferrari <davideferra...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-cgroup.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index da1525ec4c87..d819dc77fe65 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -775,10 +775,11 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
unsigned long flags;
int i;
 
+   spin_lock_irqsave(&bfqd->lock, flags);
+
if (!entity) /* root group */
-   return;
+   goto put_async_queues;
 
-   spin_lock_irqsave(&bfqd->lock, flags);
/*
 * Empty all service_trees belonging to this group before
 * deactivating the group itself.
@@ -809,6 +810,8 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
}
 
__bfq_deactivate_entity(entity, false);
+
+put_async_queues:
bfq_put_async_queues(bfqd, bfqg);
 
spin_unlock_irqrestore(&bfqd->lock, flags);
-- 
2.15.1



[PATCH BUGFIX 2/2] bfq-sq, bfq-mq: release oom-queue ref to root group on exit

2018-01-09 Thread Paolo Valente
On scheduler init, a reference to the root group, and a reference to
its corresponding blkg are taken for the oom queue. Yet these
references are not released on scheduler exit, which prevents these
objects from be freed. This commit adds the missing reference
releases.

Reported-by: Davide Ferrari <davideferra...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ea48b5c8f088..c56a495af2e8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5015,6 +5015,9 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
hrtimer_cancel(&bfqd->idle_slice_timer);
 
+   /* release oom-queue reference to root group */
+   bfqg_and_blkg_put(bfqd->root_group);
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq);
 #else
-- 
2.15.1



[PATCH BUGFIX 0/2] block, bfq: fix two memory leaks related to cgroups

2018-01-09 Thread Paolo Valente
Hi Jens,
these two patches fix two related memory leaks, the first reported in
[1], and the second found by ourselves while fixing the first.

Thanks,
Paolo

[1] https://www.mail-archive.com/linux-block@vger.kernel.org/msg16258.html

Paolo Valente (2):
  block, bfq: put async queues for root bfq groups too
  bfq-sq, bfq-mq: release oom-queue ref to root group on exit

 block/bfq-cgroup.c  | 7 +--
 block/bfq-iosched.c | 3 +++
 2 files changed, 8 insertions(+), 2 deletions(-)

--
2.15.1


unify the interface of the proportional-share policy in blkio/io

2018-01-04 Thread Paolo Valente
Hi Tejun, Jens, all,
the shares of storage resources are controlled through weights in the
proportional-share policy of the blkio/io controllers of the cgroups
subsystem.  But, on blk-mq, this control doesn't work for any legacy
application, service or tool.  In a similar vein, in most of the
interface files where legacy code expects to find statistics,
statistics are not updated at all.  The cause is as follows.

For devices using blk-mq, the proportional-share policy is enforced by
BFQ, while CFQ enforces this policy for blk.  But the current
implementation of blkio/io doesn't allow two I/O schedulers to share
the same interface sysfs files; so, if CFQ creates these files for the
proportional-share policy for blk, BFQ cannot attach somehow to them,
and vice versa.  One of these parameters is the weight of blkio/io
groups, used to control resource shares.  So, to still allow people to
set group weights with BFQ, I resorted to making BFQ create its own
weight parameter, with a different name: bfq.weight.  I used a similar
approach to replicate all statistic files.

Of course, no legacy code uses these different names, or is likely to
do so.  Having these two sets of names is simply a source of
confusion, as also pointed out, e.g., by Lennart Poettering, and
acknowledged by Tejun [1].

So, I started to work on getting a unified interface, with a
collaborator.  And we designed a solution that seems sensible to us.
Before proceeding with the implementation, we would need some feedback
on this solution, especially to avoid wasting time on the wrong
solution.

The code that shows or reads values through blkio/io parameters, for
the proportional-share policy, is currently fully contained in the BFQ
and CFQ schedulers.  We want to split this code into two parts:
1.  I/O part, which reads the value passed by the user, and shows the
value to the user; we want to move this part, which becomes common
among schedulers, into blk-cgroup.c or the like.
2.  get/set part, which gets/gives the value from/to the above part,
reading/writing this value from/to the internal state of the
scheduler; each scheduler knows what to do exactly for each of these
get/set function, so this part will stay in the scheduler.

In addition, we consider two types of parameters:
1. exact parameters, such as the weight, for which: (a) the
read-from-user function (I/O part moved to blk-cgroup) must pass the
value read to both I/O schedulers, through the set functions of the
schedulers, and (b) the show-to-user function (I/O part moved to
blk-cgroup) assumes that it would get the same value from any of the
two schedulers;
2. cumulative parameters such as the statistics, for which the related
code is identical (and replicated) in CFQ and BFQ.  Our idea, in this
case, is to move the common code into blk-cgroup, and leave in the
schedulers only the parts that may differ.  In practice, to update
statistics, CFQ and BFQ will invoke common blk-cgroup functions, and
the latter will take care of properly cumulating/combining statistics.

The solution for the second type of parameters may prove useful to
unify also the computation of statistics for the throttling policy.
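
To make the intended split a bit more concrete, here is a minimal,
self-contained user-space sketch (all names invented for illustration, this
is not the blk-cgroup API): the common part owns the single user-visible
weight file and dispatches to per-scheduler get/set callbacks, so that a
write to the one file keeps both schedulers in sync and a read does not need
to know which scheduler is active on which device.

#include <stdio.h>

/*
 * Hypothetical per-scheduler callbacks: each scheduler only maps the value
 * to/from its internal state; the I/O with the user happens in one place.
 */
struct ps_ops {
	const char *name;
	unsigned int (*get_weight)(void);
	void (*set_weight)(unsigned int w);
};

static unsigned int bfq_weight = 100, cfq_weight = 100;

static unsigned int bfq_get(void) { return bfq_weight; }
static void bfq_set(unsigned int w) { bfq_weight = w; }
static unsigned int cfq_get(void) { return cfq_weight; }
static void cfq_set(unsigned int w) { cfq_weight = w; }

static struct ps_ops schedulers[] = {
	{ "bfq", bfq_get, bfq_set },
	{ "cfq", cfq_get, cfq_set },
};

/* Common part: a write of an "exact" parameter is propagated to every
 * scheduler, a read assumes all schedulers agree on the value. */
static void weight_store(unsigned int w)
{
	unsigned int i;

	for (i = 0; i < sizeof(schedulers) / sizeof(schedulers[0]); i++)
		schedulers[i].set_weight(w);
}

static unsigned int weight_show(void)
{
	return schedulers[0].get_weight();
}

int main(void)
{
	weight_store(300);
	printf("weight = %u\n", weight_show()); /* prints: weight = 300 */
	return 0;
}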

Does this proposal sound reasonable?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057


[PATCH IMPROVEMENT] block, bfq: limit sectors served with interactive weight raising

2017-12-28 Thread Paolo Valente
To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.

This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput to applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.

This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once transferred 110K sectors or more.

Based on this consideration, this commit ends weight-raising early
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, get to this threshold much before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.
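
The following is a rough, self-contained sketch of the rule just described
(plain user-space C, simplified names, not the actual patch code; the
threshold value below just restates the "110K sectors plus a safety margin"
figure quoted above): weight-raising simply ends as soon as the service
received while weight-raised crosses the threshold.

#include <stdio.h>

/* ~110K sectors plus a safety margin; illustrative value only */
#define MAX_SERVICE_FROM_WR (120UL * 1024)

struct toy_queue {
	unsigned long service_from_wr; /* sectors served while weight-raised */
	int weight_raised;
};

/* Invoked (conceptually) each time the queue receives service */
static void account_service(struct toy_queue *q, unsigned long sectors)
{
	if (!q->weight_raised)
		return;

	q->service_from_wr += sectors;
	if (q->service_from_wr > MAX_SERVICE_FROM_WR) {
		q->weight_raised = 0; /* early end of weight-raising */
		printf("weight-raising ended after %lu sectors\n",
		       q->service_from_wr);
	}
}

int main(void)
{
	struct toy_queue q = { 0, 1 };
	int i;

	/* an I/O-bound queue keeps receiving service in 1024-sector chunks */
	for (i = 0; i < 200 && q.weight_raised; i++)
		account_service(&q, 1024);

	return 0;
}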

Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 81 +++--
 block/bfq-iosched.h |  5 
 block/bfq-wf2q.c|  3 ++
 3 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6f75015d18c0..ea48b5c8f088 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -209,15 +209,17 @@ static struct kmem_cache *bfq_pool;
  * interactive applications automatically, using the following formula:
  * duration = (R / r) * T, where r is the peak rate of the device, and
  * R and T are two reference parameters.
- * In particular, R is the peak rate of the reference device (see below),
- * and T is a reference time: given the systems that are likely to be
- * installed on the reference device according to its speed class, T is
- * about the maximum time needed, under BFQ and while reading two files in
- * parallel, to load typical large applications on these systems.
- * In practice, the slower/faster the device at hand is, the more/less it
- * takes to load applications with respect to the reference device.
- * Accordingly, the longer/shorter BFQ grants weight raising to interactive
- * applications.
+ * In particular, R is the peak rate of the reference device (see
+ * below), and T is a reference time: given the systems that are
+ * likely to be installed on the reference device according to its
+ * speed class, T is about the maximum time needed, under BFQ and
+ * while reading two files in parallel, to load typical large
+ * applications on these systems (see the comments on
+ * max_service_from_wr below, for more details on how T is obtained).
+ * In practice, the slower/faster the device at hand is, the more/less
+ * it takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to
+ * interactive applications.
  *
  * BFQ uses four different reference pairs (R, T), depending on:
  * . whether the device is rotational or non-rotational;
@@ -254,6 +256,60 @@ static int T_slow[2];
 static int T_fast[2];
 static int device_speed_thresh[2];
 
+/*
+ * BFQ uses the above-detailed, time-based weight-raising mechanism to
+ * privilege interactive tasks. This mechanism is vulnerable to the
+ * following false positives: I/O-bound applications that will go on
+ * doing I/O for much longer than the duration of weight
+ * raising. These applications have basically no benefit from being
+ * weight-raised at the beginning of their I/O. On the opposite end,
+ * while being weight-raised, these applications
+ * a) unjustly steal throughput to applications that may actually need
+ * low latency;
+ * b) make BFQ uselessly perform device idling; device idling results
+ * in loss of device throughput with most flash-based storage, and may
+ * increase latencies when used purposelessly.
+ *
+ * BFQ tries to reduce these problems, by adopting the following
+ * countermeasure. To introduce this countermeasure, we

[PATCH IMPROVEMENT/BUGFIX 0/1] block, bfq: address starvation caused by tag consumption

2017-12-27 Thread Paolo Valente
Hi Jens, all,
here's the patch I anticipated in my last email. It addresses
(serious) starvation problems caused by request-tag exhaustion, as
explained in more detail in the commit message. I started from the
solution in the function kyber_limit_depth, but then I had to define
more articulate limits, to counter starvation also in cases not
covered in kyber_limit_depth.

If this solution proves to be effective, I'm willing to port it
somehow to the other schedulers.

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: limit tags for writes and async I/O

 block/bfq-iosched.c | 77 +
 block/bfq-iosched.h | 12 +
 2 files changed, 89 insertions(+)

--
2.15.1


[PATCH IMPROVEMENT/BUGFIX 1/1] block, bfq: limit tags for writes and async I/O

2017-12-27 Thread Paolo Valente
Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In turn, these two problems may also cause
BFQ to lose control of latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.

This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.

According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].

[1] https://github.com/Algodev-github/S
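
For a concrete feel of these limits, the small standalone program below
evaluates the same formulas used in the patch for a hypothetical sbitmap
word of 64 tags (sb_shift = 6, an assumed example value): with no
weight-raised queue, async I/O gets at most 32 tags and sync writes 48,
while with a weight-raised queue the limits shrink to 12 and 24.

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	unsigned int sb_shift = 6;               /* 64 tags per word, example */
	unsigned int word = 1U << sb_shift;
	unsigned int depths[2][2];

	/* no bfq_queue weight-raised: 50% async, 75% sync writes */
	depths[0][0] = MAX(word >> 1, 1U);       /* -> 32 */
	depths[0][1] = MAX((word * 3) >> 2, 1U); /* -> 48 */
	/* some bfq_queue weight-raised: ~18% async, ~37% sync writes */
	depths[1][0] = MAX((word * 3) >> 4, 1U); /* -> 12 */
	depths[1][1] = MAX((word * 6) >> 4, 1U); /* -> 24 */

	printf("no wr: async %u/%u, sync writes %u/%u\n",
	       depths[0][0], word, depths[0][1], word);
	printf("   wr: async %u/%u, sync writes %u/%u\n",
	       depths[1][0], word, depths[1][1], word);
	return 0;
}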

Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 77 +
 block/bfq-iosched.h | 12 +
 2 files changed, 89 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e33c5c4c9856..6f75015d18c0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -417,6 +417,82 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
}
 }
 
+/*
+ * See the comments on bfq_limit_depth for the purpose of
+ * the depths set in the function.
+ */
+static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+{
+   bfqd->sb_shift = bt->sb.shift;
+
+   /*
+* In-word depths if no bfq_queue is being weight-raised:
+* leaving 25% of tags only for sync reads.
+*
+* In next formulas, right-shift the value
+* (1U<<bfqd->sb_shift), instead of computing directly
+* (1U<<(bfqd->sb_shift - something)), to be robust against
+* any possible value of bfqd->sb_shift, without having to
+* limit 'something'.
+*/
+   /* no more than 50% of tags for async I/O */
+   bfqd->word_depths[0][0] = max((1U<<bfqd->sb_shift)>>1, 1U);
+   /*
+* no more than 75% of tags for sync writes (25% extra tags
+* w.r.t. async I/O, to prevent async I/O from starving sync
+* writes)
+*/
+   bfqd->word_depths[0][1] = max(((1U<<bfqd->sb_shift) * 3)>>2, 1U);
+
+   /*
+* In-word depths in case some bfq_queue is being weight-
+* raised: leaving ~63% of tags for sync reads. This is the
+* highest percentage for which, in our tests, application
+* start-up times didn't suffer from any regression due to tag
+* shortage.
+*/
+   /* no more than ~18% of tags for async I/O */
+   bfqd->word_depths[1][0] = max(((1U<<bfqd->sb_shift) * 3)>>4, 1U);
+   /* no more than ~37% of tags for sync writes (~20% extra tags) */
+   bfqd->word_depths[1][1] = max(((1U<<bfqd->sb_shift) * 6)>>4, 1U);
+}
+
+/*
+ * Async I/O can easily starve sync I/O (both sync reads and sync
+ * writes), by consuming all tags. Similarly, storms of sync writes,
+ * such as those that sync(2) may trigger, can starve sync reads.
+ * Limit depths of async I/O and sync writes so as to counter both
+ * problems.
+ */
+static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+{
+   struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
+   struct bfq_data *bfqd = data->q->elevator->elevator_data;
+   struct sbitmap_queue *bt;
+
+   if (op_is_sync(op) && !op_is_write(op))
+   return;
+
+   if (data->flags & BLK_MQ_REQ_RESERVED) {
+   if (unlikely(!tags->nr_reserved_tags)) {
+   WARN_ON_ONCE(1);
+   return;
+   }
bt = &tags->breserved_tags;
+   } else
+   bt = &tags->bitmap_tags;
+
+   if (unlikely(bfqd->sb_shift != bt->sb.shift))
+   bfq_update_depths(bfqd, bt);
+
+   data->shallow_depth =
+   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+
+   bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
+   __func__, bfqd->wr_busy_queues, op_is_sync(op),
+   data->shallow_depth);
+}
+
 static struct bfq_queue *
 bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
 sector_t sector, struct rb_node **ret_parent,
@@ -5265,6 +5341,7 @@ static struct elv_fs_entry bfq_attrs[] = {
 
static struct elevator_type iosched_bfq_mq = {
	.ops.mq = {

Re: severe regression with random I/O from (at least) 4.14?

2017-12-21 Thread Paolo Valente


> Il giorno 21 dic 2017, alle ore 11:57, Paolo Valente 
> <paolo.vale...@linaro.org> ha scritto:
> 
> Hi,
> a few minutes ago I bumped into this apparent severe regression, with 
> 4.15-rc4 and an SSD PLEXTOR PX-256M5S. If, with none as I/O scheduler, I do
> fio --name=global --rw=randread --size=512m --name=job1
> 
> I get
> read : io=524288KB, bw=34402KB/s, iops=8600, runt= 15240msec
> 
> This device had to reach at least 40 KIOPS. I retried with mq-deadline too, 
> and then with a 4.14, same outcome in all cases. I asked some colleague to 
> repeat the test with his/her device. Same outcome, so far, also with an SSD 
> 850 EVO 250GB.
> 
> I'm sorry if I'm making some colossal mistake.
> 

Forget it, sorry. I was confused by a failed test where iodepth was higher than 
1, and I am sleeping too little ...

Thanks,
Paolo

> Thanks,
> Paolo
> 
> 
> 



severe regression with random I/O from (at least) 4.14?

2017-12-21 Thread Paolo Valente
Hi,
a few minutes ago I bumped into this apparent severe regression, with 4.15-rc4 
and an SSD PLEXTOR PX-256M5S. If, with none as I/O scheduler, I do
fio --name=global --rw=randread --size=512m --name=job1

I get
read : io=524288KB, bw=34402KB/s, iops=8600, runt= 15240msec

This device had to reach at least 40 KIOPS. I retried with mq-deadline too, and 
then with a 4.14, same outcome in all cases. I asked some colleague to repeat 
the test with his/her device. Same outcome, so far, also with an SSD 850 EVO 
250GB.

I'm sorry if I'm making some colossal mistake.

Thanks,
Paolo





[PATCH IMPROVEMENT] block, bfq: increase threshold to deem I/O as random

2017-12-20 Thread Paolo Valente
If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ (and CFQ's) nomenclature, then BFQ merges their
associated bfq_queues, so as to get sequential I/O from the union of
the I/O requests of the processes, and thus reach a higher
throughput. A merged queue is then split if its I/O stops being
sequential. In this respect, BFQ deems the I/O of a bfq_queue as
(mostly) sequential only if less than 4 I/O requests are random, out
of the last 32 requests inserted into the queue.

Unfortunately, extensive testing (with the interleaved_io benchmark of
the S suite [1], and with real applications spawning cooperating
processes) has clearly shown that, with such a low threshold, only a
rather low I/O throughput may be reached when several cooperating
processes do I/O. In particular, the outcome of each test run was
bimodal: if queue merging occurred and was stable during the test,
then the throughput was close to the peak rate of the storage device,
otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability to get the
unlucky outcomes grew with the number of cooperating processes: it was
already significant with 5 processes, and close to one with 7 or more
processes.

The cause of the low throughput in the unlucky runs was that the
merged queues containing the I/O of these cooperating processes were
soon split, because they contained more random I/O requests than those
tolerated by the 4/32 threshold, but
- that I/O would have however allowed the storage device to reach
  peak throughput or almost peak throughput;
- in contrast, the I/O of these processes, if served individually
  (from separate queues) yielded a rather low throughput.

So we repeated our tests with increasing values of the threshold,
until we found the minimum value (19) for which we obtained maximum
throughput, reliably, with at least up to 9 cooperating
processes. Then we checked that the use of that higher threshold value
did not cause any regression for any other benchmark in the suite [1].
This commit raises the threshold to such a higher value.

[1] https://github.com/Algodev-github/S
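
In BFQ, seek_history is, roughly, a 32-bit sliding window in which each newly
inserted request contributes one bit, set if that request was seeky, and
BFQQ_SEEKY just counts the set bits. The small standalone sketch below
(invented names, not the kernel code) shows how a queue with 10 random
requests out of the last 32 is deemed seeky, and thus split, under the old
4/32 threshold, but keeps its merged queue under the new threshold of 19.

#include <stdio.h>

/* Count the set bits of a 32-bit word (what hweight32 does in the kernel) */
static int popcount32(unsigned int v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

int main(void)
{
	unsigned int seek_history = 0;
	int i;

	/* pretend 10 of the last 32 inserted requests were random (seeky) */
	for (i = 0; i < 32; i++) {
		int request_was_seeky = (i < 10);

		seek_history = (seek_history << 1) | request_was_seeky;
	}

	printf("random requests in window: %d\n", popcount32(seek_history));
	printf("seeky with old threshold (32/8 = 4): %s\n",
	       popcount32(seek_history) > 32 / 8 ? "yes (queue split)" : "no");
	printf("seeky with new threshold (19): %s\n",
	       popcount32(seek_history) > 19 ? "yes (queue split)" : "no");
	return 0;
}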

Signed-off-by: Angelo Ruocco <angeloruocc...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c66578592c9e..e33c5c4c9856 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -192,7 +192,7 @@ static struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR  (sector_t)(8 * 100)
 #define BFQQ_SECT_THR_NONROT   (sector_t)(2 * 32)
 #define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
-#define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 32/8)
+#define BFQQ_SEEKY(bfqq)   (hweight32(bfqq->seek_history) > 19)
 
 /* Min number of samples required to perform peak-rate update */
 #define BFQ_RATE_MIN_SAMPLES   32
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 3/4] block, bfq: let a queue be merged only shortly after starting I/O

2017-12-20 Thread Paolo Valente
In BFQ and CFQ, two processes are said to be cooperating if they do
I/O in such a way that the union of their I/O requests yields a
sequential I/O pattern. To get such a sequential I/O pattern out of
the non-sequential pattern of each cooperating process, BFQ and CFQ
merge the queues associated with these processes. In more detail,
cooperating processes, and thus their associated queues, usually
start, or restart, to do I/O shortly after each other. This is the
case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Based on this assumption, this commit allows a bfq_queue to
be merged only during a short time interval (100ms) after it starts,
or re-starts, to do I/O.  This filtering provides two important
benefits.
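
As a side note on the mechanics, the 100ms window is expressed in
jiffies in the patch below (HZ jiffies elapse per second, so HZ/10 is
~100ms), and the expiry test is wraparound-safe. A minimal sketch,
simplified from the patch and not identical to it:

#include <linux/jiffies.h>
#include <linux/types.h>

static const unsigned long merge_time_limit = HZ / 10; /* ~100 ms */

/*
 * time_is_before_jiffies(t) expands to time_after(jiffies, t), i.e. it
 * becomes true once the current time has moved past t, handling jiffies
 * wraparound correctly.
 */
static bool too_late_for_merging(unsigned long first_io_time)
{
        return time_is_before_jiffies(first_io_time + merge_time_limit);
}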

First, it greatly reduces the probability that two non-cooperating
processes have their queues merged by mistake, if they just happen to
do I/O close to each other for a short time interval. These spurious
merges cause loss of service guarantees. A low-weight bfq_queue may
unjustly get more than its expected share of the throughput: if such a
low-weight queue is merged with a high-weight queue, then the I/O for
the low-weight queue is served as if the queue had a high weight. This
may damage other high-weight queues unexpectedly.  For instance,
because of this issue, lxterminal occasionally took 7.5 seconds to
start, instead of 6.5 seconds, when some sequential readers and
writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
reason is that the bfq_queues associated with some of the readers or
the writers were merged with the high-weight queues of some processes
that had to do some urgent but little I/O. The readers then exploited
the inherited high weight for all or most of their I/O, during the
start-up of the terminal. The filtering introduced by this commit
eliminated any outlier caused by spurious queue merges in our start-up
time tests.

This filtering also provides a little boost of the throughput
sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
once a bfq_queue cannot be merged any longer, this commit makes BFQ
stop updating the data needed to handle merging for the queue.

Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocc...@gmail.com>
---
 block/bfq-iosched.c | 57 ++---
 block/bfq-iosched.h |  2 ++
 block/bfq-wf2q.c    |  4
 3 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f58367ef98c1..320022135dc8 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -166,6 +166,20 @@ static const int bfq_async_charge_factor = 10;
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 const int bfq_timeout = HZ / 8;
 
+/*
+ * Time limit for merging (see comments in bfq_setup_cooperator). Set
+ * to the slowest value that, in our tests, proved to be effective in
+ * removing false positives, while not causing true positives to miss
+ * queue merging.
+ *
+ * As can be deduced from the low time limit below, queue merging, if
+ * successful, happens at the very beginning of the I/O of the involved
+ * cooperating processes, as a consequence of the arrival of the very
+ * first requests from each cooperator.  After that, there is very
+ * little chance to find cooperators.
+ */
+static const unsigned long bfq_merge_time_limit = HZ/10;
+
 static struct kmem_cache *bfq_pool;
 
 /* Below this threshold (in ns), we consider thinktime immediate. */
@@ -444,6 +458,13 @@ bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
return bfqq;
 }
 
+static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
+{
+   return bfqq->service_from_backlogged > 0 &&
+   time_is_before_jiffies(bfqq->first_IO_time +
+  bfq_merge_time_limit);
+}
+
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
struct rb_node **p, *parent;
@@ -454,6 +475,14 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
bfqq->pos_root = NULL;
}
 
+   /*
+* bfqq cannot be merged any longer (see comments in
+* bfq_setup_cooperator): no point in adding bfqq into the
+* position tree.
+*/
+   if (bfq_too_late_for_merging(bfqq))
+   return;
+
if (bfq_class_idle(bfqq))
return;
if (!bfqq->next_rq)
@@ -1935,6 +1964,9 @@ bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
 static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
struct bfq_queue *new_bfqq)
 {
+   if (bfq_too_late_for_merging(new_bfqq))
+   return false;
+
if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
(bfqq->ioprio_class != new_bfqq->ioprio_class))


[PATCH IMPROVEMENT/BUGFIX 1/4] block, bfq: add missing rq_pos_tree update on rq removal

2017-12-20 Thread Paolo Valente
If two processes do I/O close to each other, then BFQ merges the
bfq_queues associated with these processes, to get more sequential
I/O, and thus a higher throughput.  In this respect, to detect whether
two processes are doing I/O close to each other, BFQ keeps a list of
the head-of-line I/O requests of all active bfq_queues.  The list is
ordered by initial sectors, and implemented through a red-black tree
(rq_pos_tree).
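
As a conceptual sketch of such a position tree (illustrative names and a
reduced node structure, not the exact BFQ code, which walks the tree in
bfq_rq_pos_tree_lookup()):

#include <linux/rbtree.h>
#include <linux/types.h>

/* One node per active queue, keyed by the sector of its head-of-line request. */
struct pos_node {
        struct rb_node rb;
        sector_t head_sector;
};

/*
 * Return the queue whose head-of-line request starts at 'sector', or,
 * failing that, the last node visited, which is a close neighbour in
 * sector order (NULL only if the tree is empty).
 */
static struct pos_node *pos_tree_lookup(struct rb_root *root, sector_t sector)
{
        struct rb_node *n = root->rb_node;
        struct pos_node *q = NULL;

        while (n) {
                q = rb_entry(n, struct pos_node, rb);
                if (sector < q->head_sector)
                        n = n->rb_left;
                else if (sector > q->head_sector)
                        n = n->rb_right;
                else
                        break;  /* exact match */
        }
        return q;
}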

Unfortunately, the update of the rq_pos_tree was incomplete, because
the tree was not updated on the removal of the head-of-line I/O
request of a bfq_queue, in case the queue did not remain empty. This
commit adds the missing update.

Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
Signed-off-by: Angelo Ruocco <angeloruocc...@gmail.com>
---
 block/bfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a9e06217e64d..c6f0b93a769c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1627,6 +1627,8 @@ static void bfq_remove_request(struct request_queue *q,
rb_erase(&bfqq->pos_node, bfqq->pos_root);
bfqq->pos_root = NULL;
}
+   } else {
+   bfq_pos_tree_add_move(bfqd, bfqq);
}
 
if (rq->cmd_flags & REQ_META)
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 4/4] block, bfq: remove superfluous check in queue-merging setup

2017-12-20 Thread Paolo Valente
From: Angelo Ruocco <angeloruocc...@gmail.com>

When two or more processes do I/O in such a way that their requests are
sequential with respect to one another, BFQ merges the bfq_queues associated
with the processes. This way the overall I/O pattern becomes sequential,
and thus there is a boost in throughput.
These cooperating processes usually start or restart to do I/O shortly
after each other. So, in order to avoid merging non-cooperating processes,
BFQ ensures that none of these queues has been in weight raising for too
long.

In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
only shortly after being created", BFQ checks whether any queue (and not
only weight-raised ones) is doing I/O continuously for too long to be
merged.

This new additional check makes the first one redundant: a queue that
has been doing I/O for long enough, if weight-raised, has also been in
weight raising for too long to be merged. Accordingly, this commit
removes the first check.

Signed-off-by: Angelo Ruocco <angeloruocc...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.com>
---
 block/bfq-iosched.c | 36 +---
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 320022135dc8..c66578592c9e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1990,20 +1990,6 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
return true;
 }
 
-/*
- * If this function returns true, then bfqq cannot be merged. The idea
- * is that true cooperation happens very early after processes start
- * to do I/O. Usually, late cooperations are just accidental false
- * positives. In case bfqq is weight-raised, such false positives
- * would evidently degrade latency guarantees for bfqq.
- */
-static bool wr_from_too_long(struct bfq_queue *bfqq)
-{
-   return bfqq->wr_coeff > 1 &&
-   time_is_before_jiffies(bfqq->last_wr_start_finish +
-  msecs_to_jiffies(100));
-}
-
 /*
  * Attempt to schedule a merge of bfqq with the currently in-service
  * queue or with a close queue among the scheduled queues.  Return
@@ -2017,11 +2003,6 @@ static bool wr_from_too_long(struct bfq_queue *bfqq)
  * to maintain. Besides, in such a critical condition as an out of memory,
  * the benefits of queue merging may be little relevant, or even negligible.
  *
- * Weight-raised queues can be merged only if their weight-raising
- * period has just started. In fact cooperating processes are usually
- * started together. Thus, with this filter we avoid false positives
- * that would jeopardize low-latency guarantees.
- *
  * WARNING: queue merging may impair fairness among non-weight raised
  * queues, for at least two reasons: 1) the original weight of a
  * merged queue may change during the merged state, 2) even being the
@@ -2052,9 +2033,7 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
if (bfqq->new_bfqq)
return bfqq->new_bfqq;
 
-   if (!io_struct ||
-   wr_from_too_long(bfqq) ||
-   unlikely(bfqq == &bfqd->oom_bfqq))
+   if (!io_struct || unlikely(bfqq == &bfqd->oom_bfqq))
return NULL;
 
/* If there is only one backlogged queue, don't search. */
@@ -2063,12 +2042,9 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
in_service_bfqq = bfqd->in_service_queue;
 
-   if (!in_service_bfqq || in_service_bfqq == bfqq
-   || wr_from_too_long(in_service_bfqq) ||
-   unlikely(in_service_bfqq == &bfqd->oom_bfqq))
-   goto check_scheduled;
-
-   if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+   if (in_service_bfqq && in_service_bfqq != bfqq &&
+   likely(in_service_bfqq != &bfqd->oom_bfqq) &&
+   bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
bfqq->entity.parent == in_service_bfqq->entity.parent &&
bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
@@ -2080,12 +2056,10 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 * queues. The only thing we need is that the bio/request is not
 * NULL, as we need it to establish whether a cooperator exists.
 */
-check_scheduled:
new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,
bfq_io_struct_pos(io_struct, request));
 
-   if (new_bfqq && !wr_from_too_long(new_bfqq) &&
-   likely(new_bfqq != &bfqd->oom_bfqq) &&
+   if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) &&
bfq_may_be_close_cooperator(bfqq, new_bfqq))
return bfq_setup_merge(bfqq, new_bfqq);
 
-- 
2.15.1



[PATCH IMPROVEMENT/BUGFIX 0/4] remove start-up time outlier caused by wrong detection of cooperating processes

2017-12-20 Thread Paolo Valente
Hi,
the main patch in this series ("block, bfq: let a queue be merged only
shortly after starting I/O") eliminates an outlier in the application
start-up time guaranteed by BFQ. This outlier occurs more or less
frequently, as a function of the characteristics of the system, and is
caused by a wrong detection of so-called cooperating processes
(details in the commit message).  This main patch is preceded by two
patches that fix two bugs found while working on this problem.  The
patch is then followed by a further optimization patch, which removes
an operation made superfluous by the main patch.

Jens, I've not forgotten our decision to make a patch that allows blkio
stats (related to the proportional-share policy) not to be enabled at
boot, or when the CFQ or BFQ modules are loaded. It's just that we have
already prepared the present patches, plus a few other patches for
improving BFQ and fixing bugs, and I'd like to clear this backlog first.

In this respect, after a patch for boosting throughput more reliably
with cooperating processes, I'll send out a patch to solve an
important read starvation problem. If I'm not making a blunder, this
problem affects every I/O scheduler in blk-mq. As a first step, I'll
propose a fix for BFQ. If the fix is ok, I'm willing to port it to the
other schedulers.

Thanks,
Paolo

Angelo Ruocco (2):
  block, bfq: check low_latency flag in bfq_bfqq_save_state()
  block, bfq: remove superfluous check in queue-merging setup

Paolo Valente (2):
  block, bfq: add missing rq_pos_tree update on rq removal
  block, bfq: let a queue be merged only shortly after starting I/O

 block/bfq-iosched.c | 98 ++---
 block/bfq-iosched.h |  2 ++
 block/bfq-wf2q.c    |  4 +++
 3 files changed, 61 insertions(+), 43 deletions(-)

--
2.15.1


[PATCH IMPROVEMENT/BUGFIX 2/4] block, bfq: check low_latency flag in bfq_bfqq_save_state()

2017-12-20 Thread Paolo Valente
From: Angelo Ruocco <angeloruocc...@gmail.com>

A just-created bfq_queue will certainly be deemed as interactive on
the arrival of its first I/O request, if the low_latency flag is
set. Yet, if the queue is merged with another queue on the arrival of
its first I/O request, it will not have the chance to be flagged as
interactive. Nevertheless, if the queue is then split soon enough, it
has to be flagged as interactive after the split.

To handle this early-merge scenario correctly, BFQ saves the state of
the queue, on the merge, as if the latter had already been deemed
interactive. So, if the queue is split soon, it will get
weight-raised, because the previous state of the queue is resumed on
the split.
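
Conceptually, the save/resume path involved here can be sketched as
follows (made-up names and a single flag, standing in for the several
per-process bic->saved_* fields that BFQ actually saves):

#include <stdbool.h>

/* State preserved across a merge/split cycle of a queue (illustrative). */
struct saved_state {
        bool deemed_interactive;
};

/*
 * On merge: remember whether the queue would have been weight-raised.
 * The fix below adds the low_latency condition to this decision.
 */
static void save_state(struct saved_state *s, bool just_created,
                       bool in_large_burst, bool low_latency)
{
        s->deemed_interactive = just_created && !in_large_burst && low_latency;
}

/* On split: resume the saved decision. */
static void resume_state(const struct saved_state *s, unsigned int *wr_coeff)
{
        if (s->deemed_interactive)
                *wr_coeff = 30; /* illustrative weight-raising coefficient */
}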

Unfortunately, in the act of saving the state of the newly-created
queue, BFQ doesn't check whether the low_latency flag is set, and this
causes early-merged queues to be then weight-raised, on queue splits,
even if low_latency is off. This commit addresses this problem by
adding the missing check.

Signed-off-by: Angelo Ruocco <angeloruocc...@gmail.com>
Signed-off-by: Paolo Valente <paolo.vale...@linaro.org>
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c6f0b93a769c..f58367ef98c1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2064,7 +2064,8 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
if (unlikely(bfq_bfqq_just_created(bfqq) &&
-!bfq_bfqq_in_large_burst(bfqq))) {
+!bfq_bfqq_in_large_burst(bfqq) &&
+bfqq->bfqd->low_latency)) {
/*
 * bfqq being merged right after being created: bfqq
 * would have deserved interactive weight raising, but
-- 
2.15.1


