[PATCH BUGFIX/IMPROVEMENT V2 1/3] block, bfq: make lookup_next_entity push up vtime on expirations

2017-08-30 Thread Paolo Valente
To provide a very smooth service, bfq starts to serve a bfq_queue
only if the queue is 'eligible', i.e., if the same queue would
have started to be served in the ideal, perfectly fair system that
bfq simulates internally. This is obtained by associating each
queue with a virtual start time, and by computing a special system
virtual time quantity: a queue is eligible only if the system
virtual time has reached the virtual start time of the
queue. Finally, bfq guarantees that, when a new queue must be set
in service, there is always at least one eligible entity for each
active parent entity in the scheduler. To provide this guarantee,
the function __bfq_lookup_next_entity pushes up, for each parent
entity on which it is invoked, the system virtual time to the
minimum among the virtual start times of the entities in the
active tree for the parent entity (more precisely, the push up
occurs if the system virtual time happens to be lower than all
such virtual start times).
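For reference, a minimal sketch of the two rules described above, assuming
the bfq_gt() timestamp comparison and the start/vtime fields used in the
diffs below (the helper names here are illustrative, not the actual
bfq-wf2q.c functions):

  /* An entity is eligible once the system virtual time of its service
   * tree has reached the entity's virtual start time. */
  static bool bfq_entity_is_eligible(struct bfq_entity *entity,
                                     struct bfq_service_tree *st)
  {
          return !bfq_gt(entity->start, st->vtime);
  }

  /* Push up st->vtime if it lags behind the smallest virtual start time
   * among the entities in the active tree. */
  static void bfq_push_up_vtime(struct bfq_service_tree *st, u64 min_start)
  {
          if (bfq_gt(min_start, st->vtime))
                  st->vtime = min_start;
  }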

There is however a circumstance in which __bfq_lookup_next_entity
cannot push up the system virtual time for a parent entity, even
if the system virtual time is lower than the virtual start times
of all the child entities in the active tree. It happens if one of
the child entities is in service. In fact, in such a case, there
is already an eligible entity, the in-service one, even if it may
not be present in the active tree (because in-service entities
may be removed from the active tree).

Unfortunately, in the last re-design of the
hierarchical-scheduling engine, the reset of the pointer to the
in-service entity for a given parent entity--reset to be done as a
consequence of the expiration of the in-service entity--always
happens after the function __bfq_lookup_next_entity has been
invoked. This causes the function to think that there is still an
entity in service for the parent entity, and then that the system
virtual time cannot be pushed up, even if actually such a
no-more-in-service entity has already been properly reinserted
into the active tree (or in some other tree if no more
active). Yet, the system virtual time *had* to be pushed up, to be
ready to correctly choose the next queue to serve. Because of the
lack of this push up, bfq may wrongly set in service a queue that
had been speculatively pre-computed as the possible
next-in-service queue, but that would no longer be the one to serve
after the expiration and the reinsertion into the active trees of
the previously in-service entities.

This commit addresses this issue by making
__bfq_lookup_next_entity properly push up the system virtual time
if an expiration is occurring.

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-iosched.c |  4 ++--
 block/bfq-iosched.h |  4 ++--
 block/bfq-wf2q.c    | 58 +++--
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..a10f147 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -720,7 +720,7 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
 new_budget);
-   bfq_requeue_bfqq(bfqd, bfqq);
+   bfq_requeue_bfqq(bfqd, bfqq, false);
}
 }
 
@@ -2563,7 +2563,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
 
bfq_del_bfqq_busy(bfqd, bfqq, true);
} else {
-   bfq_requeue_bfqq(bfqd, bfqq);
+   bfq_requeue_bfqq(bfqd, bfqq, true);
/*
 * Resort priority tree of potential close cooperators.
 */
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 859f0a8..3e2659c 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -817,7 +817,6 @@ extern const int bfq_timeout;
 struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
 void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
  struct rb_root *root);
@@ -917,7 +916,8 @@ void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
 void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 bool ins_into_idle_tree, bool expiration);
 void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool expiration);
 void bfq_del_bfqq_busy(struct bfq_data *bfqd

[PATCH BUGFIX/IMPROVEMENT V2 2/3] block, bfq: remove direct switch to an entity in higher class

2017-08-30 Thread Paolo Valente
If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, and finds out
that E belongs to a higher-priority class than that of the current
next-in-service entity, then it sets next_in_service directly to
E. But this may lead to anomalous schedules, because E may happen not
to be eligible for service, as its virtual start time is higher than
the system virtual time for its service tree.

This commit addresses this issue by simply removing this direct
switch.

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-wf2q.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 138732e..eeaf326 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -86,9 +86,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 * or repositiong of an entity that does not coincide with
 * sd->next_in_service, then a full lookup in the active tree
 * can be avoided. In fact, it is enough to check whether the
-* just-modified entity has a higher priority than
-* sd->next_in_service, or, even if it has the same priority
-* as sd->next_in_service, is eligible and has a lower virtual
+* just-modified entity has the same priority as
+* sd->next_in_service, is eligible and has a lower virtual
 * finish time than sd->next_in_service. If this compound
 * condition holds, then the new entity becomes the new
 * next_in_service. Otherwise no change is needed.
@@ -104,9 +103,8 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
 
/*
 * If there is already a next_in_service candidate
-* entity, then compare class priorities or timestamps
-* to decide whether to replace sd->service_tree with
-* new_entity.
+* entity, then compare timestamps to decide whether
+* to replace sd->service_tree with new_entity.
 */
if (next_in_service) {
unsigned int new_entity_class_idx =
@@ -114,10 +112,6 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
struct bfq_service_tree *st =
sd->service_tree + new_entity_class_idx;
 
-   /*
-* For efficiency, evaluate the most likely
-* sub-condition first.
-*/
replace_next =
(new_entity_class_idx ==
 bfq_class_idx(next_in_service)
@@ -125,10 +119,7 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
 !bfq_gt(new_entity->start, st->vtime)
 &&
 bfq_gt(next_in_service->finish,
-   new_entity->finish))
-   ||
-   new_entity_class_idx <
-   bfq_class_idx(next_in_service);
+   new_entity->finish));
}
 
if (replace_next)
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT V2 3/3] block, bfq: guarantee update_next_in_service always returns an eligible entity

2017-08-30 Thread Paolo Valente
If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, then it doesn't
invoke bfq_lookup_next_entity to get the next-in-service entity. In
contrast, it follows a shorter path: if E happens to be eligible (see
commit "bfq-sq-mq: make lookup_next_entity push up vtime on
expirations" for details on eligibility) and to have a lower virtual
finish time than the current candidate as next-in-service entity, then
E directly becomes the next-in-service entity. Unfortunately, there is
a corner case for which this shorter path makes
bfq_update_next_in_service choose a non eligible entity: it occurs if
both E and the current next-in-service entity happen to be non
eligible when bfq_update_next_in_service is invoked. In this case, E
is not set as next-in-service, and, since bfq_lookup_next_entity is
not invoked, the state of the parent entity is not updated so as to
end up with an eligible entity as the proper next-in-service entity.

In this respect, next-in-service is actually allowed to be non
eligible while some queue is in service: since no system-virtual-time
push-up can be performed in that case (see again commit "bfq-sq-mq:
make lookup_next_entity push up vtime on expirations" for details),
next-in-service is chosen, speculatively, as a function of the
possible value that the system virtual time may get after a push
up. But the correctness of the schedule breaks if next-in-service is
still a non eligible entity when it is time to set in service the next
entity. Unfortunately, this may happen in the above corner case.

This commit fixes this problem by making bfq_update_next_in_service
invoke bfq_lookup_next_entity not only if the above shorter path
cannot be taken, but also if the shorter path is taken but fails to
yield an eligible next-in-service entity.
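In other words, the shorter path is now taken only when it actually yields
an eligible candidate; otherwise a full lookup runs. A condensed sketch of
the intended flow (the real change is in the diff below;
entity_eligible_and_earlier() is a hypothetical helper standing in for the
compound condition on class, eligibility and finish time):

  bool change_without_lookup = false;

  if (new_entity) {       /* activation or requeueing */
          change_without_lookup =
                  entity_eligible_and_earlier(new_entity, next_in_service);
          if (change_without_lookup)
                  next_in_service = new_entity;
  }

  if (!change_without_lookup)     /* shorter path not taken, or not eligible */
          next_in_service = bfq_lookup_next_entity(sd, expiration);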

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksandr Natalenko 
---
 block/bfq-wf2q.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index eeaf326..add54f2 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -80,6 +80,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 {
struct bfq_entity *next_in_service = sd->next_in_service;
bool parent_sched_may_change = false;
+   bool change_without_lookup = false;
 
/*
 * If this update is triggered by the activation, requeueing
@@ -99,7 +100,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 * set to true, and left as true if
 * sd->next_in_service is NULL.
 */
-   bool replace_next = true;
+   change_without_lookup = true;
 
/*
 * If there is already a next_in_service candidate
@@ -112,7 +113,7 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
struct bfq_service_tree *st =
sd->service_tree + new_entity_class_idx;
 
-   replace_next =
+   change_without_lookup =
(new_entity_class_idx ==
 bfq_class_idx(next_in_service)
 &&
@@ -122,15 +123,16 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
new_entity->finish));
}
 
-   if (replace_next)
+   if (change_without_lookup)
next_in_service = new_entity;
-   } else /* invoked because of a deactivation: lookup needed */
+   }
+
+   if (!change_without_lookup) /* lookup needed */
next_in_service = bfq_lookup_next_entity(sd, expiration);
 
-   if (next_in_service) {
+   if (next_in_service)
parent_sched_may_change = !sd->next_in_service ||
bfq_update_parent_budget(next_in_service);
-   }
 
sd->next_in_service = next_in_service;
 
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT V2 0/3] three bfq fixes restoring service guarantees with random sync writes in bg

2017-08-30 Thread Paolo Valente
[SECOND TAKE, with just the name of one of the testers fixed]

Hi,
while testing the read-write unfairness issues reported by Mel, I
found BFQ failing to guarantee good responsiveness against heavy
random sync writes in the background, i.e., multiple writers doing
random writes and systematic fdatasync [1]. The failure was caused by
three related bugs, because of which BFQ failed to guarantee to
high-weight processes the expected fraction of the throughput.
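For reference, each background writer in such a workload behaves roughly
like the following userspace sketch (illustrative only; see [1] for the
original report):

  #include <sys/types.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* One background writer: random 4k writes, each followed by fdatasync. */
  int main(void)
  {
          const off_t file_blocks = 262144;       /* 1 GiB of 4k blocks */
          char buf[4096] = { 0 };
          int fd = open("writer.dat", O_CREAT | O_RDWR, 0644);

          if (fd < 0)
                  return 1;
          for (;;) {
                  off_t off = (random() % file_blocks) * 4096;

                  if (pwrite(fd, buf, sizeof(buf), off) < 0)
                          break;
                  fdatasync(fd);  /* systematic fdatasync after each write */
          }
          close(fd);
          return 0;
  }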

The three patches in this series fix these bugs. These fixes restore
the usual BFQ service guarantees (and thus optimal responsiveness
too), against the above background workload and, probably, against
other similar workloads.

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/9/957

Paolo Valente (3):
  block, bfq: make lookup_next_entity push up vtime on expirations
  block, bfq: remove direct switch to an entity in higher class
  block, bfq: guarantee update_next_in_service always returns an
eligible entity

 block/bfq-iosched.c |  4 +--
 block/bfq-iosched.h |  4 +--
 block/bfq-wf2q.c    | 91 -
 3 files changed, 60 insertions(+), 39 deletions(-)

--
2.10.0


Re: [PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg

2017-08-30 Thread Paolo Valente

> On 31 Aug 2017, at 08:41, oleksa...@natalenko.name wrote:
> 
>> Tested-by: Oleksander Natalenko 
> 
> I'm "Oleksandr" :).
> 

Sorry, resending ...

Paolo

> 31.08.2017 08:10, Paolo Valente wrote:
>> Hi,
>> while testing the read-write unfairness issues reported by Mel, I
>> found BFQ failing to guarantee good responsiveness against heavy
>> random sync writes in the background, i.e., multiple writers doing
>> random writes and systematic fdatasync [1]. The failure was caused by
>> three related bugs, because of which BFQ failed to guarantee to
>> high-weight processes the expected fraction of the throughput.
>> The three patches in this series fix these bugs. These fixes restore
>> the usual BFQ service guarantees (and thus optimal responsiveness
>> too), against the above background workload and, probably, against
>> other similar workloads.
>> Thanks,
>> Paolo
>> [1] https://lkml.org/lkml/2017/8/9/957
>> Paolo Valente (3):
>>  block, bfq: make lookup_next_entity push up vtime on expirations
>>  block, bfq: remove direct switch to an entity in higher class
>>  block, bfq: guarantee update_next_in_service always returns an
>>eligible entity
>> block/bfq-iosched.c |  4 +--
>> block/bfq-iosched.h |  4 +--
>> block/bfq-wf2q.c    | 91 -
>> 3 files changed, 60 insertions(+), 39 deletions(-)
>> --
>> 2.10.0



Re: [PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg

2017-08-30 Thread oleksandr

Tested-by: Oleksander Natalenko 


I'm "Oleksandr" :).

31.08.2017 08:10, Paolo Valente wrote:

Hi,
while testing the read-write unfairness issues reported by Mel, I
found BFQ failing to guarantee good responsiveness against heavy
random sync writes in the background, i.e., multiple writers doing
random writes and systematic fdatasync [1]. The failure was caused by
three related bugs, because of which BFQ failed to guarantee to
high-weight processes the expected fraction of the throughput.

The three patches in this series fix these bugs. These fixes restore
the usual BFQ service guarantees (and thus optimal responsiveness
too), against the above background workload and, probably, against
other similar workloads.

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/9/957

Paolo Valente (3):
  block, bfq: make lookup_next_entity push up vtime on expirations
  block, bfq: remove direct switch to an entity in higher class
  block, bfq: guarantee update_next_in_service always returns an
eligible entity

 block/bfq-iosched.c |  4 +--
 block/bfq-iosched.h |  4 +--
 block/bfq-wf2q.c    | 91 -

 3 files changed, 60 insertions(+), 39 deletions(-)

--
2.10.0


[PATCH BUGFIX/IMPROVEMENT 3/3] block, bfq: guarantee update_next_in_service always returns an eligible entity

2017-08-30 Thread Paolo Valente
If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, then it doesn't
invoke bfq_lookup_next_entity to get the next-in-service entity. In
contrast, it follows a shorter path: if E happens to be eligible (see
commit "bfq-sq-mq: make lookup_next_entity push up vtime on
expirations" for details on eligibility) and to have a lower virtual
finish time than the current candidate as next-in-service entity, then
E directly becomes the next-in-service entity. Unfortunately, there is
a corner case for which this shorter path makes
bfq_update_next_in_service choose a non eligible entity: it occurs if
both E and the current next-in-service entity happen to be non
eligible when bfq_update_next_in_service is invoked. In this case, E
is not set as next-in-service, and, since bfq_lookup_next_entity is
not invoked, the state of the parent entity is not updated so as to
end up with an eligible entity as the proper next-in-service entity.

In this respect, next-in-service is actually allowed to be non
eligible while some queue is in service: since no system-virtual-time
push-up can be performed in that case (see again commit "bfq-sq-mq:
make lookup_next_entity push up vtime on expirations" for details),
next-in-service is chosen, speculatively, as a function of the
possible value that the system virtual time may get after a push
up. But the correctness of the schedule breaks if next-in-service is
still a non eligible entity when it is time to set in service the next
entity. Unfortunately, this may happen in the above corner case.

This commit fixes this problem by making bfq_update_next_in_service
invoke bfq_lookup_next_entity not only if the above shorter path
cannot be taken, but also if the shorter path is taken but fails to
yield an eligible next-in-service entity.

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksander Natalenko 
---
 block/bfq-wf2q.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index eeaf326..add54f2 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -80,6 +80,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 {
struct bfq_entity *next_in_service = sd->next_in_service;
bool parent_sched_may_change = false;
+   bool change_without_lookup = false;
 
/*
 * If this update is triggered by the activation, requeueing
@@ -99,7 +100,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 * set to true, and left as true if
 * sd->next_in_service is NULL.
 */
-   bool replace_next = true;
+   change_without_lookup = true;
 
/*
 * If there is already a next_in_service candidate
@@ -112,7 +113,7 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
struct bfq_service_tree *st =
sd->service_tree + new_entity_class_idx;
 
-   replace_next =
+   change_without_lookup =
(new_entity_class_idx ==
 bfq_class_idx(next_in_service)
 &&
@@ -122,15 +123,16 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
new_entity->finish));
}
 
-   if (replace_next)
+   if (change_without_lookup)
next_in_service = new_entity;
-   } else /* invoked because of a deactivation: lookup needed */
+   }
+
+   if (!change_without_lookup) /* lookup needed */
next_in_service = bfq_lookup_next_entity(sd, expiration);
 
-   if (next_in_service) {
+   if (next_in_service)
parent_sched_may_change = !sd->next_in_service ||
bfq_update_parent_budget(next_in_service);
-   }
 
sd->next_in_service = next_in_service;
 
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT 1/3] block, bfq: make lookup_next_entity push up vtime on expirations

2017-08-30 Thread Paolo Valente
To provide a very smooth service, bfq starts to serve a bfq_queue
only if the queue is 'eligible', i.e., if the same queue would
have started to be served in the ideal, perfectly fair system that
bfq simulates internally. This is obtained by associating each
queue with a virtual start time, and by computing a special system
virtual time quantity: a queue is eligible only if the system
virtual time has reached the virtual start time of the
queue. Finally, bfq guarantees that, when a new queue must be set
in service, there is always at least one eligible entity for each
active parent entity in the scheduler. To provide this guarantee,
the function __bfq_lookup_next_entity pushes up, for each parent
entity on which it is invoked, the system virtual time to the
minimum among the virtual start times of the entities in the
active tree for the parent entity (more precisely, the push up
occurs if the system virtual time happens to be lower than all
such virtual start times).

There is however a circumstance in which __bfq_lookup_next_entity
cannot push up the system virtual time for a parent entity, even
if the system virtual time is lower than the virtual start times
of all the child entities in the active tree. It happens if one of
the child entities is in service. In fact, in such a case, there
is already an eligible entity, the in-service one, even if it may
not be present in the active tree (because in-service entities
may be removed from the active tree).

Unfortunately, in the last re-design of the
hierarchical-scheduling engine, the reset of the pointer to the
in-service entity for a given parent entity--reset to be done as a
consequence of the expiration of the in-service entity--always
happens after the function __bfq_lookup_next_entity has been
invoked. This causes the function to think that there is still an
entity in service for the parent entity, and then that the system
virtual time cannot be pushed up, even if actually such a
no-more-in-service entity has already been properly reinserted
into the active tree (or in some other tree if no more
active). Yet, the system virtual time *had* to be pushed up, to be
ready to correctly choose the next queue to serve. Because of the
lack of this push up, bfq may wrongly set in service a queue that
had been speculatively pre-computed as the possible
next-in-service queue, but that would no longer be the one to serve
after the expiration and the reinsertion into the active trees of
the previously in-service entities.

This commit addresses this issue by making
__bfq_lookup_next_entity properly push up the system virtual time
if an expiration is occurring.

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksander Natalenko 
---
 block/bfq-iosched.c |  4 ++--
 block/bfq-iosched.h |  4 ++--
 block/bfq-wf2q.c    | 58 +++--
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..a10f147 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -720,7 +720,7 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
 new_budget);
-   bfq_requeue_bfqq(bfqd, bfqq);
+   bfq_requeue_bfqq(bfqd, bfqq, false);
}
 }
 
@@ -2563,7 +2563,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, 
struct bfq_queue *bfqq)
 
bfq_del_bfqq_busy(bfqd, bfqq, true);
} else {
-   bfq_requeue_bfqq(bfqd, bfqq);
+   bfq_requeue_bfqq(bfqd, bfqq, true);
/*
 * Resort priority tree of potential close cooperators.
 */
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 859f0a8..3e2659c 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -817,7 +817,6 @@ extern const int bfq_timeout;
 struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
 void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
  struct rb_root *root);
@@ -917,7 +916,8 @@ void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
 void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 bool ins_into_idle_tree, bool expiration);
 void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ bool expiration);
 void bfq_del_bfqq_busy(struct bfq_data *bfq

[PATCH BUGFIX/IMPROVEMENT 2/3] block, bfq: remove direct switch to an entity in higher class

2017-08-30 Thread Paolo Valente
If the function bfq_update_next_in_service is invoked as a consequence
of the activation or requeueing of an entity, say E, and finds out
that E belongs to a higher-priority class than that of the current
next-in-service entity, then it sets next_in_service directly to
E. But this may lead to anomalous schedules, because E may happen not
to be eligible for service, as its virtual start time is higher than
the system virtual time for its service tree.

This commit addresses this issue by simply removing this direct
switch.

Signed-off-by: Paolo Valente 
Tested-by: Lee Tibbert 
Tested-by: Oleksander Natalenko 
---
 block/bfq-wf2q.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 138732e..eeaf326 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -86,9 +86,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data 
*sd,
 * or repositiong of an entity that does not coincide with
 * sd->next_in_service, then a full lookup in the active tree
 * can be avoided. In fact, it is enough to check whether the
-* just-modified entity has a higher priority than
-* sd->next_in_service, or, even if it has the same priority
-* as sd->next_in_service, is eligible and has a lower virtual
+* just-modified entity has the same priority as
+* sd->next_in_service, is eligible and has a lower virtual
 * finish time than sd->next_in_service. If this compound
 * condition holds, then the new entity becomes the new
 * next_in_service. Otherwise no change is needed.
@@ -104,9 +103,8 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
 
/*
 * If there is already a next_in_service candidate
-* entity, then compare class priorities or timestamps
-* to decide whether to replace sd->service_tree with
-* new_entity.
+* entity, then compare timestamps to decide whether
+* to replace sd->service_tree with new_entity.
 */
if (next_in_service) {
unsigned int new_entity_class_idx =
@@ -114,10 +112,6 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
struct bfq_service_tree *st =
sd->service_tree + new_entity_class_idx;
 
-   /*
-* For efficiency, evaluate the most likely
-* sub-condition first.
-*/
replace_next =
(new_entity_class_idx ==
 bfq_class_idx(next_in_service)
@@ -125,10 +119,7 @@ static bool bfq_update_next_in_service(struct 
bfq_sched_data *sd,
 !bfq_gt(new_entity->start, st->vtime)
 &&
 bfq_gt(next_in_service->finish,
-   new_entity->finish))
-   ||
-   new_entity_class_idx <
-   bfq_class_idx(next_in_service);
+   new_entity->finish));
}
 
if (replace_next)
-- 
2.10.0



[PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg

2017-08-30 Thread Paolo Valente
Hi,
while testing the read-write unfairness issues reported by Mel, I
found BFQ failing to guarantee good responsiveness against heavy
random sync writes in the background, i.e., multiple writers doing
random writes and systematic fdatasync [1]. The failure was caused by
three related bugs, because of which BFQ failed to guarantee to
high-weight processes the expected fraction of the throughput.

The three patches in this series fix these bugs. These fixes restore
the usual BFQ service guarantees (and thus optimal responsiveness
too), against the above background workload and, probably, against
other similar workloads.

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/9/957

Paolo Valente (3):
  block, bfq: make lookup_next_entity push up vtime on expirations
  block, bfq: remove direct switch to an entity in higher class
  block, bfq: guarantee update_next_in_service always returns an
eligible entity

 block/bfq-iosched.c |  4 +--
 block/bfq-iosched.h |  4 +--
 block/bfq-wf2q.c    | 91 -
 3 files changed, 60 insertions(+), 39 deletions(-)

--
2.10.0


Re: [PATCH V3 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge()

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 05:17:05PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> > -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> > -   struct request **merged_request)
> > +static bool __blk_mq_try_merge(struct request_queue *q,
> > +   struct bio *bio, struct request **merged_request,
> > +   struct request *candidate, enum elv_merge type)
> >  {
> > -   struct request *rq;
> > +   struct request *rq = candidate;
> 
> It seems weird to me that the argument 'candidate' is not used other than to
> copy its value into the local variable 'rq'? How about removing that local

OK, will simply use 'rq' as parameter.

> variable and renaming the 'candidate' argument into 'rq'? Anyway, with or
> without that change:
> 
> Reviewed-by: Bart Van Assche 

OK, thanks!

-- 
Ming


Re: [PATCH V3 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 05:11:00PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately duirng this period, run queue from other contexts
> ^^
> during?

OK.

> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> > [ ... ]
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 735e432294ab..4d7bea8c2594 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct 
> > blk_mq_hw_ctx *hctx)
> > struct request_queue *q = hctx->queue;
> > struct elevator_queue *e = q->elevator;
> > const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > -   bool do_sched_dispatch = true;
> > LIST_HEAD(rq_list);
> >  
> > /* RCU or SRCU read lock is needed before checking quiesced flag */
> 
> Shouldn't blk_mq_sched_dispatch_requests() set BLK_MQ_S_DISPATCH_BUSY just 
> after
> the following statement because this statement makes the dispatch list empty?

Actually that is what I did in V1.

I changed it to this because setting the BUSY flag there would widen the
race window a bit: for example, if one request is added to ->dispatch just
after the list is flushed, the check on the BUSY bit won't catch that case.
Setting the flag from the beginning also lets us avoid checking both the
bit and list_empty_careful(&hctx->dispatch) in
blk_mq_sched_dispatch_requests(), so the code becomes simpler and more
readable.
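The ordering being relied on is visible in the quoted hunks: the flag is
set under hctx->lock in the same critical section that populates
->dispatch, so a concurrent queue run can never see a non-empty dispatch
list with the flag still clear. Sketched in isolation (simplified, not a
complete function):

  spin_lock(&hctx->lock);
  list_splice_init(list, &hctx->dispatch);
  /*
   * DISPATCH_BUSY is set in the same critical section that makes
   * ->dispatch non-empty; it is cleared only once everything in
   * ->dispatch has been dispatched successfully.
   */
  set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
  spin_unlock(&hctx->lock);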

> 
>   list_splice_init(&hctx->dispatch, &rq_list);
> 
> > @@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct 
> > blk_mq_hw_ctx *hctx)
> >  */
> > if (!list_empty(&rq_list)) {
> > blk_mq_sched_mark_restart_hctx(hctx);
> > -   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -   } else if (!has_sched_dispatch && !q->queue_depth) {
> > +   blk_mq_dispatch_rq_list(q, &rq_list);
> > +
> > +   /*
> > +* We may clear DISPATCH_BUSY just after it
> > +* is set from another context, the only cost
> > +* is that one request is dequeued a bit early,
> > +* we can survive that. Given the window is
> > +* too small, no need to worry about performance
>^^^
> The word "too" seems extraneous to me in this sentence.

Maybe 'extremely' is better, or just remove it?

> 
> >  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> > @@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct 
> > blk_mq_hw_ctx *hctx,
> >  */
> > spin_lock(&hctx->lock);
> > list_add(&rq->queuelist, &hctx->dispatch);
> > +   set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > spin_unlock(&hctx->lock);
> > return true;
> >  }
> 
> Is it necessary to make blk_mq_sched_bypass_insert() set 
> BLK_MQ_S_DISPATCH_BUSY?
> My understanding is that only code that makes the dispatch list empty should
> set BLK_MQ_S_DISPATCH_BUSY. However, blk_mq_sched_bypass_insert() adds an 
> element
> to the dispatch list so that guarantees that that list is not empty.

I believe my above comment has explained it already.

> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f063dd0f197f..6af56a71c1cd 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> > *q, struct list_head *list)
> >  
> > spin_lock(&hctx->lock);
> > list_splice_init(list, &hctx->dispatch);
> > +   /*
> > +* DISPATCH_BUSY won't be cleared until all requests
> > +* in hctx->dispatch are dispatched successfully
> > +*/
> > +   set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> > spin_unlock(&hctx->lock);
> 
> Same comment here - since this code adds one or more requests to the dispatch 
> list,
> is it really needed to set the DISPATCH_BUSY flag?

See same comment above, :-)

-- 
Ming


Re: [PATCH V3 05/14] blk-mq-sched: improve dispatching from sw queue

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 04:34:47PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 50c6485cb04f..7b7a366a97f3 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
> >  
> > struct sbitmap  ctx_map;
> >  
> > +   struct blk_mq_ctx   *dispatch_from;
> > +
> > struct blk_mq_ctx   **ctxs;
> > unsigned intnr_ctx;
> 
> Hello Ming,
> 
> Are you relying here on the fact that the per-CPU queues are never
> reallocated, even if CPU hot-plugging occurs? Sorry but that seems fragile
> to me. I would like to see 'dispatch_from' be converted into an integer. It
> is easy to check whether an integer software queue index is out of range
> but it's not that easy to check whether a pointer became invalid.

If CPU hotplug happens, the instance of 'struct blk_mq_ctx' for that
CPU is still there and its index won't change after setup, because its
lifetime is the same as that of 'struct request_queue';
blk_mq_hctx_notify_dead() just flushes the sw queue when the CPU goes away.

So we don't need to pay special attention to CPU hotplug here. Please let
me know whether you are fine with this now.


-- 
Ming


Re: [PATCH V3 02/14] sbitmap: introduce __sbitmap_for_each_set()

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 03:55:13PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> >  /**
> >   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> > + * @start: Where to start the iteration
> 
> Thanks for having changed the name of this argument ...
> 
> > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn 
> > fn,
> > -   void *data)
> > +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> > + unsigned int start,
> > + sb_for_each_fn fn, void *data)
> >  {
> > -   unsigned int i;
> > +   unsigned int index = SB_NR_TO_INDEX(sb, start);
> > +   unsigned int nr = SB_NR_TO_BIT(sb, start);
> > +   unsigned int scanned = 0;
> 
> ... but I'm still missing a check here whether or not index >= sb->map_nr.

The reason I didn't do that is that the only user always passes a
correct 'start'.

OK, will add it.
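The added check would amount to a small guard on the resume position, along
these lines (an assumed sketch, not the actual follow-up patch):

  /* Validate the caller-provided resume position before deriving the
   * word index from it; fall back to scanning from the beginning. */
  static inline unsigned int sbitmap_sanitize_start(const struct sbitmap *sb,
                                                    unsigned int start)
  {
          return start < sb->depth ? start : 0;
  }

so that __sbitmap_for_each_set() would call it on 'start' before computing
'index' and 'nr'.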

>  
> > -   for (i = 0; i < sb->map_nr; i++) {
> > -   struct sbitmap_word *word = &sb->map[i];
> > -   unsigned int off, nr;
> > +   while (1) {
> > +   struct sbitmap_word *word = &sb->map[index];
> > +   unsigned int depth = min_t(unsigned int, word->depth - nr,
> > +  sb->depth - scanned);
> >  
> > +   scanned += depth;
> > if (!word->word)
> > -   continue;
> > +   goto next;
> >  
> > -   nr = 0;
> > -   off = i << sb->shift;
> > +   depth += nr;
> > +   start = index << sb->shift;
> 
> The above statement reuses the argument 'start' for a new purpose. This is
> confusing - please don't do this. Why not to keep the name 'off'? That will
> keep the changes minimal.

OK, I will rename the parameter 'start' to 'off', so that we can
save one local variable.

-- 
Ming


Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 03:06:16PM -0700, Shaohua Li wrote:
> On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote:
> > On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote:
> > > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> > > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > > > > From: Shaohua Li 
> > > > > > > 
> > > > > > > Currently loop disables merge. While it makes sense for buffer IO 
> > > > > > > mode,
> > > > > > > directio mode can benefit from request merge. Without merge, loop 
> > > > > > > could
> > > > > > > send small size IO to underlayer disk and harm performance.
> > > > > > 
> > > > > > Hi Shaohua,
> > > > > > 
> > > > > > IMO no matter if merge is used, loop always sends page by page
> > > > > > to VFS in both dio or buffer I/O.
> > > > > 
> > > > > Why do you think so?
> > > > 
> > > > do_blockdev_direct_IO() still handles page by page from iov_iter, and
> > > > with bigger request, I guess it might be the plug merge working.
> > > 
> > > This is not true. directio sends big size bio directly, not because of 
> > > plug
> > > merge. Please at least check the code before you complain.
> > 
> > I complain nothing, just try to understand the idea behind,
> > never mind, :-)
> > 
> > > 
> > > > >  
> > > > > > Also if merge is enabled on loop, that means merge is run
> > > > > > on both loop and low level block driver, and not sure if we
> > > > > > can benefit from that.
> > > > > 
> > > > > why does merge still happen in low level block driver?
> > > > 
> > > > Because scheduler is still working on low level disk. My question
> > > > is that why the scheduler in low level disk doesn't work now
> > > > if scheduler on loop can merge?
> > > 
> > > The low level disk can still do merge, but since this is directio, the 
> > > upper
> > > layer already dispatches request as big as possible. There is very little
> > > chance the requests can be merged again.
> > 
> > That is true, but these requests need to enter scheduler queue and
> > be tried to merge again, even though it is less possible to succeed.
> > Double merge may take extra CPU utilization.
> > 
> > Looks it doesn't answer my question.
> > 
> > Without this patch, the requests dispatched to loop won't be merged,
> > so they may be small and their sectors may be continuous, my question
> > is why dio bios converted from these small loop requests can't be
> > merged in block layer when queuing these dio bios to low level device?
> 
> The loop thread doesn't have a plug there. Even if we had a plug there, it's
> still a bad idea to do the merge in the low-level layer. If we run direct_IO
> for every 4k, the overhead is much higher than that of bio merging. The
> direct_IO path will call into fs code, take different mutexes, do metadata
> updates for writes and so on.

OK, that makes sense now.

-- 
Ming


Re: [PATCH] kernfs: checking for IS_ERR() instead of NULL

2017-08-30 Thread Tejun Heo
On Wed, Aug 30, 2017 at 05:04:56PM +0300, Dan Carpenter wrote:
> The kernfs_get_inode() returns NULL on error, it never returns error
> pointers.
> 
> Fixes: aa8188253474 ("kernfs: add exportfs operations")
> Signed-off-by: Dan Carpenter 

Acked-by: Tejun Heo 

Greg, can you please route this patch?

Thanks.

-- 
tejun


Re: [RFC] block/loop: make loop cgroup aware

2017-08-30 Thread Tejun Heo
Hello, Shaohua.

On Tue, Aug 29, 2017 at 10:07:04PM -0700, Shaohua Li wrote:
> > The association is already coming from the page.  We just need to make
> > sure that going through loop driver doesn't get in the way of the
> > membership getting propagated to the underlying device.
> 
> I think there is confusion. App writes files in upper layer fs on loop. memcg

Ah, yes, for some reason, I was still thinking about direct ios.

> establish membership for the pages of these files. Then writeback does write,
> loop then write these pages to under layer fs. The write is done in loop

Yes.

> thread. The write will allocate new page cache for under layer fs files. The
> issue is we must forward memcg info from upper layer files page cache to under
> layer files page cache.

Ah, I see.  We need to assign the ownership of the page cache pages to
the original dirtier.  Yeah, that's a different problem.  For now,
let's concentrate on the dio case.

> > So, this looks like the loop driver is explicitly forwarding the
> > association from the original issuer to the aio command and then from
> > the aio command to the ios to the underlying device.  I'm wondering
> > whether the right way to do this is making aio forward the association
> > by default, instead of the loop driver doing it explicitly.  That way,
> > all aio users can benefit from the forwarding instead of just loop.
> 
> That's possible. The downside doing this in aio is we must audit all fs to 
> make
> sure all bio have association. I'd like to avoid doing this if there is no
> other loop-like cgroup usage.

But wouldn't that mean that we break cgroup membership for aios that
users issue?  Can't we do this in the generic aio layer?

Thanks.

-- 
tejun


About btrfs on top of wb_throttle

2017-08-30 Thread Liu Bo
Hi,

While playing with btrfs on top of wb_throttle, an interesting thing
is that writeback from btrfs always falls into %rwb->wb_background
even if there are no other writers.

The peculiarity of btrfs is that it manages its own disks, so it
creates a private bdi stored in sb->s_bdi, which is different from
%queue->backing_device_info.

So running balance_dirty_pages() during btrfs's buffered writes
doesn't have any effect on %queue->backing_device_info->wb, and thus
writeback gets put into the wb_background bucket all the time.

I haven't measured the performance numbers with a real test, but in
theory this problem will make buffered writers spend more time in
balance_dirty_pages()'s for(;;) loop.

Chris, Jens, thoughts?

Thanks,

-liubo


Re: [PATCH] block/loop: fix use after feee

2017-08-30 Thread Shaohua Li
On Wed, Aug 30, 2017 at 02:51:05PM -0700, Shaohua Li wrote:
> lo_rw_aio->call_read_iter->
> 1 aops->direct_IO
> 2 iov_iter_revert
> lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could
> be freed before 2, which accesses bvec.

please ignore this one, I accidentally sent it out. The correct fix is in
another patch.
 
> This conflicts with my directio performance improvement patches, which
> I'll resend.
> 
> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index ef83349..153ab3c 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -490,6 +490,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>   iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
> bio_segments(bio), blk_rq_bytes(cmd->rq));
> + bio_inc_remaining(bio);
>   /*
>* This bio may be started from the middle of the 'bvec'
>* because of bio splitting, so offset from the bvec must
> @@ -507,6 +508,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   else
>   ret = call_read_iter(file, &cmd->iocb, &iter);
>  
> + bio_endio(bio);
>   if (ret != -EIOCBQUEUED)
>   cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
>   return 0;
> -- 
> 2.9.5
> 


Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-30 Thread Shaohua Li
On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote:
> On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote:
> > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > > > From: Shaohua Li 
> > > > > > 
> > > > > > Currently loop disables merge. While it makes sense for buffer IO 
> > > > > > mode,
> > > > > > directio mode can benefit from request merge. Without merge, loop 
> > > > > > could
> > > > > > send small size IO to underlayer disk and harm performance.
> > > > > 
> > > > > Hi Shaohua,
> > > > > 
> > > > > IMO no matter if merge is used, loop always sends page by page
> > > > > to VFS in both dio or buffer I/O.
> > > > 
> > > > Why do you think so?
> > > 
> > > do_blockdev_direct_IO() still handles page by page from iov_iter, and
> > > with bigger request, I guess it might be the plug merge working.
> > 
> > This is not true. directio sends big size bio directly, not because of plug
> > merge. Please at least check the code before you complain.
> 
> I complain nothing, just try to understand the idea behind,
> never mind, :-)
> 
> > 
> > > >  
> > > > > Also if merge is enabled on loop, that means merge is run
> > > > > on both loop and low level block driver, and not sure if we
> > > > > can benefit from that.
> > > > 
> > > > why does merge still happen in low level block driver?
> > > 
> > > Because scheduler is still working on low level disk. My question
> > > is that why the scheduler in low level disk doesn't work now
> > > if scheduler on loop can merge?
> > 
> > The low level disk can still do merge, but since this is directio, the upper
> > layer already dispatches request as big as possible. There is very little
> > chance the requests can be merged again.
> 
> That is true, but these requests need to enter scheduler queue and
> be tried to merge again, even though it is less possible to succeed.
> Double merge may take extra CPU utilization.
> 
> Looks it doesn't answer my question.
> 
> Without this patch, the requests dispatched to loop won't be merged,
> so they may be small and their sectors may be continuous, my question
> is why dio bios converted from these small loop requests can't be
> merged in block layer when queuing these dio bios to low level device?

The loop thread doesn't have a plug there. Even if we had a plug there, it's
still a bad idea to do the merge in the low-level layer. If we run direct_IO
for every 4k, the overhead is much higher than that of bio merging. The
direct_IO path will call into fs code, take different mutexes, do metadata
updates for writes and so on.
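For context, letting the block layer merge on the loop queue only in dio
mode boils down to toggling the no-merges flag when the mode changes; a
hedged sketch of that idea (the patch under discussion is not quoted in
this thread, and the helper name below is illustrative):

  /* Allow request merging on the loop queue only in dio mode, where the
   * merged request is submitted to the backing file as one big iov_iter. */
  static void loop_update_merge(struct loop_device *lo, bool use_dio)
  {
          if (use_dio)
                  queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
          else
                  queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
  }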


[PATCH] block/loop: fix use after feee

2017-08-30 Thread Shaohua Li
lo_rw_aio->call_read_iter->
1   aops->direct_IO
2   iov_iter_revert
lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could
be freed before 2, which accesses bvec.

This conflicts with my directio performance improvement patches, which
I'll resend.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ef83349..153ab3c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -490,6 +490,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
  bio_segments(bio), blk_rq_bytes(cmd->rq));
+   bio_inc_remaining(bio);
/*
 * This bio may be started from the middle of the 'bvec'
 * because of bio splitting, so offset from the bvec must
@@ -507,6 +508,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
else
ret = call_read_iter(file, &cmd->iocb, &iter);
 
+   bio_endio(bio);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
return 0;
-- 
2.9.5



[PATCH] block/loop: fix use after free

2017-08-30 Thread Shaohua Li
lo_rw_aio->call_read_iter->
1   aops->direct_IO
2   iov_iter_revert
lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could
be freed before 2, which accesses bvec.

This conflicts with my directio performance improvement patches, which
I'll resend.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ef83349..153ab3c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -490,6 +490,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
iov_iter_bvec(&iter, ITER_BVEC | rw, bvec,
  bio_segments(bio), blk_rq_bytes(cmd->rq));
+   bio_inc_remaining(bio);
/*
 * This bio may be started from the middle of the 'bvec'
 * because of bio splitting, so offset from the bvec must
@@ -507,6 +508,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
else
ret = call_read_iter(file, &cmd->iocb, &iter);
 
+   bio_endio(bio);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
return 0;
-- 
2.9.5



Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Sagi Grimberg



That would mean that I need to open-code the tagset iteration in nvme
which does not feel like something a driver should do.


How about renaming blk_mq_reinit_tagset() into blk_mq_tagset_iter() and
to make the argument list of blk_mq_tagset_iter() more similar to that of
blk_mq_queue_tag_busy_iter() such that callers of blk_mq_tagset_iter()
can pass a pointer to any structure through the @priv argument? That would
make this function more general and maybe also more useful to other block
drivers.


We could do that. But it feels like going head over heels just to
keep a change titled:

"blk-mq: Make blk_mq_reinit_tagset() calls easier to read"

whose motivation I'm not exactly sure I share. Also, I kinda liked
the symmetry of the init/exit/reinit_request calling convention.

But, if you absolutely think it's necessary to keep the change, we can
add an "all tags" iterator and use that to implement reinit_request in
nvme.

Christoph? Jens? A verdict before we go forward here?
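For concreteness, the generalized iterator being discussed would have
roughly this shape (a proposed signature for the sake of discussion, not an
API that exists at this point):

  /* Iterate over all requests of a tag set, passing an opaque cookie to
   * the callback, analogous to blk_mq_queue_tag_busy_iter(). */
  int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *priv,
                         int (*fn)(void *priv, struct request *rq));

with drivers such as nvme implementing reinit_request on top of it.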


Re: [PATCH 5/6] platform/x86: make device_attribute const

2017-08-30 Thread Andy Shevchenko
On Mon, Aug 21, 2017 at 2:43 PM, Bhumika Goyal  wrote:
> Make these const as they are only passed as an argument to the
> function device_create_file and device_remove_file and the corresponding
> arguments are of type const.
> Done using Coccinelle
>

Split this on a per-driver basis.

> Signed-off-by: Bhumika Goyal 
> ---
>  drivers/platform/x86/classmate-laptop.c | 6 +++---
>  drivers/platform/x86/intel-rst.c| 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/platform/x86/classmate-laptop.c 
> b/drivers/platform/x86/classmate-laptop.c
> index 55cf10b..d3715e2 100644
> --- a/drivers/platform/x86/classmate-laptop.c
> +++ b/drivers/platform/x86/classmate-laptop.c
> @@ -254,7 +254,7 @@ static ssize_t cmpc_accel_sensitivity_store_v4(struct 
> device *dev,
> return strnlen(buf, count);
>  }
>
> -static struct device_attribute cmpc_accel_sensitivity_attr_v4 = {
> +static const struct device_attribute cmpc_accel_sensitivity_attr_v4 = {
> .attr = { .name = "sensitivity", .mode = 0660 },
> .show = cmpc_accel_sensitivity_show_v4,
> .store = cmpc_accel_sensitivity_store_v4
> @@ -303,7 +303,7 @@ static ssize_t cmpc_accel_g_select_store_v4(struct device 
> *dev,
> return strnlen(buf, count);
>  }
>
> -static struct device_attribute cmpc_accel_g_select_attr_v4 = {
> +static const struct device_attribute cmpc_accel_g_select_attr_v4 = {
> .attr = { .name = "g_select", .mode = 0660 },
> .show = cmpc_accel_g_select_show_v4,
> .store = cmpc_accel_g_select_store_v4
> @@ -599,7 +599,7 @@ static ssize_t cmpc_accel_sensitivity_store(struct device 
> *dev,
> return strnlen(buf, count);
>  }
>
> -static struct device_attribute cmpc_accel_sensitivity_attr = {
> +static const struct device_attribute cmpc_accel_sensitivity_attr = {
> .attr = { .name = "sensitivity", .mode = 0660 },
> .show = cmpc_accel_sensitivity_show,
> .store = cmpc_accel_sensitivity_store
> diff --git a/drivers/platform/x86/intel-rst.c 
> b/drivers/platform/x86/intel-rst.c
> index 7344d84..760a9bf 100644
> --- a/drivers/platform/x86/intel-rst.c
> +++ b/drivers/platform/x86/intel-rst.c
> @@ -65,7 +65,7 @@ static ssize_t irst_store_wakeup_events(struct device *dev,
> return count;
>  }
>
> -static struct device_attribute irst_wakeup_attr = {
> +static const struct device_attribute irst_wakeup_attr = {
> .attr = { .name = "wakeup_events", .mode = 0600 },
> .show = irst_show_wakeup_events,
> .store = irst_store_wakeup_events
> @@ -111,7 +111,7 @@ static ssize_t irst_store_wakeup_time(struct device *dev,
> return count;
>  }
>
> -static struct device_attribute irst_timeout_attr = {
> +static const struct device_attribute irst_timeout_attr = {
> .attr = { .name = "wakeup_time", .mode = 0600 },
> .show = irst_show_wakeup_time,
> .store = irst_store_wakeup_time
> --
> 1.9.1
>



-- 
With Best Regards,
Andy Shevchenko


[PATCH 3/3] block/loop: suppress discard IO error message

2017-08-30 Thread Shaohua Li
We don't know whether fallocate really supports FALLOC_FL_PUNCH_HOLE
until fallocate is called. If it doesn't, loop returns -EOPNOTSUPP and
we see a lot of error messages printed by blk_update_request. Failure
of a discard IO isn't a problem, so we just return 0 in loop, which
suppresses the IO error messages.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a30aa45..15f51e3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -441,6 +441,9 @@ static int lo_discard(struct loop_device *lo, struct 
request *rq, loff_t pos)
ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
ret = -EIO;
+
+   if (req_op(rq) == REQ_OP_DISCARD && ret == -EOPNOTSUPP)
+   ret = 0;
  out:
return ret;
 }
-- 
2.9.5



[PATCH 0/3]block/loop: handle discard/zeroout error

2017-08-30 Thread Shaohua Li
Fix some problems that show up when setting up a loop device with a block
device as the backing file and creating/mounting ext4 in the loop device.

BTW: blkdev_issue_zeroout retries if we immediately find that the device
doesn't support zeroout, but it doesn't retry if submit_bio_wait returns
-EOPNOTSUPP. Is this the correct behavior?

Thanks,
Shaohua


Shaohua Li (3):
  block/loop: don't hijack error number
  block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES
  block/loop: suppress discard IO error message

 drivers/block/loop.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

-- 
2.9.5



[PATCH 1/3] block/loop: don't hijack error number

2017-08-30 Thread Shaohua Li
If the bio returns -EOPNOTSUPP, we shouldn't hijack it and return -EIO.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ef83349..054dccc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -464,7 +464,7 @@ static void lo_complete_rq(struct request *rq)
zero_fill_bio(bio);
}
 
-   blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+   blk_mq_end_request(rq, errno_to_blk_status(cmd->ret));
 }
 
 static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
@@ -1725,7 +1725,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
-   cmd->ret = ret ? -EIO : 0;
+   cmd->ret = ret;
blk_mq_complete_request(cmd->rq);
}
 }
-- 
2.9.5



[PATCH 2/3] block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES

2017-08-30 Thread Shaohua Li
REQ_OP_WRITE_ZEROES really means to zero the data. And in
blkdev_fallocate, FALLOC_FL_ZERO_RANGE will retry but
FALLOC_FL_PUNCH_HOLE will not, even though the loop request doesn't
have BLKDEV_ZERO_NOFALLBACK set.

Signed-off-by: Shaohua Li 
---
 drivers/block/loop.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 054dccc..a30aa45 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -430,6 +430,9 @@ static int lo_discard(struct loop_device *lo, struct 
request *rq, loff_t pos)
int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
int ret;
 
+   if (req_op(rq) == REQ_OP_WRITE_ZEROES)
+   mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
+
if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
ret = -EOPNOTSUPP;
goto out;
-- 
2.9.5



[PATCH 5/5] bfq: Use icq_to_bic() consistently

2017-08-30 Thread Bart Van Assche
Some code uses icq_to_bic() to convert an io_cq pointer to a
bfq_io_cq pointer while other code uses a direct cast. Convert
the code that uses a direct cast such that it uses icq_to_bic().

Signed-off-by: Bart Van Assche 
Cc: Paolo Valente 
---
 block/bfq-iosched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index d375b29ad085..7d1b85e393b2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -239,7 +239,7 @@ static int T_slow[2];
 static int T_fast[2];
 static int device_speed_thresh[2];
 
-#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BIC(rq) icq_to_bic((rq)->elv.priv[0])
 #define RQ_BFQQ(rq)((rq)->elv.priv[1])
 
 struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
-- 
2.14.1



[PATCH 4/5] bfq: Suppress compiler warnings about comparisons

2017-08-30 Thread Bart Van Assche
This patch prevents the following warnings from being reported when
building with W=1:

block/bfq-iosched.c: In function 'bfq_back_seek_max_store':
block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is 
always false [-Wtype-limits]
  if (__data < (MIN))  \
 ^
block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION'
 STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
 ^~
block/bfq-iosched.c: In function 'bfq_slice_idle_store':
block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is 
always false [-Wtype-limits]
  if (__data < (MIN))  \
 ^
block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION'
 STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
 ^~
block/bfq-iosched.c: In function 'bfq_slice_idle_us_store':
block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is 
always false [-Wtype-limits]
  if (__data < (MIN))  \
 ^
block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
 USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
 ^~~

Signed-off-by: Bart Van Assche 
Cc: Paolo Valente 
---
 block/bfq-iosched.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index cf92f16eb5f2..d375b29ad085 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4851,16 +4851,16 @@ static ssize_t  
\
 __FUNC(struct elevator_queue *e, const char *page, size_t count)   \
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
-   unsigned long __data;   \
+   unsigned long __data, __min = (MIN), __max = (MAX); \
int ret;\
\
ret = bfq_var_store(&__data, (page));   \
if (ret)\
return ret; \
-   if (__data < (MIN)) \
-   __data = (MIN); \
-   else if (__data > (MAX))\
-   __data = (MAX); \
+   if (__data < __min) \
+   __data = __min; \
+   else if (__data > __max)\
+   __data = __max; \
if (__CONV == 1)\
*(__PTR) = msecs_to_jiffies(__data);\
else if (__CONV == 2)   \
@@ -4883,16 +4883,16 @@ STORE_FUNCTION(bfq_slice_idle_store, 
&bfqd->bfq_slice_idle, 0, INT_MAX, 2);
 static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t 
count)\
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
-   unsigned long __data;   \
+   unsigned long __data, __min = (MIN), __max = (MAX); \
int ret;\
\
ret = bfq_var_store(&__data, (page));   \
if (ret)\
return ret; \
-   if (__data < (MIN)) \
-   __data = (MIN); \
-   else if (__data > (MAX))\
-   __data = (MAX); \
+   if (__data < __min) \
+   __data = __min; \
+   else if (__data > __max)\
+   __data = __max; \
*(__PTR) = (u64)__data * NSEC_PER_USEC; \
return count;   \
 }
-- 
2.14.1



[PATCH 3/5] bfq: Check kstrtoul() return value

2017-08-30 Thread Bart Van Assche
Make sysfs writes fail for invalid numbers instead of storing
uninitialized data copied from the stack. This patch removes
all uninitialized_var() occurrences from the BFQ source code.

Signed-off-by: Bart Van Assche 
Cc: Paolo Valente 
---
 block/bfq-iosched.c | 52 +---
 1 file changed, 37 insertions(+), 15 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8c11c2e827a5..cf92f16eb5f2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4802,13 +4802,15 @@ static ssize_t bfq_var_show(unsigned int var, char 
*page)
return sprintf(page, "%u\n", var);
 }
 
-static void bfq_var_store(unsigned long *var, const char *page)
+static int bfq_var_store(unsigned long *var, const char *page)
 {
unsigned long new_val;
int ret = kstrtoul(page, 10, &new_val);
 
-   if (ret == 0)
-   *var = new_val;
+   if (ret)
+   return ret;
+   *var = new_val;
+   return 0;
 }
 
 #define SHOW_FUNCTION(__FUNC, __VAR, __CONV)   \
@@ -4849,8 +4851,12 @@ static ssize_t   
\
 __FUNC(struct elevator_queue *e, const char *page, size_t count)   \
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
-   unsigned long uninitialized_var(__data);\
-   bfq_var_store(&__data, (page)); \
+   unsigned long __data;   \
+   int ret;\
+   \
+   ret = bfq_var_store(&__data, (page));   \
+   if (ret)\
+   return ret; \
if (__data < (MIN)) \
__data = (MIN); \
else if (__data > (MAX))\
@@ -4877,8 +4883,12 @@ STORE_FUNCTION(bfq_slice_idle_store, 
&bfqd->bfq_slice_idle, 0, INT_MAX, 2);
 static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t 
count)\
 {  \
struct bfq_data *bfqd = e->elevator_data;   \
-   unsigned long uninitialized_var(__data);\
-   bfq_var_store(&__data, (page)); \
+   unsigned long __data;   \
+   int ret;\
+   \
+   ret = bfq_var_store(&__data, (page));   \
+   if (ret)\
+   return ret; \
if (__data < (MIN)) \
__data = (MIN); \
else if (__data > (MAX))\
@@ -4894,9 +4904,12 @@ static ssize_t bfq_max_budget_store(struct 
elevator_queue *e,
const char *page, size_t count)
 {
struct bfq_data *bfqd = e->elevator_data;
-   unsigned long uninitialized_var(__data);
+   unsigned long __data;
+   int ret;
 
-   bfq_var_store(&__data, (page));
+   ret = bfq_var_store(&__data, (page));
+   if (ret)
+   return ret;
 
if (__data == 0)
bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
@@ -4919,9 +4932,12 @@ static ssize_t bfq_timeout_sync_store(struct 
elevator_queue *e,
  const char *page, size_t count)
 {
struct bfq_data *bfqd = e->elevator_data;
-   unsigned long uninitialized_var(__data);
+   unsigned long __data;
+   int ret;
 
-   bfq_var_store(&__data, (page));
+   ret = bfq_var_store(&__data, (page));
+   if (ret)
+   return ret;
 
if (__data < 1)
__data = 1;
@@ -4939,9 +4955,12 @@ static ssize_t bfq_strict_guarantees_store(struct 
elevator_queue *e,
 const char *page, size_t count)
 {
struct bfq_data *bfqd = e->elevator_data;
-   unsigned long uninitialized_var(__data);
+   unsigned long __data;
+   int ret;
 
-   bfq_var_store(&__data, (page));
+   ret = bfq_var_store(&__data, (page));
+   if (ret)
+   return ret;
 
if (__data > 1)
__data = 1;
@@ -4958,9 +4977,12 @@ static ssize_t bfq_low_latency_store(struct 
elevator_queue *e,

[PATCH 2/5] bfq: Declare local functions static

2017-08-30 Thread Bart Van Assche
Signed-off-by: Bart Van Assche 
Cc: Paolo Valente 
---
 block/bfq-cgroup.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 78b2e0db4fb2..ceefb9a706d6 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -206,7 +206,7 @@ static void bfqg_get(struct bfq_group *bfqg)
bfqg->ref++;
 }
 
-void bfqg_put(struct bfq_group *bfqg)
+static void bfqg_put(struct bfq_group *bfqg)
 {
bfqg->ref--;
 
@@ -385,7 +385,7 @@ static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg 
*blkcg)
return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
 }
 
-struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
+static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
 {
struct bfq_group_data *bgd;
 
@@ -395,7 +395,7 @@ struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
return &bgd->pd;
 }
 
-void bfq_cpd_init(struct blkcg_policy_data *cpd)
+static void bfq_cpd_init(struct blkcg_policy_data *cpd)
 {
struct bfq_group_data *d = cpd_to_bfqgd(cpd);
 
@@ -403,12 +403,12 @@ void bfq_cpd_init(struct blkcg_policy_data *cpd)
CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
 }
 
-void bfq_cpd_free(struct blkcg_policy_data *cpd)
+static void bfq_cpd_free(struct blkcg_policy_data *cpd)
 {
kfree(cpd_to_bfqgd(cpd));
 }
 
-struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
+static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
 {
struct bfq_group *bfqg;
 
@@ -426,7 +426,7 @@ struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
return &bfqg->pd;
 }
 
-void bfq_pd_init(struct blkg_policy_data *pd)
+static void bfq_pd_init(struct blkg_policy_data *pd)
 {
struct blkcg_gq *blkg = pd_to_blkg(pd);
struct bfq_group *bfqg = blkg_to_bfqg(blkg);
@@ -445,7 +445,7 @@ void bfq_pd_init(struct blkg_policy_data *pd)
bfqg->rq_pos_tree = RB_ROOT;
 }
 
-void bfq_pd_free(struct blkg_policy_data *pd)
+static void bfq_pd_free(struct blkg_policy_data *pd)
 {
struct bfq_group *bfqg = pd_to_bfqg(pd);
 
@@ -453,7 +453,7 @@ void bfq_pd_free(struct blkg_policy_data *pd)
bfqg_put(bfqg);
 }
 
-void bfq_pd_reset_stats(struct blkg_policy_data *pd)
+static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
 {
struct bfq_group *bfqg = pd_to_bfqg(pd);
 
@@ -740,7 +740,7 @@ static void bfq_reparent_active_entities(struct bfq_data 
*bfqd,
  * blkio already grabs the queue_lock for us, so no need to use
  * RCU-based magic
  */
-void bfq_pd_offline(struct blkg_policy_data *pd)
+static void bfq_pd_offline(struct blkg_policy_data *pd)
 {
struct bfq_service_tree *st;
struct bfq_group *bfqg = pd_to_bfqg(pd);
-- 
2.14.1



[PATCH 1/5] bfq: Annotate fall-through in a switch statement

2017-08-30 Thread Bart Van Assche
This patch avoids a gcc 7 warning about an implicit fall-through
when building with W=1.

Signed-off-by: Bart Van Assche 
Cc: Paolo Valente 
---
 block/bfq-iosched.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6a7a26b6cec1..8c11c2e827a5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3780,6 +3780,7 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct 
bfq_io_cq *bic)
default:
dev_err(bfqq->bfqd->queue->backing_dev_info->dev,
"bfq: bad prio class %d\n", ioprio_class);
+   /* fall through */
case IOPRIO_CLASS_NONE:
/*
 * No prio set, inherit CPU scheduling settings.
-- 
2.14.1



[PATCH 0/5] Five BFQ patches for kernel v4.14

2017-08-30 Thread Bart Van Assche
Hello Jens,

This series consists of five small patches. Three patches address compiler
warnings, one patch is a fix for the BFQ sysfs interface and a fifth patch
makes the BFQ code slightly more consistent. Please consider these patches
for kernel v4.14.

Thanks,

Bart.

Bart Van Assche (5):
  bfq: Annotate fall-through in a switch statement
  bfq: Declare local functions static
  bfq: Check kstrtoul() return value
  bfq: Suppress compiler warnings about comparisons
  bfq: Use icq_to_bic() consistently

 block/bfq-cgroup.c  | 18 +++---
 block/bfq-iosched.c | 71 +++--
 2 files changed, 56 insertions(+), 33 deletions(-)

-- 
2.14.1



Re: [PATCH V3 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge()

2017-08-30 Thread Bart Van Assche
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> - struct request **merged_request)
> +static bool __blk_mq_try_merge(struct request_queue *q,
> + struct bio *bio, struct request **merged_request,
> + struct request *candidate, enum elv_merge type)
>  {
> - struct request *rq;
> + struct request *rq = candidate;

It seems weird to me that the argument 'candidate' is not used other than to
copy its value into the local variable 'rq'. How about removing that local
variable and renaming the 'candidate' argument to 'rq'? Anyway, with or
without that change:

Reviewed-by: Bart Van Assche 

Re: [PATCH V3 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-08-30 Thread Bart Van Assche
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> During dispatching, we moved all requests from hctx->dispatch to
> one temporary list, then dispatch them one by one from this list.
> Unfortunately duirng this period, run queue from other contexts
^^
during?
> may think the queue is idle, then start to dequeue from sw/scheduler
> queue and still try to dispatch because ->dispatch is empty. This way
> hurts sequential I/O performance because requests are dequeued when
> lld queue is busy.
> [ ... ]
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 735e432294ab..4d7bea8c2594 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>   struct request_queue *q = hctx->queue;
>   struct elevator_queue *e = q->elevator;
>   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> - bool do_sched_dispatch = true;
>   LIST_HEAD(rq_list);
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */

Shouldn't blk_mq_sched_dispatch_requests() set BLK_MQ_S_DISPATCH_BUSY just after
the following statement because this statement makes the dispatch list empty?

list_splice_init(&hctx->dispatch, &rq_list);
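
In other words, something along these lines (only a sketch of the placement
being asked about, not a tested change):

        spin_lock(&hctx->lock);
        list_splice_init(&hctx->dispatch, &rq_list);
        set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);  /* i.e. here */
        spin_unlock(&hctx->lock);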

> @@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>*/
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
> - do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> - } else if (!has_sched_dispatch && !q->queue_depth) {
> + blk_mq_dispatch_rq_list(q, &rq_list);
> +
> + /*
> +  * We may clear DISPATCH_BUSY just after it
> +  * is set from another context, the only cost
> +  * is that one request is dequeued a bit early,
> +  * we can survive that. Given the window is
> +  * too small, no need to worry about performance
   ^^^
The word "too" seems extraneous to me in this sentence.

>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> @@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct 
> blk_mq_hw_ctx *hctx,
>*/
>   spin_lock(&hctx->lock);
>   list_add(&rq->queuelist, &hctx->dispatch);
> + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
>   spin_unlock(&hctx->lock);
>   return true;
>  }

Is it necessary to make blk_mq_sched_bypass_insert() set BLK_MQ_S_DISPATCH_BUSY?
My understanding is that only code that makes the dispatch list empty should
set BLK_MQ_S_DISPATCH_BUSY. However, blk_mq_sched_bypass_insert() adds an
element to the dispatch list, which guarantees that that list is not empty.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f063dd0f197f..6af56a71c1cd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list)
>  
>   spin_lock(&hctx->lock);
>   list_splice_init(list, &hctx->dispatch);
> + /*
> +  * DISPATCH_BUSY won't be cleared until all requests
> +  * in hctx->dispatch are dispatched successfully
> +  */
> + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
>   spin_unlock(&hctx->lock);

Same comment here - since this code adds one or more requests to the dispatch
list, is it really needed to set the DISPATCH_BUSY flag?

Bart.

Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 09:51:31AM -0600, Jens Axboe wrote:
> On 08/30/2017 09:39 AM, Ming Lei wrote:
> > On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote:
> >> On 08/30/2017 09:19 AM, Ming Lei wrote:
> >>> It is more reasonable to add requests to ->dispatch in way
> >>> of FIFO style, instead of LIFO style.
> >>>
> >>> Also in this way, we can allow to insert request at the front
> >>> of hw queue, which function is needed to fix one bug
> >>> in blk-mq's implementation of blk_execute_rq()
> >>>
> >>> Reported-by: Oleksandr Natalenko 
> >>> Tested-by: Oleksandr Natalenko 
> >>> Signed-off-by: Ming Lei 
> >>> ---
> >>>  block/blk-mq-sched.c | 2 +-
> >>>  block/blk-mq.c   | 2 +-
> >>>  2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >>> index 4ab69435708c..8d97df40fc28 100644
> >>> --- a/block/blk-mq-sched.c
> >>> +++ b/block/blk-mq-sched.c
> >>> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct 
> >>> blk_mq_hw_ctx *hctx,
> >>>* the dispatch list.
> >>>*/
> >>>   spin_lock(&hctx->lock);
> >>> - list_add(&rq->queuelist, &hctx->dispatch);
> >>> + list_add_tail(&rq->queuelist, &hctx->dispatch);
> >>>   spin_unlock(&hctx->lock);
> >>>   return true;
> >>>  }
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 4603b115e234..fed3d0c16266 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> >>> *q, struct list_head *list)
> >>>   blk_mq_put_driver_tag(rq);
> >>>  
> >>>   spin_lock(&hctx->lock);
> >>> - list_splice_init(list, &hctx->dispatch);
> >>> + list_splice_tail_init(list, &hctx->dispatch);
> >>>   spin_unlock(&hctx->lock);
> >>
> >> I'm not convinced this is safe, there's actually a reason why the
> >> request is added to the front and not the back. We do have
> >> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of
> > 
> > reorder_tags_to_front() is for reordering the requests in current list,
> > this patch is for splicing list into hctx->dispatch, so I can't see
> > it isn't safe, or could you explain it a bit?
> 
> If we can get the ordering right, then down the line we won't need to
> have the tags reordering at all. It's an ugly hack that I'd love to see
> go away.

If reorder_tags_to_front() isn't removed, this patch is still safe.

But blk_execute_rq_nowait() needs to add one request at the front
of the hw queue, which can conflict with maintaining a perfect order
for removing reorder_tags_to_front().

So could you share your opinion on the 2nd patch for fixing
blk_execute_rq_nowait()?

> 
> >> that than make this change.
> >>
> >> What's your reasoning here? Your changelog doesn't really explain why
> > 
> > Firstly the 2nd patch need to add one rq(such as RQF_PM) to the
> > front of the hw queue, the simple way is to add it to the front
> > of hctx->dispatch. Without this change, the 2nd patch can't work
> > at all.
> > 
> > Secondly this way is still reasonable:
> > 
> > - one rq is added to hctx->dispatch because queue is busy
> > - another rq is added to hctx->dispatch too because of same reason
> >
> > so it is reasonable to to add list into hctx->dispatch in FIFO style.
> 
> Not disagreeing with the logic. But it also begs the question of why we
> don't apply the same treatment to when we splice leftovers to the
> dispatch list, currently we front splice that.
> 
> All I'm saying is that you need to tread very carefully with this, and
> throw it through some careful testing to ensure that we don't introduce
> conditions that now livelock. NVMe is the easy test case, that will

Yes, ->dispatch is far away from NVMe, but it is a close friend of SCSI-MQ.

> generally always work since we never run out of tags. The problematic
> test case is usually things like SATA with 31 tags, and especially SATA
> with flushes that don't queue. One good test case is the one where you
> end up having all tags (or almost all) consumed by flushes, and still
> ensuring that we're making forward progress.

Understood!

We can even make the test more aggressive.

I just set up one virtio-scsi device, changing both 'can_queue' and
'cmd_per_lun' to 1, which means the hw queue depth is 1, and the
hw queue number is set to 1.

Then I ran 'dbench -t 30 -s -F 64' on ext4 on top of
this virtio-scsi device.

The dbench run (64 jobs, sync write, fsync) just works fine with
this patch applied.


-- 
Ming


Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Bart Van Assche
On Wed, 2017-08-30 at 19:47 +0300, Sagi Grimberg wrote:
> > > If we get to choose, my preference would be to restore the old behavior
> > > because currently existing nvme transports keep their internal ctrl
> > > representation in the tagset->driver_data as they implement
> > > init/exit_request. Bouncing in nvme core would require to change that
> > > and always keep struct nvme_ctrl as the tagset->driver_data and convert
> > > it on every handler...
> > > 
> > > Everything is doable but it seems like an unneeded hassle to me...
> > 
> > Sorry but I'm not convinced that it would be necessary to change what
> > tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from
> > the block layer core to the NVMe core, to rename it and to pass a pointer
> > to the nvme_ctrl data structure to that function instead of only block layer
> > information?
> 
> That would mean that I need to open-code the tagset iteration in nvme
> which does not feel like something a driver should do.

How about renaming blk_mq_reinit_tagset() into blk_mq_tagset_iter() and
making its argument list more similar to that of
blk_mq_queue_tag_busy_iter(), such that callers of blk_mq_tagset_iter()
can pass a pointer to any structure through the @priv argument? That would
make this function more general and maybe also more useful to other block
drivers.
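
Roughly the following prototype is what I have in mind (a hypothetical
sketch, not an existing interface; the callback typedef name is made up):

        /* invoked once per allocated request; @priv is opaque caller data */
        typedef int (reinit_rq_fn)(void *priv, struct request *rq);

        int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *priv,
                               reinit_rq_fn *fn);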

Bart.

Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Sagi Grimberg



If we get to choose, my preference would be to restore the old behavior
because currently existing nvme transports keep their internal ctrl
representation in the tagset->driver_data as they implement
init/exit_request. Bouncing in nvme core would require to change that
and always keep struct nvme_ctrl as the tagset->driver_data and convert
it on every handler...

Everything is doable but it seems like an unneeded hassle to me...


Sorry but I'm not convinced that it would be necessary to change what
tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from
the block layer core to the NVMe core, to rename it and to pass a pointer
to the nvme_ctrl data structure to that function instead of only block layer
information?


That would mean that I need to open-code the tagset iteration in nvme
which does not feel like something a driver should do.


Re: [PATCH V3 05/14] blk-mq-sched: improve dispatching from sw queue

2017-08-30 Thread Bart Van Assche
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 50c6485cb04f..7b7a366a97f3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
>  
>   struct sbitmap  ctx_map;
>  
> + struct blk_mq_ctx   *dispatch_from;
> +
>   struct blk_mq_ctx   **ctxs;
>   unsigned intnr_ctx;

Hello Ming,

Are you relying here on the fact that the per-CPU queues are never
reallocated, even if CPU hot-plugging occurs? Sorry but that seems fragile
to me. I would like to see 'dispatch_from' be converted into an integer. It
is easy to check whether an integer software queue index is out of range
but it's not that easy to check whether a pointer became invalid.
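
Something along these lines is what I have in mind (a hypothetical sketch
with a made-up field name, not a concrete patch):

        /* in struct blk_mq_hw_ctx: unsigned int dispatch_from_idx; */

        unsigned int idx = hctx->dispatch_from_idx;

        if (idx >= hctx->nr_ctx)        /* trivial to range-check an index */
                idx = 0;
        ctx = hctx->ctxs[idx];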

Thanks,

Bart.

Re: BFQ + dm-mpath

2017-08-30 Thread Paolo Valente

> Il giorno 25 ago 2017, alle ore 01:16, Bart Van Assche 
>  ha scritto:
> 
> Hello Paolo,
> 

Hi Bart

> Has BFQ ever been tested in combination with dm-mpath?

Unfortunately no.

> A few seconds
> after I had started a run of the srp-tests software with BFQ a kernel
> oops appeared on the console. The command I ran is:
> 
> $ while true; do ~bart/software/infiniband/srp-test/run_tests -d -r 30 -e 
> bfq; done
> 
> And this is what appeared on the console:
> 
> [89251.977201] BUG: unable to handle kernel NULL pointer dereference at 
> 0018 
> [89251.977201] BUG: unable to handle kernel NULL pointer dereference at 
> 0018 
> [89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq] 
> [89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq] 
> [89251.995241] PGD 0  
> [89251.995241] PGD 0  
> [89251.995243] P4D 0  
> [89251.995243] P4D 0  
> [89251.999194]  
> [89251.999194]  
> [89252.006423] Oops:  [#1] PREEMPT SMP 
> [89252.006423] Oops:  [#1] PREEMPT SMP 
> [89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp 
> target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd 
> bfq dm_service_time rdma_cm
> iw_cm bridge stp llc ip6table_filter ip 
> 6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal 
> intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas 
> crct10dif_pclmul ioatdma
> crc32_pclmul mei_me joydev ghash_clmulni_intel 
> mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca 
> cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi 
> mfd_core ib_ipoib ib_cm
> ib_uverbs ib_umad mlx4_ib ib_core dm_mul 
> tipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs 
> ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid 
> mgag200 i2c_algo_bit
> drm_kms_helper 
> [89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp 
> target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd 
> bfq dm_service_time rdma_cm
> iw_cm bridge stp llc ip6table_filter ip 
> 6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal 
> intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas 
> crct10dif_pclmul ioatdma
> crc32_pclmul mei_me joydev ghash_clmulni_intel 
> mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca 
> cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi 
> mfd_core ib_ipoib ib_cm
> ib_uverbs ib_umad mlx4_ib ib_core dm_mul 
> tipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs 
> ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid 
> mgag200 i2c_algo_bit
> drm_kms_helper 
> [89252.103941]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci 
> mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common 
> pps_core megaraid_sas libphy
> [last unloaded: scsi_transport_srp] 
> [89252.103941]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci 
> mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common 
> pps_core megaraid_sas libphy
> [last unloaded: scsi_transport_srp] 
> [89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: GW   
> 4.13.0-rc6-dbg+ #2 
> [89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: GW   
> 4.13.0-rc6-dbg+ #2 
> [89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 
> 09/11/2012 
> [89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 
> 09/11/2012 
> [89252.150243] Workqueue: kblockd blk_mq_run_work_fn 
> [89252.150243] Workqueue: kblockd blk_mq_run_work_fn 
> [89252.157449] task: 880911d80040 task.stack: c90007c64000 
> [89252.157449] task: 880911d80040 task.stack: c90007c64000 
> [89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq] 
> [89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq] 
> [89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082 
> [89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082 
> [89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: 
> 0001 
> [89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: 
> 0001 
> [89252.191990] RDX: 8807b9988700 RSI:  RDI: 
> 880288554a68 
> [89252.191990] RDX: 8807b9988700 RSI:  RDI: 
> 880288554a68 
> [89252.201956] RBP: c90007c67b68 R08: 825820c0 R09: 
>  
> [89252.201956] RBP: c90007c67b68 R08: 825820c0 R09: 
>  
> [89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: 
> 880288554dc8 
> [89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: 
> 880288554dc8 
> [89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: 
> 8807bbe03528 
> [89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: 
> 8807bbe03528 
> [89252.231715] FS:  () GS:88093f7

Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Bart Van Assche
On Wed, 2017-08-30 at 19:05 +0300, Sagi Grimberg wrote:
> If we get to choose, my preference would be to restore the old behavior
> because currently existing nvme transports keep their internal ctrl
> representation in the tagset->driver_data as they implement
> init/exit_request. Bouncing in nvme core would require to change that
> and always keep struct nvme_ctrl as the tagset->driver_data and convert
> it on every handler...
> 
> Everything is doable but it seems like an unneeded hassle to me...

Sorry but I'm not convinced that it would be necessary to change what
tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from
the block layer core to the NVMe core, to rename it and to pass a pointer
to the nvme_ctrl data structure to that function instead of only block layer
information?

Bart.

Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Sagi Grimberg



Hello Sagi,

Sorry but I doubt that that patch makes it "impossible" to move controller
reset flow to the NVMe core. There are already several function pointers in
the nvme_ctrl_ops data structure and there is one such data structure per
transport. Had you already considered to add a function pointer to that
structure?


I have, that's the trampoline function that I was referring to, it feels
a bit funny to have aa nvme core function that would look like:

int nvme_reinit_request()
{
return ctrl->ops->reinit_request()
}

I can easily do that, but doesn't it defeat the purpose of blk_mq_ops?


I don't think so. Request reinitialization is an NVMe concept that is
not used by any other block driver, so why should the pointer to the
reinitialization function exist in blk_mq_ops?


The point of blk-mq is to provide all the functionality that drivers
need, even if it is for just a single driver. Functionality that can be
removed from drivers is good. The smaller/simpler we can make the
driver, the better off we are, even if that means adding a bit of
complexity to the core.

Obviously this is a case-by-case decision. For this particular case, I'm
happy with either solution.


If we get to choose, my preference would be to restore the old behavior
because currently existing nvme transports keep their internal ctrl
representation in the tagset->driver_data as they implement
init/exit_request. Bouncing in nvme core would require to change that
and always keep struct nvme_ctrl as the tagset->driver_data and convert
it on every handler...

Everything is doable but it seems like an unneeded hassle to me...


Re: [PATCH V3 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()

2017-08-30 Thread Bart Van Assche
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> This function is introduced for dequeuing request
> from sw queue so that we can dispatch it in
> scheduler's way.
> 
> More importantly, some SCSI devices may set
> q->queue_depth, which is a per-request_queue limit,
> and applied on pending I/O from all hctxs. This
> function is introduced for avoiding to dequeue too
> many requests from sw queue when ->dispatch isn't
> flushed completely.

Reviewed-by: Bart Van Assche 

Re: [PATCH V3 02/14] sbitmap: introduce __sbitmap_for_each_set()

2017-08-30 Thread Bart Van Assche
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
>  /**
>   * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
> + * @start: Where to start the iteration

Thanks for having changed the name of this argument ...

> -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn 
> fn,
> - void *data)
> +static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> +   unsigned int start,
> +   sb_for_each_fn fn, void *data)
>  {
> - unsigned int i;
> + unsigned int index = SB_NR_TO_INDEX(sb, start);
> + unsigned int nr = SB_NR_TO_BIT(sb, start);
> + unsigned int scanned = 0;

... but I'm still missing a check here whether or not index >= sb->map_nr.
 
> - for (i = 0; i < sb->map_nr; i++) {
> - struct sbitmap_word *word = &sb->map[i];
> - unsigned int off, nr;
> + while (1) {
> + struct sbitmap_word *word = &sb->map[index];
> + unsigned int depth = min_t(unsigned int, word->depth - nr,
> +sb->depth - scanned);
>  
> + scanned += depth;
>   if (!word->word)
> - continue;
> + goto next;
>  
> - nr = 0;
> - off = i << sb->shift;
> + depth += nr;
> + start = index << sb->shift;

The above statement reuses the argument 'start' for a new purpose. This is
confusing - please don't do this. Why not keep the name 'off'? That will
keep the changes minimal.
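
In other words, something like the following is what I would expect (a rough
sketch only, to illustrate the two points above):

        unsigned int index = SB_NR_TO_INDEX(sb, start);
        unsigned int nr = SB_NR_TO_BIT(sb, start);
        unsigned int scanned = 0;
        unsigned int off;

        if (index >= sb->map_nr)        /* the range check I am missing */
                index = nr = 0;

        /* and inside the loop, keep 'off' instead of reusing 'start': */
        off = index << sb->shift;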

Bart.

Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Jens Axboe
On 08/30/2017 09:46 AM, Bart Van Assche wrote:
> On Wed, 2017-08-30 at 18:33 +0300, Sagi Grimberg wrote:
 I just realized that patch:
 --
 commit d352ae205d8b05f3f7558d10f474d8436581b3e2
 Author: Bart Van Assche 
 Date:   Thu Aug 17 16:23:03 2017 -0700

   blk-mq: Make blk_mq_reinit_tagset() calls easier to read

   Since blk_mq_ops.reinit_request is only called from inside
   blk_mq_reinit_tagset(), make this function pointer an argument of
   blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
   This patch does not change any functionality but makes
   blk_mq_reinit_tagset() calls easier to read and to analyze.
 --

 Makes it impossible for me to move controller reset flow to
 nvme-core without adding a trampoline (as the reinit_request
 is transport specific)...
>>>
>>> Hello Sagi,
>>>
>>> Sorry but I doubt that that patch makes it "impossible" to move controller
>>> reset flow to the NVMe core. There are already several function pointers in
>>> the nvme_ctrl_ops data structure and there is one such data structure per
>>> transport. Had you already considered to add a function pointer to that
>>> structure?
>>
>> I have, that's the trampoline function that I was referring to, it feels
>> a bit funny to have aa nvme core function that would look like:
>>
>> int nvme_reinit_request()
>> {
>>  return ctrl->ops->reinit_request()
>> }
>>
>> I can easily do that, but doesn't it defeat the purpose of blk_mq_ops?
> 
> I don't think so. Request reinitialization is an NVMe concept that is
> not used by any other block driver, so why should the pointer to the
> reinitialization function exist in blk_mq_ops?

The point of blk-mq is to provide all the functionality that drivers
need, even if it is for just a single driver. Functionality that can be
removed from drivers is good. The smaller/simpler we can make the
driver, the better off we are, even if that means adding a bit of
complexity to the core.

Obviously this is a case-by-case decision. For this particular case, I'm
happy with either solution.

-- 
Jens Axboe



Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

2017-08-30 Thread Jens Axboe
On 08/30/2017 09:39 AM, Ming Lei wrote:
> On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote:
>> On 08/30/2017 09:19 AM, Ming Lei wrote:
>>> It is more reasonable to add requests to ->dispatch in way
>>> of FIFO style, instead of LIFO style.
>>>
>>> Also in this way, we can allow to insert request at the front
>>> of hw queue, which function is needed to fix one bug
>>> in blk-mq's implementation of blk_execute_rq()
>>>
>>> Reported-by: Oleksandr Natalenko 
>>> Tested-by: Oleksandr Natalenko 
>>> Signed-off-by: Ming Lei 
>>> ---
>>>  block/blk-mq-sched.c | 2 +-
>>>  block/blk-mq.c   | 2 +-
>>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> index 4ab69435708c..8d97df40fc28 100644
>>> --- a/block/blk-mq-sched.c
>>> +++ b/block/blk-mq-sched.c
>>> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct 
>>> blk_mq_hw_ctx *hctx,
>>>  * the dispatch list.
>>>  */
>>> spin_lock(&hctx->lock);
>>> -   list_add(&rq->queuelist, &hctx->dispatch);
>>> +   list_add_tail(&rq->queuelist, &hctx->dispatch);
>>> spin_unlock(&hctx->lock);
>>> return true;
>>>  }
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 4603b115e234..fed3d0c16266 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
>>> struct list_head *list)
>>> blk_mq_put_driver_tag(rq);
>>>  
>>> spin_lock(&hctx->lock);
>>> -   list_splice_init(list, &hctx->dispatch);
>>> +   list_splice_tail_init(list, &hctx->dispatch);
>>> spin_unlock(&hctx->lock);
>>
>> I'm not convinced this is safe, there's actually a reason why the
>> request is added to the front and not the back. We do have
>> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of
> 
> reorder_tags_to_front() is for reordering the requests in current list,
> this patch is for splicing list into hctx->dispatch, so I can't see
> it isn't safe, or could you explain it a bit?

If we can get the ordering right, then down the line we won't need to
have the tags reordering at all. It's an ugly hack that I'd love to see
go away.

>> that than make this change.
>>
>> What's your reasoning here? Your changelog doesn't really explain why
> 
> Firstly the 2nd patch need to add one rq(such as RQF_PM) to the
> front of the hw queue, the simple way is to add it to the front
> of hctx->dispatch. Without this change, the 2nd patch can't work
> at all.
> 
> Secondly this way is still reasonable:
> 
>   - one rq is added to hctx->dispatch because queue is busy
>   - another rq is added to hctx->dispatch too because of same reason
>
> so it is reasonable to to add list into hctx->dispatch in FIFO style.

Not disagreeing with the logic. But it also begs the question of why we
don't apply the same treatment to when we splice leftovers to the
dispatch list, currently we front splice that.

All I'm saying is that you need to tread very carefully with this, and
throw it through some careful testing to ensure that we don't introduce
conditions that now livelock. NVMe is the easy test case, that will
generally always work since we never run out of tags. The problematic
test case is usually things like SATA with 31 tags, and especially SATA
with flushes that don't queue. One good test case is the one where you
end up having all tags (or almost all) consumed by flushes, and still
ensuring that we're making forward progress.

-- 
Jens Axboe



Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Bart Van Assche
On Wed, 2017-08-30 at 18:33 +0300, Sagi Grimberg wrote:
> > > I just realized that patch:
> > > --
> > > commit d352ae205d8b05f3f7558d10f474d8436581b3e2
> > > Author: Bart Van Assche 
> > > Date:   Thu Aug 17 16:23:03 2017 -0700
> > > 
> > >   blk-mq: Make blk_mq_reinit_tagset() calls easier to read
> > > 
> > >   Since blk_mq_ops.reinit_request is only called from inside
> > >   blk_mq_reinit_tagset(), make this function pointer an argument of
> > >   blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
> > >   This patch does not change any functionality but makes
> > >   blk_mq_reinit_tagset() calls easier to read and to analyze.
> > > --
> > > 
> > > Makes it impossible for me to move controller reset flow to
> > > nvme-core without adding a trampoline (as the reinit_request
> > > is transport specific)...
> > 
> > Hello Sagi,
> > 
> > Sorry but I doubt that that patch makes it "impossible" to move controller
> > reset flow to the NVMe core. There are already several function pointers in
> > the nvme_ctrl_ops data structure and there is one such data structure per
> > transport. Had you already considered to add a function pointer to that
> > structure?
> 
> I have, that's the trampoline function that I was referring to, it feels
> a bit funny to have aa nvme core function that would look like:
> 
> int nvme_reinit_request()
> {
>   return ctrl->ops->reinit_request()
> }
> 
> I can easily do that, but doesn't it defeat the purpose of blk_mq_ops?

I don't think so. Request reinitialization is an NVMe concept that is not used
by any other block driver, so why should the pointer to the reinitialization
function exist in blk_mq_ops?

Bart.

Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

2017-08-30 Thread Ming Lei
On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote:
> On 08/30/2017 09:19 AM, Ming Lei wrote:
> > It is more reasonable to add requests to ->dispatch in way
> > of FIFO style, instead of LIFO style.
> > 
> > Also in this way, we can allow to insert request at the front
> > of hw queue, which function is needed to fix one bug
> > in blk-mq's implementation of blk_execute_rq()
> > 
> > Reported-by: Oleksandr Natalenko 
> > Tested-by: Oleksandr Natalenko 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq-sched.c | 2 +-
> >  block/blk-mq.c   | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 4ab69435708c..8d97df40fc28 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct 
> > blk_mq_hw_ctx *hctx,
> >  * the dispatch list.
> >  */
> > spin_lock(&hctx->lock);
> > -   list_add(&rq->queuelist, &hctx->dispatch);
> > +   list_add_tail(&rq->queuelist, &hctx->dispatch);
> > spin_unlock(&hctx->lock);
> > return true;
> >  }
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4603b115e234..fed3d0c16266 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> > struct list_head *list)
> > blk_mq_put_driver_tag(rq);
> >  
> > spin_lock(&hctx->lock);
> > -   list_splice_init(list, &hctx->dispatch);
> > +   list_splice_tail_init(list, &hctx->dispatch);
> > spin_unlock(&hctx->lock);
> 
> I'm not convinced this is safe, there's actually a reason why the
> request is added to the front and not the back. We do have
> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of

reorder_tags_to_front() is for reordering the requests in the current list,
while this patch is about splicing a list into hctx->dispatch, so I can't see
why it isn't safe; could you explain it a bit?

> that than make this change.
> 
> What's your reasoning here? Your changelog doesn't really explain why

Firstly, the 2nd patch needs to add one rq (such as an RQF_PM request) to the
front of the hw queue, and the simple way is to add it to the front
of hctx->dispatch. Without this change, the 2nd patch can't work
at all.

Secondly, this way is still reasonable:

- one rq is added to hctx->dispatch because the queue is busy
- another rq is added to hctx->dispatch too, for the same reason

so it is reasonable to add the list to hctx->dispatch in FIFO style.

Finally, my patchset for improving SCSI-MQ performance will change blk-mq to
not dequeue requests from the sw/scheduler queues while ->dispatch isn't
flushed. I believe that is the reasonable and correct thing to do; with that
change, there won't be a difference between the two styles.


-- 
Ming


Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Sagi Grimberg



I just realized that patch:
--
commit d352ae205d8b05f3f7558d10f474d8436581b3e2
Author: Bart Van Assche 
Date:   Thu Aug 17 16:23:03 2017 -0700

  blk-mq: Make blk_mq_reinit_tagset() calls easier to read

  Since blk_mq_ops.reinit_request is only called from inside
  blk_mq_reinit_tagset(), make this function pointer an argument of
  blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
  This patch does not change any functionality but makes
  blk_mq_reinit_tagset() calls easier to read and to analyze.
--

Makes it impossible for me to move controller reset flow to
nvme-core without adding a trampoline (as the reinit_request
is transport specific)...


Hello Sagi,

Sorry but I doubt that that patch makes it "impossible" to move controller
reset flow to the NVMe core. There are already several function pointers in
the nvme_ctrl_ops data structure and there is one such data structure per
transport. Had you already considered to add a function pointer to that
structure?


I have, that's the trampoline function that I was referring to, it feels
a bit funny to have an nvme core function that would look like:

int nvme_reinit_request()
{
return ctrl->ops->reinit_request()
}

I can easily do that, but doesn't it defeat the purpose
of blk_mq_ops?


Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Bart Van Assche
On Wed, 2017-08-30 at 18:10 +0300, Sagi Grimberg wrote:
> I just realized that patch:
> --
> commit d352ae205d8b05f3f7558d10f474d8436581b3e2
> Author: Bart Van Assche 
> Date:   Thu Aug 17 16:23:03 2017 -0700
> 
>  blk-mq: Make blk_mq_reinit_tagset() calls easier to read
> 
>  Since blk_mq_ops.reinit_request is only called from inside
>  blk_mq_reinit_tagset(), make this function pointer an argument of
>  blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
>  This patch does not change any functionality but makes
>  blk_mq_reinit_tagset() calls easier to read and to analyze.
> --
> 
> Makes it impossible for me to move controller reset flow to
> nvme-core without adding a trampoline (as the reinit_request
> is transport specific)...

Hello Sagi,

Sorry but I doubt that that patch makes it "impossible" to move controller
reset flow to the NVMe core. There are already several function pointers in
the nvme_ctrl_ops data structure and there is one such data structure per
transport. Had you already considered to add a function pointer to that
structure?

Bart.

Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

2017-08-30 Thread Jens Axboe
On 08/30/2017 09:19 AM, Ming Lei wrote:
> It is more reasonable to add requests to ->dispatch in way
> of FIFO style, instead of LIFO style.
> 
> Also in this way, we can allow to insert request at the front
> of hw queue, which function is needed to fix one bug
> in blk-mq's implementation of blk_execute_rq()
> 
> Reported-by: Oleksandr Natalenko 
> Tested-by: Oleksandr Natalenko 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq-sched.c | 2 +-
>  block/blk-mq.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..8d97df40fc28 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct 
> blk_mq_hw_ctx *hctx,
>* the dispatch list.
>*/
>   spin_lock(&hctx->lock);
> - list_add(&rq->queuelist, &hctx->dispatch);
> + list_add_tail(&rq->queuelist, &hctx->dispatch);
>   spin_unlock(&hctx->lock);
>   return true;
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4603b115e234..fed3d0c16266 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list)
>   blk_mq_put_driver_tag(rq);
>  
>   spin_lock(&hctx->lock);
> - list_splice_init(list, &hctx->dispatch);
> + list_splice_tail_init(list, &hctx->dispatch);
>   spin_unlock(&hctx->lock);

I'm not convinced this is safe, there's actually a reason why the
request is added to the front and not the back. We do have
reorder_tags_to_front() as a safe guard, but I'd much rather get rid of
that than make this change.

What's your reasoning here? Your changelog doesn't really explain why
this fixes anything, it's very vague.

-- 
Jens Axboe



[PATCH 2/2] blk-mq: align to legacy's implementation of blk_execute_rq

2017-08-30 Thread Ming Lei
In the legacy path, when one request is executed via blk_execute_rq(),
the request is added to the front of q->queue_head directly,
and the I/O scheduler's queue is bypassed because neither merging
nor sorting is needed.

When a SCSI device is put into the quiesce state, such as during
system suspend, we need to add the RQF_PM request to the
front of the queue.

This patch fixes an I/O hang after system resume by mirroring
the legacy path's implementation.
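
For reference, the legacy behaviour being mirrored is roughly the following
(recalled from blk_execute_rq_nowait() and __elv_add_request(), quoted from
memory, so double-check against the source):

        /* legacy path: bypass the elevator and honour at_head directly */
        __elv_add_request(q, rq, at_head ? ELEVATOR_INSERT_FRONT :
                                           ELEVATOR_INSERT_BACK);

        /* ELEVATOR_INSERT_FRONT then does:
         *      list_add(&rq->queuelist, &q->queue_head);
         */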

Tested-by: Oleksandr Natalenko 
Reported-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-core.c |  2 +-
 block/blk-exec.c |  2 +-
 block/blk-flush.c|  2 +-
 block/blk-mq-sched.c | 58 
 block/blk-mq-sched.h |  2 ++
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index dbecbf4a64e0..fb75bc646ebc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
-   blk_mq_sched_insert_request(rq, false, true, false, false);
+   blk_mq_sched_insert_request_bypass(rq, false, true, false, 
false);
return BLK_STS_OK;
}
 
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 5c0f3dc446dc..4565aa6bb624 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_sched_insert_request(rq, at_head, true, false, false);
+   blk_mq_sched_insert_request_bypass(rq, at_head, true, false, 
false);
return;
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index ed5fe322abba..51e89e5c525a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
-   blk_mq_sched_insert_request(rq, false, true, false, 
false);
+   blk_mq_sched_insert_request_bypass(rq, false, true, 
false, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 8d97df40fc28..b40dd063d61f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -354,6 +354,64 @@ static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx 
*hctx,
blk_mq_add_to_requeue_list(rq, false, true);
 }
 
+static void blk_mq_flush_hctx(struct blk_mq_hw_ctx *hctx,
+ struct elevator_queue *e,
+ const bool has_sched_dispatch,
+ struct list_head *rqs)
+{
+   LIST_HEAD(list);
+
+   if (!has_sched_dispatch)
+   blk_mq_flush_busy_ctxs(hctx, &list);
+   else {
+   while (true) {
+   struct request *rq;
+
+   rq = e->type->ops.mq.dispatch_request(hctx);
+   if (!rq)
+   break;
+   list_add_tail(&rq->queuelist, &list);
+   }
+   }
+
+   list_splice_tail(&list, rqs);
+}
+
+void blk_mq_sched_insert_request_bypass(struct request *rq, bool at_head,
+   bool run_queue, bool async,
+   bool can_block)
+{
+   struct request_queue *q = rq->q;
+   struct elevator_queue *e = q->elevator;
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+   LIST_HEAD(list);
+   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
+
+   if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
+   blk_mq_sched_insert_flush(hctx, rq, can_block);
+   return;
+   }
+
+   if (at_head)
+   list_add_tail(&rq->queuelist, &list);
+   else {
+   blk_mq_flush_hctx(hctx, e, has_sched_dispatch, &list);
+   list_add_tail(&rq->queuelist, &list);
+   run_queue = true;
+   }
+
+   spin_lock(&hctx->lock);
+   if (at_head)
+   list_splice(&list, &hctx->dispatch);
+   else
+   list_splice_tail(&list, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+
+   if (run_queue)
+   blk_mq_run_hw_queue(hctx, async);
+}
+
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 bool run_queue, bool async, bool can_block)
 {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..4d01697a627f 100644
--- 

[PATCH 0/2] blk-mq: fix I/O hang during system resume

2017-08-30 Thread Ming Lei
Hi,

These two patches fix an I/O hang of SCSI-MQ during system resume.

The cause is that when a SCSI device is put into SCSI's
quiesce state, normal I/O requests can't be dispatched to the lld
any more; only requests with RQF_PREEMPT are allowed to be
sent to the drive.
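
For reference, the quiesce gating is the usual SCSI state check, roughly as
in scsi_prep_state_check() (quoted from memory, so double-check the source):

        case SDEV_QUIESCE:
                /* during quiesce, defer everything except RQF_PREEMPT */
                if (!(req->rq_flags & RQF_PREEMPT))
                        ret = BLKPREP_DEFER;
                break;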

In the current blk-mq implementation, if there is a request in ->dispatch,
no new request can be dispatched to the driver any more.

These two patches fix the issue reported by Oleksandr.

Thanks,
Ming

Ming Lei (2):
  blk-mq: add requests in the tail of hctx->dispatch
  blk-mq: align to legacy's implementation of blk_execute_rq

 block/blk-core.c |  2 +-
 block/blk-exec.c |  2 +-
 block/blk-flush.c|  2 +-
 block/blk-mq-sched.c | 60 +++-
 block/blk-mq-sched.h |  2 ++
 block/blk-mq.c   |  2 +-
 6 files changed, 65 insertions(+), 5 deletions(-)

-- 
2.9.5



[PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch

2017-08-30 Thread Ming Lei
It is more reasonable to add requests to ->dispatch in FIFO
style, instead of LIFO style.

Also, in this way we can allow inserting a request at the front
of the hw queue, which is needed to fix one bug
in blk-mq's implementation of blk_execute_rq().

Reported-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 2 +-
 block/blk-mq.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..8d97df40fc28 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx 
*hctx,
 * the dispatch list.
 */
spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
+   list_add_tail(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
return true;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..fed3d0c16266 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
blk_mq_put_driver_tag(rq);
 
spin_lock(&hctx->lock);
-   list_splice_init(list, &hctx->dispatch);
+   list_splice_tail_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
 
/*
-- 
2.9.5



[PATCH 0/2] blk-mq: fix I/O hang during system resume

2017-08-30 Thread Ming Lei
Hi,

These two patches fix a SCSI I/O hang during system resume.

The cause is that when a SCSI device is put into SCSI's
quiesce state, normal I/O requests can't be dispatched to the lld
any more; only requests with RQF_PREEMPT are allowed to be
sent to the drive.

In the current blk-mq implementation, if there is a request in ->dispatch,
no new request can be dispatched to the driver any more.

These two patches fix the issue reported by Oleksandr.

Thanks,
Ming

Ming Lei (2):
  blk-mq: add requests in the tail of hctx->dispatch
  blk-mq: align to legacy's implementation of blk_execute_rq

 block/blk-core.c |  2 +-
 block/blk-exec.c |  2 +-
 block/blk-flush.c|  2 +-
 block/blk-mq-sched.c | 60 +++-
 block/blk-mq-sched.h |  2 ++
 block/blk-mq.c   |  2 +-
 6 files changed, 65 insertions(+), 5 deletions(-)

-- 
2.9.5



Re: [GIT PULL] last minute NVMe fixes for 4.13

2017-08-30 Thread Jens Axboe
On 08/30/2017 09:07 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> three more fixes for 4.13 below:
> 
>  - fix the incorrect bit for the doorbell buffer features (Changpeng Liu)
>  - always use a 4k MR page size for RDMA, to not get in trouble with
>offset in non-4k page size systems (no-op for x86) (Max Gurtovoy)
>  - and a fix for the new nvme host memory buffer support to keep the
>descriptor list DMA mapped when the buffer is enabled (me)

Pulled, thanks.

-- 
Jens Axboe



Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-30 Thread Sagi Grimberg



Hi Jens,

below is the current set of NVMe updates for Linux 4.14, now against your
postmerge branch, and with three more patches.

The biggest bit comes from Sagi and refactors the RDMA driver to prepare
for more code sharing in the setup and teardown path.  But we have various
features and bug fixes from a lot of people as well.


Pulled, thanks.


I just realized that patch:
--
commit d352ae205d8b05f3f7558d10f474d8436581b3e2
Author: Bart Van Assche 
Date:   Thu Aug 17 16:23:03 2017 -0700

blk-mq: Make blk_mq_reinit_tagset() calls easier to read

Since blk_mq_ops.reinit_request is only called from inside
blk_mq_reinit_tagset(), make this function pointer an argument of
blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
This patch does not change any functionality but makes
blk_mq_reinit_tagset() calls easier to read and to analyze.
--

Makes it impossible for me to move controller reset flow to
nvme-core without adding a trampoline (as the reinit_request
is transport specific)...

I'd really like to avoid adding more churn to the code refactoring
than it already has.

Given that this patch is not fixing anything, should we revert it?


[GIT PULL] last minute NVMe fixes for 4.13

2017-08-30 Thread Christoph Hellwig
Hi Jens,

three more fixes for 4.13 below:

 - fix the incorrect bit for the doorbell buffer features (Changpeng Liu)
 - always use a 4k MR page size for RDMA, to not get in trouble with
   offset in non-4k page size systems (no-op for x86) (Max Gurtovoy)
 - and a fix for the new nvme host memory buffer support to keep the
   descriptor list DMA mapped when the buffer is enabled (me)

The following changes since commit e9d8a0fdeacd843c85dcef480cdb2ab76bcdb6e4:

  nvme-pci: set cqe_seen on polled completions (2017-08-18 09:19:39 +0200)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.13

for you to fetch changes up to 223694b9ae8bfba99f3528d49d07a740af6ff95a:

  nvme: fix the definition of the doorbell buffer config support bit 
(2017-08-30 14:46:32 +0200)


Changpeng Liu (1):
  nvme: fix the definition of the doorbell buffer config support bit

Christoph Hellwig (1):
  nvme-pci: use dma memory for the host memory buffer descriptors

Max Gurtovoy (1):
  nvme-rdma: default MR page size to 4k

 drivers/nvme/host/pci.c  | 22 +++---
 drivers/nvme/host/rdma.c |  8 ++--
 include/linux/nvme.h |  2 +-
 3 files changed, 18 insertions(+), 14 deletions(-)


Re: [PATCH] block, bfq: fix error handle in bfq_init

2017-08-30 Thread Paolo Valente

> On 23 Aug 2017, at 17:25, weiping zhang 
>  wrote:
> 
> On Sat, Aug 19, 2017 at 12:37:20AM +0800, weiping zhang wrote:
>> if elv_register() fails, bfq_pool should be freed.
>> 
>> Signed-off-by: weiping zhang 
>> ---
>> block/bfq-iosched.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 436b6ca..bdb072f 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -5048,10 +5048,12 @@ static int __init bfq_init(void)
>> 
>>  ret = elv_register(&iosched_bfq_mq);
>>  if (ret)
>> -goto err_pol_unreg;
>> +goto slab_kill;
>> 
>>  return 0;
>> 
>> +slab_kill:
>> +bfq_slab_kill();
>> err_pol_unreg:
>> #ifdef CONFIG_BFQ_GROUP_IOSCHED
>>  blkcg_policy_unregister(&blkcg_policy_bfq);
>> -- 
>> 2.9.4
>> 
> Hi Paolo, Jens,
> 
> could you give some comments for this patch ?
> 

Sorry for not making it in time (vacation), and thanks for fixing this
bug.

Paolo

> Thanks



Re: I/O hangs after resuming from suspend-to-ram

2017-08-30 Thread Ming Lei
Hi,

On Wed, Aug 30, 2017 at 12:58:21PM +0200, oleksa...@natalenko.name wrote:
> Hi.
> 
> So, current summary:
> 
> 1) first patch + debug patch: I can reproduce the issue in wild, but not
> with pm_test set;

That is interesting, since I always test via pm_test.

> 2) first patch + debug patch + second patch: I cannot reproduce issue at
> all, neither with "none", nor with "mq-deadline".
> 
> Thus, "blk-mq: align to legacy path for implementing blk_execute_rq" +
> "blk-mq: add requests in the tail of hctx->dispatch" look like a proper fix.
> Thanks for that!

That is great, will prepare a formal one for merge.

> 
> > I setup dm-crypt on /dev/md0 directly, could you show me how to setup
> > lvm on raid10?
> 
> If still relevant, here is a layout:

Looks not relevant then, probably just because I use pm_test.

> 
> ===
> sda                 8:0    0    8G  0 disk
> |-sda1              8:1    0  128M  0 part   /boot/EFI
> `-sda2              8:2    0  7.9G  0 part
>   `-md0             9:0    0  7.9G  0 raid10
>     `-system-root 253:0    0  7.9G  0 lvm    /
> sdb                 8:16   0    8G  0 disk
> |-sdb1              8:17   0  128M  0 part
> `-sdb2              8:18   0  7.9G  0 part
>   `-md0             9:0    0  7.9G  0 raid10
>     `-system-root 253:0    0  7.9G  0 lvm    /
> ===
> 
> Anything else you'd like /me to test?

You have done enough, thanks a lot for your report and test!


-- 
Ming


Re: [RFC] block: deprecate choosing elevator via boot param

2017-08-30 Thread Jens Axboe
On 08/14/2017 03:27 AM, Oleksandr Natalenko wrote:
> Setting the I/O scheduler via the kernel command line is not flexible enough
> anymore. Different schedulers might be desirable for different types
> of devices (SSDs and HDDs, for instance). Moreover, setting the elevator
> this way does not work with the blk-mq framework anyway.
> 
> This commit emits a warning if the user specifies the "elevator" boot param.
> Removing this option altogether might be considered in the future.

Not a huge fan of doing it like this, it'll just add a bunch of noise to
the dmesg output that people will ignore. It doesn't bring us closer to
being able to remove the option.

I think we should just let it die over time. This will happen naturally
since the blk-mq scheduling framework does not support it. Once we have
everything converted, the option will end up doing nothing. Then we can
eventually kill it.

-- 
Jens Axboe



[PATCH] kernfs: checking for IS_ERR() instead of NULL

2017-08-30 Thread Dan Carpenter
kernfs_get_inode() returns NULL on error; it never returns error
pointers.

Fixes: aa8188253474 ("kernfs: add exportfs operations")
Signed-off-by: Dan Carpenter 

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 7c452f4d83e9..95a7c88baed9 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -99,8 +99,8 @@ static struct inode *kernfs_fh_get_inode(struct super_block *sb,
return ERR_PTR(-ESTALE);
inode = kernfs_get_inode(sb, kn);
kernfs_put(kn);
-   if (IS_ERR(inode))
-   return ERR_CAST(inode);
+   if (!inode)
+   return ERR_PTR(-ESTALE);
 
if (generation && inode->i_generation != generation) {
/* we didn't find the right inode.. */
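
For anyone skimming, the two conventions being mixed up look like this
(generic sketch, not kernfs code; errptr_style_lookup() is a made-up
stand-in for a function that returns ERR_PTR values):

        /* ERR_PTR convention: the errno is encoded in the pointer. */
        inode = errptr_style_lookup(sb);
        if (IS_ERR(inode))
                return ERR_CAST(inode);

        /* NULL convention, which kernfs_get_inode() actually follows: check
         * with !ptr and translate only when the caller needs an ERR_PTR. */
        inode = kernfs_get_inode(sb, kn);
        if (!inode)
                return ERR_PTR(-ESTALE);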


Re: [PATCH V6 04/12] mmc: core: Turn off CQE before sending commands

2017-08-30 Thread Ulf Hansson
On 25 August 2017 at 14:43, Adrian Hunter  wrote:
> CQE needs to be off for the host controller to accept non-CQ commands. Turn
> off the CQE before sending commands, and ensure it is off in any reset or
> power management paths, or re-tuning.
>
> Signed-off-by: Adrian Hunter 
> Reviewed-by: Linus Walleij 
> Signed-off-by: Ulf Hansson 

This one was already applied.

Kind regards
Uffe

> ---
>  drivers/mmc/core/core.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 5dd1c00d95f5..66c9cf49ad2f 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -260,6 +260,9 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
>
> trace_mmc_request_start(host, mrq);
>
> +   if (host->cqe_on)
> +   host->cqe_ops->cqe_off(host);
> +
> host->ops->request(host, mrq);
>  }
>
> @@ -979,6 +982,9 @@ int mmc_execute_tuning(struct mmc_card *card)
> if (!host->ops->execute_tuning)
> return 0;
>
> +   if (host->cqe_on)
> +   host->cqe_ops->cqe_off(host);
> +
> if (mmc_card_mmc(card))
> opcode = MMC_SEND_TUNING_BLOCK_HS200;
> else
> @@ -1018,6 +1024,9 @@ void mmc_set_bus_width(struct mmc_host *host, unsigned int width)
>   */
>  void mmc_set_initial_state(struct mmc_host *host)
>  {
> +   if (host->cqe_on)
> +   host->cqe_ops->cqe_off(host);
> +
> mmc_retune_disable(host);
>
> if (mmc_host_is_spi(host))
> --
> 1.9.1
>
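
The same three-line guard appears in each of the paths touched above; as a
helper it would read (hypothetical refactor, not part of the patch):

        static inline void mmc_cqe_off_if_on(struct mmc_host *host)
        {
                if (host->cqe_on)
                        host->cqe_ops->cqe_off(host);
        }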


Re: [PATCH V6 03/12] mmc: host: Add CQE interface

2017-08-30 Thread Ulf Hansson
On 25 August 2017 at 14:43, Adrian Hunter  wrote:
> Add CQE host operations, capabilities, and host members.
>
> Signed-off-by: Adrian Hunter 

I have now replaced the previous version with this one, applied for
next! Thanks!

Kind regards
Uffe

> ---
>  include/linux/mmc/core.h |  6 ++
>  include/linux/mmc/host.h | 53 
> 
>  2 files changed, 59 insertions(+)
>
> diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
> index 178f699ac172..927519385482 100644
> --- a/include/linux/mmc/core.h
> +++ b/include/linux/mmc/core.h
> @@ -156,6 +156,12 @@ struct mmc_request {
> struct completion   completion;
> struct completion   cmd_completion;
> void(*done)(struct mmc_request *);/* completion 
> function */
> +   /*
> +* Notify uppers layers (e.g. mmc block driver) that recovery is 
> needed
> +* due to an error associated with the mmc_request. Currently used 
> only
> +* by CQE.
> +*/
> +   void(*recovery_notifier)(struct mmc_request *);
> struct mmc_host *host;
>
> /* Allow other commands during this ongoing data transfer or busy 
> wait */
> diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
> index e92629518f68..f3f2d07feb2a 100644
> --- a/include/linux/mmc/host.h
> +++ b/include/linux/mmc/host.h
> @@ -162,6 +162,50 @@ struct mmc_host_ops {
>   unsigned int direction, int blk_size);
>  };
>
> +struct mmc_cqe_ops {
> +   /* Allocate resources, and make the CQE operational */
> +   int (*cqe_enable)(struct mmc_host *host, struct mmc_card *card);
> +   /* Free resources, and make the CQE non-operational */
> +   void(*cqe_disable)(struct mmc_host *host);
> +   /*
> +* Issue a read, write or DCMD request to the CQE. Also deal with the
> +* effect of ->cqe_off().
> +*/
> +   int (*cqe_request)(struct mmc_host *host, struct mmc_request 
> *mrq);
> +   /* Free resources (e.g. DMA mapping) associated with the request */
> +   void(*cqe_post_req)(struct mmc_host *host, struct mmc_request 
> *mrq);
> +   /*
> +* Prepare the CQE and host controller to accept non-CQ commands. 
> There
> +* is no corresponding ->cqe_on(), instead ->cqe_request() is required
> +* to deal with that.
> +*/
> +   void(*cqe_off)(struct mmc_host *host);
> +   /*
> +* Wait for all CQE tasks to complete. Return an error if recovery
> +* becomes necessary.
> +*/
> +   int (*cqe_wait_for_idle)(struct mmc_host *host);
> +   /*
> +* Notify CQE that a request has timed out. Return false if the 
> request
> +* completed or true if a timeout happened in which case indicate if
> +* recovery is needed.
> +*/
> +   bool(*cqe_timeout)(struct mmc_host *host, struct mmc_request *mrq,
> +  bool *recovery_needed);
> +   /*
> +* Stop all CQE activity and prepare the CQE and host controller to
> +* accept recovery commands.
> +*/
> +   void(*cqe_recovery_start)(struct mmc_host *host);
> +   /*
> +* Clear the queue and call mmc_cqe_request_done() on all requests.
> +* Requests that errored will have the error set on the mmc_request
> +* (data->error or cmd->error for DCMD).  Requests that did not error
> +* will have zero data bytes transferred.
> +*/
> +   void(*cqe_recovery_finish)(struct mmc_host *host);
> +};
> +
>  struct mmc_async_req {
> /* active mmc request */
> struct mmc_request  *mrq;
> @@ -303,6 +347,8 @@ struct mmc_host {
>  #define MMC_CAP2_HS400_ES  (1 << 20)   /* Host supports enhanced 
> strobe */
>  #define MMC_CAP2_NO_SD (1 << 21)   /* Do not send SD commands 
> during initialization */
>  #define MMC_CAP2_NO_MMC(1 << 22)   /* Do not send (e)MMC 
> commands during initialization */
> +#define MMC_CAP2_CQE   (1 << 23)   /* Has eMMC command queue 
> engine */
> +#define MMC_CAP2_CQE_DCMD  (1 << 24)   /* CQE can issue a direct 
> command */
>
> mmc_pm_flag_t   pm_caps;/* supported pm features */
>
> @@ -386,6 +432,13 @@ struct mmc_host {
> int dsr_req;/* DSR value is valid */
> u32 dsr;/* optional driver stage (DSR) value 
> */
>
> +   /* Command Queue Engine (CQE) support */
> +   const struct mmc_cqe_ops *cqe_ops;
> +   void*cqe_private;
> +   int cqe_qdepth;
> +   boolcqe_enabled;
> +   boolcqe_on;
> +
> unsigned long   private[0] cacheline_aligned;
>  };
>
> --
> 1.9.1
>
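
To give a sense of how a host driver consumes the interface added above, a
minimal hookup might look like the sketch below; all foo_* names are
invented (their definitions are omitted), and only the mmc_host fields and
MMC_CAP2_* flags come from the quoted patch:

        static const struct mmc_cqe_ops foo_cqe_ops = {
                .cqe_enable             = foo_cqe_enable,
                .cqe_disable            = foo_cqe_disable,
                .cqe_request            = foo_cqe_request,
                .cqe_post_req           = foo_cqe_post_req,
                .cqe_off                = foo_cqe_off,
                .cqe_wait_for_idle      = foo_cqe_wait_for_idle,
                .cqe_timeout            = foo_cqe_timeout,
                .cqe_recovery_start     = foo_cqe_recovery_start,
                .cqe_recovery_finish    = foo_cqe_recovery_finish,
        };

        static int foo_probe(struct platform_device *pdev)
        {
                struct mmc_host *mmc;

                mmc = mmc_alloc_host(sizeof(struct foo_host), &pdev->dev);
                if (!mmc)
                        return -ENOMEM;

                mmc->cqe_ops = &foo_cqe_ops;
                mmc->cqe_qdepth = 32;   /* depth supported by the engine */
                mmc->caps2 |= MMC_CAP2_CQE | MMC_CAP2_CQE_DCMD;

                return mmc_add_host(mmc);
        }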


Re: [PATCH V6 01/12] mmc: core: Move mmc_start_areq() declaration

2017-08-30 Thread Ulf Hansson
On 25 August 2017 at 14:43, Adrian Hunter  wrote:
> mmc_start_areq() is an internal mmc core API. Move the declaration
> accordingly.
>
> Signed-off-by: Adrian Hunter 

Thanks, applied for next!

Kind regards
Uffe

> ---
>  drivers/mmc/core/core.h  | 6 ++
>  include/linux/mmc/core.h | 4 
>  2 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
> index 55f543fd37c4..ca861091a776 100644
> --- a/drivers/mmc/core/core.h
> +++ b/drivers/mmc/core/core.h
> @@ -107,6 +107,12 @@ static inline void mmc_unregister_pm_notifier(struct mmc_host *host) { }
>  void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
>  bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
>
> +struct mmc_async_req;
> +
> +struct mmc_async_req *mmc_start_areq(struct mmc_host *host,
> +struct mmc_async_req *areq,
> +enum mmc_blk_status *ret_stat);
> +
>  int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr,
> unsigned int arg);
>  int mmc_can_erase(struct mmc_card *card);
> diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
> index bf1788a224e6..178f699ac172 100644
> --- a/include/linux/mmc/core.h
> +++ b/include/linux/mmc/core.h
> @@ -165,11 +165,7 @@ struct mmc_request {
>  };
>
>  struct mmc_card;
> -struct mmc_async_req;
>
> -struct mmc_async_req *mmc_start_areq(struct mmc_host *host,
> -   struct mmc_async_req *areq,
> -   enum mmc_blk_status *ret_stat);
>  void mmc_wait_for_req(struct mmc_host *host, struct mmc_request *mrq);
>  int mmc_wait_for_cmd(struct mmc_host *host, struct mmc_command *cmd,
> int retries);
> --
> 1.9.1
>


Re: [PATCH V6 02/12] mmc: block: Fix block status codes

2017-08-30 Thread Ulf Hansson
On 25 August 2017 at 14:43, Adrian Hunter  wrote:
> Commit 2a842acab109 ("block: introduce new block status code type") changed
> the error type but not in patches merged through the mmc tree, like
> commit 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver
> op"). Fix one error code that is incorrect and also use BLK_STS_OK in
> preference to 0.
>
> Fixes: 17ece345a042 ("Merge tag 'mmc-v4.13' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc")
> Signed-off-by: Adrian Hunter 

Thanks, applied for fixes!

Kind regards
Uffe

> ---
>  drivers/mmc/core/block.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 0eebc2f726c3..2f00d11adfa9 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -1223,7 +1223,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
> break;
> }
> mq_rq->drv_op_result = ret;
> -   blk_end_request_all(req, ret);
> +   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
>  }
>
>  static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request 
> *req)
> @@ -1728,9 +1728,9 @@ static bool mmc_blk_rw_cmd_err(struct mmc_blk_data *md, struct mmc_card *card,
> if (err)
> req_pending = old_req_pending;
> else
> -   req_pending = blk_end_request(req, 0, blocks << 9);
> +   req_pending = blk_end_request(req, BLK_STS_OK, blocks 
> << 9);
> } else {
> -   req_pending = blk_end_request(req, 0, brq->data.bytes_xfered);
> +   req_pending = blk_end_request(req, BLK_STS_OK, 
> brq->data.bytes_xfered);
> }
> return req_pending;
>  }
> --
> 1.9.1
>
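
For context on the status-code convention involved here (a sketch of the
blk_status_t usage introduced by commit 2a842acab109; mmc_blk_do_op() is a
made-up stand-in for a driver operation returning 0 or a negative errno):

        int err = mmc_blk_do_op(card, req);

        /* requests now complete with a blk_status_t, not a raw errno or 0: */
        blk_end_request_all(req, err ? BLK_STS_IOERR : BLK_STS_OK);

        /* or, when the errno should map to a more specific status: */
        blk_end_request_all(req, errno_to_blk_status(err));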


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-30 Thread Sergey Senozhatsky
Hello Peter,

On (08/30/17 10:47), Peter Zijlstra wrote:
[..]
> On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
> > 
> > So the overhead looks to be spread out over all sorts, which makes it
> > harder to find and fix.
> > 
> > stack unwinding is done lots and is fairly expensive, I've not yet
> > checked if crossrelease does too much of that.
> 
> Aah, we do an unconditional stack unwind for every __lock_acquire() now.
> It keeps a trace in the xhlocks[].
> 
> Does the below cure most of that overhead?

thanks.

mostly yes. the kernel is not so dramatically slower now. it still
seems to be a bit slower, which is expected I suppose, but overall
it's much better

real    1m35.209s
user    4m12.467s
sys     0m49.457s

// approx 10 seconds slower.

-ss


Re: I/O hangs after resuming from suspend-to-ram

2017-08-30 Thread oleksandr

Hi.

So, current summary:

1) first patch + debug patch: I can reproduce the issue in wild, but not 
with pm_test set;
2) first patch + debug patch + second patch: I cannot reproduce issue at 
all, neither with "none", nor with "mq-deadline".


Thus, "blk-mq: align to legacy path for implementing blk_execute_rq" + 
"blk-mq: add requests in the tail of hctx->dispatch" look like a proper 
fix. Thanks for that!


I set up dm-crypt on /dev/md0 directly; could you show me how to set up 
LVM on RAID10?


If still relevant, here is a layout:

===
sda                 8:0    0    8G  0 disk
|-sda1              8:1    0  128M  0 part   /boot/EFI
`-sda2              8:2    0  7.9G  0 part
  `-md0             9:0    0  7.9G  0 raid10
    `-system-root 253:0    0  7.9G  0 lvm    /
sdb                 8:16   0    8G  0 disk
|-sdb1              8:17   0  128M  0 part
`-sdb2              8:18   0  7.9G  0 part
  `-md0             9:0    0  7.9G  0 raid10
    `-system-root 253:0    0  7.9G  0 lvm    /
===

Anything else you'd like /me to test?

On 30.08.2017 10:06, Ming Lei wrote:

…SNIP…
Please try the following patch and previous patch together and see
if they make a difference:
…SNIP…


RE: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-30 Thread Byungchul Park
> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Wednesday, August 30, 2017 5:48 PM
> To: Sergey Senozhatsky
> Cc: Byungchul Park; Bart Van Assche; linux-ker...@vger.kernel.org; linux-
> bl...@vger.kernel.org; martin.peter...@oracle.com; ax...@kernel.dk; linux-
> s...@vger.kernel.org; s...@canb.auug.org.au; linux-n...@vger.kernel.org;
> kernel-t...@lge.com
> Subject: Re: possible circular locking dependency detected [was: linux-
> next: Tree for Aug 22]
> 
> On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
> >
> > So the overhead looks to be spread out over all sorts, which makes it
> > harder to find and fix.
> >
> > stack unwinding is done lots and is fairly expensive, I've not yet
> > checked if crossrelease does too much of that.
> 
> Aah, we do an unconditional stack unwind for every __lock_acquire() now.
> It keeps a trace in the xhlocks[].

Yeah.. I also think this is most significant..

> 
> Does the below cure most of that overhead?
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 44c8d0d17170..7b872036b72e 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -4872,7 +4872,7 @@ static void add_xhlock(struct held_lock *hlock)
>   xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
>   xhlock->trace.entries = xhlock->trace_entries;
>   xhlock->trace.skip = 3;
> - save_stack_trace(&xhlock->trace);
> + /* save_stack_trace(&xhlock->trace); */
>  }
> 
>  static inline int same_context_xhlock(struct hist_lock *xhlock)



Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-30 Thread Peter Zijlstra
On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
> 
> So the overhead looks to be spread out over all sorts, which makes it
> harder to find and fix.
> 
> stack unwinding is done lots and is fairly expensive, I've not yet
> checked if crossrelease does too much of that.

Aah, we do an unconditional stack unwind for every __lock_acquire() now.
It keeps a trace in the xhlocks[].

Does the below cure most of that overhead?

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 44c8d0d17170..7b872036b72e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4872,7 +4872,7 @@ static void add_xhlock(struct held_lock *hlock)
xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
xhlock->trace.entries = xhlock->trace_entries;
xhlock->trace.skip = 3;
-   save_stack_trace(&xhlock->trace);
+   /* save_stack_trace(&xhlock->trace); */
 }
 
 static inline int same_context_xhlock(struct hist_lock *xhlock)


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-30 Thread Peter Zijlstra
On Wed, Aug 30, 2017 at 03:15:11PM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (08/30/17 14:43), Byungchul Park wrote:
> [..]
> > > notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
> > > is irritatingly slow)
> > 
> > To Ingo,
> > 
> > I cannot decide if we have to roll back CONFIG_LOCKDEP_CROSSRELEASE
> > dependency on CONFIG_PROVE_LOCKING in Kconfig. With them enabled,
> > lockdep detection becomes stronger but has a performance impact. But
> > it's a debug option anyway, so IMHO we don't have to take care of the
> > performance impact. Please let me know your decision.
> 
> well, I expected it :)
> 
> I've been running lockdep enabled kernels for years, and was OK with
> the performance. but now it's just too much and I'm looking at disabling
> lockdep.
> 
> a more relevant test -- compilation of a relatively small project
> 
>   LOCKDEP -CROSSRELEASE -COMPLETIONS    LOCKDEP +CROSSRELEASE +COMPLETIONS
> 
>   real    1m23.722s                     real    2m9.969s
>   user    4m11.300s                     user    4m15.458s
>   sys     0m49.386s                     sys     2m3.594s
> 
> 
> you don't want to know how much time now it takes to recompile the
> kernel ;)

Right,.. so when I look at perf annotate for __lock_acquire and
lock_release (the two most expensive lockdep functions in a kernel
profile) I don't actually see much cross-release stuff.

So the overhead looks to be spread out over all sorts, which makes it
harder to find and fix.

stack unwinding is done lots and is fairly expensive, I've not yet
checked if crossrelease does too much of that.

The below saved about 50% of my __lock_acquire() time, not sure it made
a significant difference over all though.

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 44c8d0d17170..f8db1ead1c48 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3386,7 +3386,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
if (!class)
return 0;
}
-   atomic_inc((atomic_t *)&class->ops);
+   /* atomic_inc((atomic_t *)&class->ops); */
if (very_verbose(class)) {
printk("\nacquire class [%p] %s", class->key, class->name);
if (class->name_version > 1)


Re: I/O hangs after resuming from suspend-to-ram

2017-08-30 Thread Ming Lei
Hi,

On Wed, Aug 30, 2017 at 08:15:02AM +0200, oleksa...@natalenko.name wrote:
> Hello.
> 
> Addressing your questions below.
> 
> > Can't reproduce even with putting dmcypt on raid10 after applying my
> > patch.
> 
> Just a side note: dm-crypt is not necessary here; I am able to trigger the
> hang with RAID10 and LVM only.
> 
> > BTW, could you share us which blk-mq scheduler you are using on sata?
> 
> Okay, now it makes sense. Previously, without your patch I was able to
> reproduce the issue with "mq-deadline", "bfq" and "none". Now, I cannot
> trigger it with "none" set, but "mq-deadline" still hangs for me. Does this
> mean each scheduler should be modified separately?

No, it shouldn't.

> 
> > Could you apply the following debug patch and provide the dmesg log
> > after running the commands below?
> 
> Is it still relevant, since I confirmed the issue is fixed with "none"?

Please try the following patch and previous patch together and see
if they make a difference:


From bc5626b4b65c7ff26567e75f42584c2c43ffe7c1 Mon Sep 17 00:00:00 2001
From: Ming Lei 
Date: Wed, 30 Aug 2017 15:53:01 +0800
Subject: [PATCH] blk-mq: add requests in the tail of hctx->dispatch

So that we can allow inserting a request at the head
of hctx->dispatch.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 2 +-
 block/blk-mq.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a026fb47..b40dd063d61f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 * the dispatch list.
 */
spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
+   list_add_tail(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
return true;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..fed3d0c16266 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
blk_mq_put_driver_tag(rq);
 
spin_lock(&hctx->lock);
-   list_splice_init(list, &hctx->dispatch);
+   list_splice_tail_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
 
/*
-- 
2.9.5


-- 
Ming
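
A quick aside on the list calls the patch switches over (plain
<linux/list.h> semantics, not code from the patch; normal_rq and urgent_rq
are hypothetical): ordinary requeues now go to the back of ->dispatch, so a
later head insertion is still the first request dispatched.

        LIST_HEAD(dispatch);

        list_add_tail(&normal_rq->queuelist, &dispatch); /* back of the line */
        list_add(&urgent_rq->queuelist, &dispatch);      /* dispatched first */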


[PATCH] partitions/ibm: call ->getgeo directly

2017-08-30 Thread Christoph Hellwig
Trying to get rid of as many ioctl_by_bdev callers as possible, to reduce
the number of set_fs instances.

Signed-off-by: Christoph Hellwig 
---
 block/partitions/ibm.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block/partitions/ibm.c b/block/partitions/ibm.c
index 14b081af8d61..3ef8bfa3db9d 100644
--- a/block/partitions/ibm.c
+++ b/block/partitions/ibm.c
@@ -282,7 +282,6 @@ static int find_cms1_partitions(struct parsed_partitions *state,
return 1;
 }
 
-
 /*
  * This is the main function, called by check.c
  */
@@ -292,7 +291,7 @@ int ibm_partition(struct parsed_partitions *state)
int blocksize, res;
loff_t i_size, offset, size;
dasd_information2_t *info;
-   struct hd_geometry *geo;
+   struct hd_geometry geo = { 0, };
char type[5] = {0,};
char name[7] = {0,};
sector_t labelsect;
@@ -308,30 +307,30 @@ int ibm_partition(struct parsed_partitions *state)
info = kmalloc(sizeof(dasd_information2_t), GFP_KERNEL);
if (info == NULL)
goto out_exit;
-   geo = kmalloc(sizeof(struct hd_geometry), GFP_KERNEL);
-   if (geo == NULL)
-   goto out_nogeo;
label = kmalloc(sizeof(union label_t), GFP_KERNEL);
if (label == NULL)
goto out_nolab;
-   if (ioctl_by_bdev(bdev, HDIO_GETGEO, (unsigned long)geo) != 0)
+   if (!bdev->bd_disk->fops->getgeo)
+   goto out_freeall;
+   geo.start = get_start_sect(bdev);
+   if (bdev->bd_disk->fops->getgeo(bdev, &geo) != 0)
goto out_freeall;
if (ioctl_by_bdev(bdev, BIODASDINFO2, (unsigned long)info) != 0) {
kfree(info);
info = NULL;
}
 
-   if (find_label(state, info, geo, blocksize, &labelsect, name, type,
+   if (find_label(state, info, &geo, blocksize, &labelsect, name, type,
   label)) {
if (!strncmp(type, "VOL1", 4)) {
-   res = find_vol1_partitions(state, geo, blocksize, name,
+   res = find_vol1_partitions(state, &geo, blocksize, name,
   label);
} else if (!strncmp(type, "LNX1", 4)) {
-   res = find_lnx1_partitions(state, geo, blocksize, name,
+   res = find_lnx1_partitions(state, &geo, blocksize, name,
   label, labelsect, i_size,
   info);
} else if (!strncmp(type, "CMS1", 4)) {
-   res = find_cms1_partitions(state, geo, blocksize, name,
+   res = find_cms1_partitions(state, &geo, blocksize, name,
   label, labelsect);
}
} else if (info) {
@@ -356,8 +355,6 @@ int ibm_partition(struct parsed_partitions *state)
 out_freeall:
kfree(label);
 out_nolab:
-   kfree(geo);
-out_nogeo:
kfree(info);
 out_exit:
return res;
-- 
2.11.0
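
The pattern the patch moves to, shown in isolation (the helper name and the
pr_debug() consumer are invented; the bdev accessors are the ones used in
the hunk above):

        static int report_geometry(struct block_device *bdev)
        {
                struct hd_geometry geo = { 0, };

                if (!bdev->bd_disk->fops->getgeo)
                        return -ENOTTY;         /* driver has no geometry op */

                geo.start = get_start_sect(bdev);
                if (bdev->bd_disk->fops->getgeo(bdev, &geo) != 0)
                        return -EIO;

                pr_debug("heads=%u sectors=%u cylinders=%u start=%lu\n",
                         geo.heads, geo.sectors, geo.cylinders, geo.start);
                return 0;
        }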



Re: [PATCH V6 00/12] mmc: Add Command Queue support

2017-08-30 Thread Adrian Hunter
On 25/08/17 15:43, Adrian Hunter wrote:
> Here is V6 of the hardware command queue patches without the software
> command queue patches.  Patches "mmc: host: Add CQE interface" and
> "mmc: core: Turn off CQE before sending commands" have been applied to Ulf's
> next branch.  "mmc: host: Add CQE interface" needs to be dropped in favour
> of the new version.

Any comments?