[PATCH BUGFIX/IMPROVEMENT V2 1/3] block, bfq: make lookup_next_entity push up vtime on expirations
To provide a very smooth service, bfq starts to serve a bfq_queue
only if the queue is 'eligible', i.e., if the same queue would
have started to be served in the ideal, perfectly fair system that
bfq simulates internally. This is obtained by associating each
queue with a virtual start time, and by computing a special system
virtual time quantity: a queue is eligible only if the system
virtual time has reached the virtual start time of the queue.
Finally, bfq guarantees that, when a new queue must be set in
service, there is always at least one eligible entity for each
active parent entity in the scheduler. To provide this guarantee,
the function __bfq_lookup_next_entity pushes up, for each parent
entity on which it is invoked, the system virtual time to the
minimum among the virtual start times of the entities in the
active tree for the parent entity (more precisely, the push-up
occurs if the system virtual time happens to be lower than all
such virtual start times).

There is however a circumstance in which __bfq_lookup_next_entity
cannot push up the system virtual time for a parent entity, even
if the system virtual time is lower than the virtual start times
of all the child entities in the active tree. It happens if one of
the child entities is in service. In fact, in such a case, there
is already an eligible entity, the in-service one, even if it may
not be present in the active tree (because in-service entities may
be removed from the active tree).

Unfortunately, in the last re-design of the
hierarchical-scheduling engine, the reset of the pointer to the
in-service entity for a given parent entity (a reset to be
performed as a consequence of the expiration of the in-service
entity) always happens after the function
__bfq_lookup_next_entity has been invoked. This causes the
function to think that there is still an entity in service for the
parent entity, and then that the system virtual time cannot be
pushed up, even if such a no-more-in-service entity has actually
already been properly reinserted into the active tree (or into
some other tree if no longer active). Yet the system virtual time
*had* to be pushed up, to be ready to correctly choose the next
queue to serve. Because of the lack of this push-up, bfq may
wrongly set in service a queue that had been speculatively
pre-computed as the possible next-in-service queue, but that would
no longer be the one to serve after the expiration and the
reinsertion into the active trees of the previously in-service
entities.

This commit addresses this issue by making
__bfq_lookup_next_entity properly push up the system virtual time
if an expiration is occurring.

Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
---
 block/bfq-iosched.c |  4 ++--
 block/bfq-iosched.h |  4 ++--
 block/bfq-wf2q.c    | 58 +++--
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..a10f147 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -720,7 +720,7 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 		entity->budget = new_budget;
 		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
 			     new_budget);
-		bfq_requeue_bfqq(bfqd, bfqq);
+		bfq_requeue_bfqq(bfqd, bfqq, false);
 	}
 }
 
@@ -2563,7 +2563,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 		bfq_del_bfqq_busy(bfqd, bfqq, true);
 	} else {
-		bfq_requeue_bfqq(bfqd, bfqq);
+		bfq_requeue_bfqq(bfqd, bfqq, true);
 		/*
 		 * Resort priority tree of potential close cooperators.
 		 */
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 859f0a8..3e2659c 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -817,7 +817,6 @@ extern const int bfq_timeout;
 struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
 void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
 			  struct rb_root *root);
@@ -917,7 +916,8 @@ void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
 void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			 bool ins_into_idle_tree, bool expiration);
 void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		      bool expiration);
 void bfq_del_bfqq_busy(struct bfq_data *bfqd
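The eligibility rule and the virtual-time push-up described in the commit message can be sketched in plain user-space C. This is an illustrative model only, not the bfq kernel code: all names here (`sched_data`, `maybe_push_up_vtime`, the array standing in for the active tree) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct entity {
	unsigned long start;	/* virtual start time of the entity */
};

struct sched_data {
	unsigned long vtime;		/* system virtual time */
	const struct entity *in_service;	/* NULL if no entity in service */
	const struct entity *active;	/* active "tree", modeled as an array */
	size_t nr_active;
};

/* An entity is eligible once the system virtual time has reached its
 * virtual start time. */
static bool is_eligible(const struct sched_data *sd, const struct entity *e)
{
	return sd->vtime >= e->start;
}

/* Push the system virtual time up to the minimum virtual start time of
 * the active entities, so that at least one of them becomes eligible.
 * The push-up is skipped while an entity is in service, because that
 * entity is already eligible.  This is exactly why, in the patch, the
 * in-service pointer must be reset before the lookup runs on
 * expiration: otherwise the push-up is wrongly skipped. */
static void maybe_push_up_vtime(struct sched_data *sd)
{
	unsigned long min_start;
	size_t i;

	if (sd->in_service || sd->nr_active == 0)
		return;

	min_start = sd->active[0].start;
	for (i = 1; i < sd->nr_active; i++)
		if (sd->active[i].start < min_start)
			min_start = sd->active[i].start;

	if (sd->vtime < min_start)
		sd->vtime = min_start;
}
```

With no in-service entity, a vtime of 10 against active start times {20, 30} is pushed up to 20, making the first entity eligible; setting an in-service entity suppresses the push-up, mirroring the constraint the patch works around.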
[PATCH BUGFIX/IMPROVEMENT V2 2/3] block, bfq: remove direct switch to an entity in higher class
If the function bfq_update_next_in_service is invoked as a
consequence of the activation or requeueing of an entity, say E,
and finds out that E belongs to a higher-priority class than that
of the current next-in-service entity, then it sets
next_in_service directly to E. But this may lead to anomalous
schedules, because E may happen not to be eligible for service:
its virtual start time may be higher than the system virtual time
for its service tree.

This commit addresses this issue by simply removing this direct
switch.

Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
---
 block/bfq-wf2q.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 138732e..eeaf326 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -86,9 +86,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 	 * or repositiong of an entity that does not coincide with
 	 * sd->next_in_service, then a full lookup in the active tree
 	 * can be avoided. In fact, it is enough to check whether the
-	 * just-modified entity has a higher priority than
-	 * sd->next_in_service, or, even if it has the same priority
-	 * as sd->next_in_service, is eligible and has a lower virtual
+	 * just-modified entity has the same priority as
+	 * sd->next_in_service, is eligible and has a lower virtual
 	 * finish time than sd->next_in_service. If this compound
 	 * condition holds, then the new entity becomes the new
 	 * next_in_service. Otherwise no change is needed.
@@ -104,9 +103,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 		/*
 		 * If there is already a next_in_service candidate
-		 * entity, then compare class priorities or timestamps
-		 * to decide whether to replace sd->service_tree with
-		 * new_entity.
+		 * entity, then compare timestamps to decide whether
+		 * to replace sd->service_tree with new_entity.
 		 */
 		if (next_in_service) {
 			unsigned int new_entity_class_idx =
@@ -114,10 +112,6 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 			struct bfq_service_tree *st =
 				sd->service_tree + new_entity_class_idx;
 
-			/*
-			 * For efficiency, evaluate the most likely
-			 * sub-condition first.
-			 */
 			replace_next =
 				(new_entity_class_idx ==
 				 bfq_class_idx(next_in_service)
@@ -125,10 +119,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 				 !bfq_gt(new_entity->start, st->vtime)
 				 &&
 				 bfq_gt(next_in_service->finish,
-					new_entity->finish))
-				||
-				new_entity_class_idx <
-				bfq_class_idx(next_in_service);
+					new_entity->finish));
 		}
 
 		if (replace_next)
-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT V2 3/3] block, bfq: guarantee update_next_in_service always returns an eligible entity
If the function bfq_update_next_in_service is invoked as a
consequence of the activation or requeueing of an entity, say E,
then it doesn't invoke bfq_lookup_next_entity to get the
next-in-service entity. In contrast, it follows a shorter path: if
E happens to be eligible (see commit "bfq-sq-mq: make
lookup_next_entity push up vtime on expirations" for details on
eligibility) and to have a lower virtual finish time than the
current candidate as next-in-service entity, then E directly
becomes the next-in-service entity.

Unfortunately, there is a corner case for which this shorter path
makes bfq_update_next_in_service choose a non-eligible entity: it
occurs if both E and the current next-in-service entity happen to
be non-eligible when bfq_update_next_in_service is invoked. In
this case, E is not set as next-in-service, and, since
bfq_lookup_next_entity is not invoked, the state of the parent
entity is not updated so as to end up with an eligible entity as
the proper next-in-service entity.

In this respect, next-in-service is actually allowed to be
non-eligible while some queue is in service: since no
system-virtual-time push-up can be performed in that case (see
again commit "bfq-sq-mq: make lookup_next_entity push up vtime on
expirations" for details), next-in-service is chosen,
speculatively, as a function of the possible value that the system
virtual time may get after a push-up. But the correctness of the
schedule breaks if next-in-service is still a non-eligible entity
when it is time to set in service the next entity. Unfortunately,
this may happen in the above corner case.

This commit fixes this problem by making
bfq_update_next_in_service invoke bfq_lookup_next_entity not only
if the above shorter path cannot be taken, but also if the shorter
path is taken but fails to yield an eligible next-in-service
entity.
Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksandr Natalenko
---
 block/bfq-wf2q.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index eeaf326..add54f2 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -80,6 +80,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 {
 	struct bfq_entity *next_in_service = sd->next_in_service;
 	bool parent_sched_may_change = false;
+	bool change_without_lookup = false;
 
 	/*
 	 * If this update is triggered by the activation, requeueing
@@ -99,7 +100,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 		 * set to true, and left as true if
 		 * sd->next_in_service is NULL.
 		 */
-		bool replace_next = true;
+		change_without_lookup = true;
 
 		/*
 		 * If there is already a next_in_service candidate
@@ -112,7 +113,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 			struct bfq_service_tree *st =
 				sd->service_tree + new_entity_class_idx;
 
-			replace_next =
+			change_without_lookup =
 				(new_entity_class_idx ==
 				 bfq_class_idx(next_in_service)
 				 &&
@@ -122,15 +123,16 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 					new_entity->finish));
 		}
 
-		if (replace_next)
+		if (change_without_lookup)
 			next_in_service = new_entity;
-	} else /* invoked because of a deactivation: lookup needed */
+	}
+
+	if (!change_without_lookup) /* lookup needed */
 		next_in_service = bfq_lookup_next_entity(sd, expiration);
 
-	if (next_in_service) {
+	if (next_in_service)
 		parent_sched_may_change = !sd->next_in_service ||
 			bfq_update_parent_budget(next_in_service);
-	}
 
 	sd->next_in_service = next_in_service;
-- 
2.10.0
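The resulting control flow (take the shortcut only when it produces an eligible candidate, otherwise fall back to the full lookup) can be modeled in a few lines of user-space C. All names below are hypothetical and the structures are drastically simplified with respect to bfq; this is a sketch of the decision logic, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct entity {
	unsigned long start;	/* virtual start time */
	unsigned long finish;	/* virtual finish time */
};

static bool eligible(unsigned long vtime, const struct entity *e)
{
	return e->start <= vtime;
}

/*
 * Update of the next-in-service entity on an activation/requeue of
 * new_entity.  The shortcut is taken only when it yields an eligible
 * candidate; in every other case, including the corner case in which
 * both entities are non-eligible, a full lookup runs.  That fallback
 * is the behavior this patch introduces.
 */
static const struct entity *
update_next_in_service(unsigned long vtime, const struct entity *cur,
		       const struct entity *new_entity,
		       const struct entity *(*lookup)(void))
{
	bool change_without_lookup = false;

	if (new_entity) {
		/* with no candidate yet, the new entity wins by default */
		change_without_lookup =
			!cur ||
			(eligible(vtime, new_entity) &&
			 new_entity->finish < cur->finish);
		if (change_without_lookup)
			cur = new_entity;
	}
	if (!change_without_lookup)
		cur = lookup();	/* assumed to return an eligible entity */
	return cur;
}
```

When both the candidate and the current next-in-service entity are non-eligible, `change_without_lookup` stays false and the lookup (which may push the virtual time up) decides; when the new entity is eligible with an earlier finish time, it is taken directly.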
[PATCH BUGFIX/IMPROVEMENT V2 0/3] three bfq fixes restoring service guarantees with random sync writes in bg
[SECOND TAKE, with just the name of one of the testers fixed]

Hi,
while testing the read-write unfairness issues reported by Mel, I
found BFQ failing to guarantee good responsiveness against heavy
random sync writes in the background, i.e., multiple writers doing
random writes and systematic fdatasync [1]. The failure was caused
by three related bugs, because of which BFQ failed to guarantee to
high-weight processes the expected fraction of the throughput.

The three patches in this series fix these bugs. These fixes
restore the usual BFQ service guarantees (and thus optimal
responsiveness too), against the above background workload and,
probably, against other similar workloads.

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/9/957

Paolo Valente (3):
  block, bfq: make lookup_next_entity push up vtime on expirations
  block, bfq: remove direct switch to an entity in higher class
  block, bfq: guarantee update_next_in_service always returns an
    eligible entity

 block/bfq-iosched.c |  4 +--
 block/bfq-iosched.h |  4 +--
 block/bfq-wf2q.c    | 91 -
 3 files changed, 60 insertions(+), 39 deletions(-)

-- 
2.10.0
Re: [PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg
> On 31 Aug 2017, at 08:41, oleksa...@natalenko.name wrote:
>
>> Tested-by: Oleksander Natalenko
>
> I'm "Oleksandr" :).
>

Sorry, resending ...

Paolo

> 31.08.2017 08:10, Paolo Valente wrote:
>> Hi,
>> while testing the read-write unfairness issues reported by Mel, I
>> found BFQ failing to guarantee good responsiveness against heavy
>> random sync writes in the background, i.e., multiple writers doing
>> random writes and systematic fdatasync [1]. The failure was caused by
>> three related bugs, because of which BFQ failed to guarantee to
>> high-weight processes the expected fraction of the throughput.
>> The three patches in this series fix these bugs. These fixes restore
>> the usual BFQ service guarantees (and thus optimal responsiveness
>> too), against the above background workload and, probably, against
>> other similar workloads.
>> Thanks,
>> Paolo
>> [1] https://lkml.org/lkml/2017/8/9/957
>> Paolo Valente (3):
>>   block, bfq: make lookup_next_entity push up vtime on expirations
>>   block, bfq: remove direct switch to an entity in higher class
>>   block, bfq: guarantee update_next_in_service always returns an
>>     eligible entity
>>  block/bfq-iosched.c |  4 +--
>>  block/bfq-iosched.h |  4 +--
>>  block/bfq-wf2q.c    | 91 -
>>  3 files changed, 60 insertions(+), 39 deletions(-)
>> --
>> 2.10.0
Re: [PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg
> Tested-by: Oleksander Natalenko

I'm "Oleksandr" :).

31.08.2017 08:10, Paolo Valente wrote:
> Hi,
> while testing the read-write unfairness issues reported by Mel, I
> found BFQ failing to guarantee good responsiveness against heavy
> random sync writes in the background, i.e., multiple writers doing
> random writes and systematic fdatasync [1]. The failure was caused by
> three related bugs, because of which BFQ failed to guarantee to
> high-weight processes the expected fraction of the throughput.
> The three patches in this series fix these bugs. These fixes restore
> the usual BFQ service guarantees (and thus optimal responsiveness
> too), against the above background workload and, probably, against
> other similar workloads.
> Thanks,
> Paolo
> [1] https://lkml.org/lkml/2017/8/9/957
> Paolo Valente (3):
>   block, bfq: make lookup_next_entity push up vtime on expirations
>   block, bfq: remove direct switch to an entity in higher class
>   block, bfq: guarantee update_next_in_service always returns an
>     eligible entity
>  block/bfq-iosched.c |  4 +--
>  block/bfq-iosched.h |  4 +--
>  block/bfq-wf2q.c    | 91 -
>  3 files changed, 60 insertions(+), 39 deletions(-)
> --
> 2.10.0
[PATCH BUGFIX/IMPROVEMENT 3/3] block, bfq: guarantee update_next_in_service always returns an eligible entity
If the function bfq_update_next_in_service is invoked as a
consequence of the activation or requeueing of an entity, say E,
then it doesn't invoke bfq_lookup_next_entity to get the
next-in-service entity. In contrast, it follows a shorter path: if
E happens to be eligible (see commit "bfq-sq-mq: make
lookup_next_entity push up vtime on expirations" for details on
eligibility) and to have a lower virtual finish time than the
current candidate as next-in-service entity, then E directly
becomes the next-in-service entity.

Unfortunately, there is a corner case for which this shorter path
makes bfq_update_next_in_service choose a non-eligible entity: it
occurs if both E and the current next-in-service entity happen to
be non-eligible when bfq_update_next_in_service is invoked. In
this case, E is not set as next-in-service, and, since
bfq_lookup_next_entity is not invoked, the state of the parent
entity is not updated so as to end up with an eligible entity as
the proper next-in-service entity.

In this respect, next-in-service is actually allowed to be
non-eligible while some queue is in service: since no
system-virtual-time push-up can be performed in that case (see
again commit "bfq-sq-mq: make lookup_next_entity push up vtime on
expirations" for details), next-in-service is chosen,
speculatively, as a function of the possible value that the system
virtual time may get after a push-up. But the correctness of the
schedule breaks if next-in-service is still a non-eligible entity
when it is time to set in service the next entity. Unfortunately,
this may happen in the above corner case.

This commit fixes this problem by making
bfq_update_next_in_service invoke bfq_lookup_next_entity not only
if the above shorter path cannot be taken, but also if the shorter
path is taken but fails to yield an eligible next-in-service
entity.
Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksander Natalenko
---
 block/bfq-wf2q.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index eeaf326..add54f2 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -80,6 +80,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 {
 	struct bfq_entity *next_in_service = sd->next_in_service;
 	bool parent_sched_may_change = false;
+	bool change_without_lookup = false;
 
 	/*
 	 * If this update is triggered by the activation, requeueing
@@ -99,7 +100,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 		 * set to true, and left as true if
 		 * sd->next_in_service is NULL.
 		 */
-		bool replace_next = true;
+		change_without_lookup = true;
 
 		/*
 		 * If there is already a next_in_service candidate
@@ -112,7 +113,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 			struct bfq_service_tree *st =
 				sd->service_tree + new_entity_class_idx;
 
-			replace_next =
+			change_without_lookup =
 				(new_entity_class_idx ==
 				 bfq_class_idx(next_in_service)
 				 &&
@@ -122,15 +123,16 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 					new_entity->finish));
 		}
 
-		if (replace_next)
+		if (change_without_lookup)
 			next_in_service = new_entity;
-	} else /* invoked because of a deactivation: lookup needed */
+	}
+
+	if (!change_without_lookup) /* lookup needed */
 		next_in_service = bfq_lookup_next_entity(sd, expiration);
 
-	if (next_in_service) {
+	if (next_in_service)
 		parent_sched_may_change = !sd->next_in_service ||
 			bfq_update_parent_budget(next_in_service);
-	}
 
 	sd->next_in_service = next_in_service;
-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT 1/3] block, bfq: make lookup_next_entity push up vtime on expirations
To provide a very smooth service, bfq starts to serve a bfq_queue
only if the queue is 'eligible', i.e., if the same queue would
have started to be served in the ideal, perfectly fair system that
bfq simulates internally. This is obtained by associating each
queue with a virtual start time, and by computing a special system
virtual time quantity: a queue is eligible only if the system
virtual time has reached the virtual start time of the queue.
Finally, bfq guarantees that, when a new queue must be set in
service, there is always at least one eligible entity for each
active parent entity in the scheduler. To provide this guarantee,
the function __bfq_lookup_next_entity pushes up, for each parent
entity on which it is invoked, the system virtual time to the
minimum among the virtual start times of the entities in the
active tree for the parent entity (more precisely, the push-up
occurs if the system virtual time happens to be lower than all
such virtual start times).

There is however a circumstance in which __bfq_lookup_next_entity
cannot push up the system virtual time for a parent entity, even
if the system virtual time is lower than the virtual start times
of all the child entities in the active tree. It happens if one of
the child entities is in service. In fact, in such a case, there
is already an eligible entity, the in-service one, even if it may
not be present in the active tree (because in-service entities may
be removed from the active tree).

Unfortunately, in the last re-design of the
hierarchical-scheduling engine, the reset of the pointer to the
in-service entity for a given parent entity (a reset to be
performed as a consequence of the expiration of the in-service
entity) always happens after the function
__bfq_lookup_next_entity has been invoked. This causes the
function to think that there is still an entity in service for the
parent entity, and then that the system virtual time cannot be
pushed up, even if such a no-more-in-service entity has actually
already been properly reinserted into the active tree (or into
some other tree if no longer active). Yet the system virtual time
*had* to be pushed up, to be ready to correctly choose the next
queue to serve. Because of the lack of this push-up, bfq may
wrongly set in service a queue that had been speculatively
pre-computed as the possible next-in-service queue, but that would
no longer be the one to serve after the expiration and the
reinsertion into the active trees of the previously in-service
entities.

This commit addresses this issue by making
__bfq_lookup_next_entity properly push up the system virtual time
if an expiration is occurring.

Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksander Natalenko
---
 block/bfq-iosched.c |  4 ++--
 block/bfq-iosched.h |  4 ++--
 block/bfq-wf2q.c    | 58 +++--
 3 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 436b6ca..a10f147 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -720,7 +720,7 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 		entity->budget = new_budget;
 		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
 			     new_budget);
-		bfq_requeue_bfqq(bfqd, bfqq);
+		bfq_requeue_bfqq(bfqd, bfqq, false);
 	}
 }
 
@@ -2563,7 +2563,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 		bfq_del_bfqq_busy(bfqd, bfqq, true);
 	} else {
-		bfq_requeue_bfqq(bfqd, bfqq);
+		bfq_requeue_bfqq(bfqd, bfqq, true);
 		/*
 		 * Resort priority tree of potential close cooperators.
 		 */
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 859f0a8..3e2659c 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -817,7 +817,6 @@ extern const int bfq_timeout;
 struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
 void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
 struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
 			  struct rb_root *root);
@@ -917,7 +916,8 @@ void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
 void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			 bool ins_into_idle_tree, bool expiration);
 void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		      bool expiration);
 void bfq_del_bfqq_busy(struct bfq_data *bfq
[PATCH BUGFIX/IMPROVEMENT 2/3] block, bfq: remove direct switch to an entity in higher class
If the function bfq_update_next_in_service is invoked as a
consequence of the activation or requeueing of an entity, say E,
and finds out that E belongs to a higher-priority class than that
of the current next-in-service entity, then it sets
next_in_service directly to E. But this may lead to anomalous
schedules, because E may happen not to be eligible for service:
its virtual start time may be higher than the system virtual time
for its service tree.

This commit addresses this issue by simply removing this direct
switch.

Signed-off-by: Paolo Valente
Tested-by: Lee Tibbert
Tested-by: Oleksander Natalenko
---
 block/bfq-wf2q.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 138732e..eeaf326 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -86,9 +86,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 	 * or repositiong of an entity that does not coincide with
 	 * sd->next_in_service, then a full lookup in the active tree
 	 * can be avoided. In fact, it is enough to check whether the
-	 * just-modified entity has a higher priority than
-	 * sd->next_in_service, or, even if it has the same priority
-	 * as sd->next_in_service, is eligible and has a lower virtual
+	 * just-modified entity has the same priority as
+	 * sd->next_in_service, is eligible and has a lower virtual
 	 * finish time than sd->next_in_service. If this compound
 	 * condition holds, then the new entity becomes the new
 	 * next_in_service. Otherwise no change is needed.
@@ -104,9 +103,8 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 		/*
 		 * If there is already a next_in_service candidate
-		 * entity, then compare class priorities or timestamps
-		 * to decide whether to replace sd->service_tree with
-		 * new_entity.
+		 * entity, then compare timestamps to decide whether
+		 * to replace sd->service_tree with new_entity.
 		 */
 		if (next_in_service) {
 			unsigned int new_entity_class_idx =
@@ -114,10 +112,6 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 			struct bfq_service_tree *st =
 				sd->service_tree + new_entity_class_idx;
 
-			/*
-			 * For efficiency, evaluate the most likely
-			 * sub-condition first.
-			 */
 			replace_next =
 				(new_entity_class_idx ==
 				 bfq_class_idx(next_in_service)
@@ -125,10 +119,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd,
 				 !bfq_gt(new_entity->start, st->vtime)
 				 &&
 				 bfq_gt(next_in_service->finish,
-					new_entity->finish))
-				||
-				new_entity_class_idx <
-				bfq_class_idx(next_in_service);
+					new_entity->finish));
 		}
 
 		if (replace_next)
-- 
2.10.0
[PATCH BUGFIX/IMPROVEMENT 0/3] three bfq fixes restoring service guarantees with random sync writes in bg
Hi,
while testing the read-write unfairness issues reported by Mel, I
found BFQ failing to guarantee good responsiveness against heavy
random sync writes in the background, i.e., multiple writers doing
random writes and systematic fdatasync [1]. The failure was caused
by three related bugs, because of which BFQ failed to guarantee to
high-weight processes the expected fraction of the throughput.

The three patches in this series fix these bugs. These fixes
restore the usual BFQ service guarantees (and thus optimal
responsiveness too), against the above background workload and,
probably, against other similar workloads.

Thanks,
Paolo

[1] https://lkml.org/lkml/2017/8/9/957

Paolo Valente (3):
  block, bfq: make lookup_next_entity push up vtime on expirations
  block, bfq: remove direct switch to an entity in higher class
  block, bfq: guarantee update_next_in_service always returns an
    eligible entity

 block/bfq-iosched.c |  4 +--
 block/bfq-iosched.h |  4 +--
 block/bfq-wf2q.c    | 91 -
 3 files changed, 60 insertions(+), 39 deletions(-)

-- 
2.10.0
Re: [PATCH V3 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge()
On Wed, Aug 30, 2017 at 05:17:05PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> > -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> > -		struct request **merged_request)
> > +static bool __blk_mq_try_merge(struct request_queue *q,
> > +		struct bio *bio, struct request **merged_request,
> > +		struct request *candidate, enum elv_merge type)
> > {
> > -	struct request *rq;
> > +	struct request *rq = candidate;
> 
> It seems weird to me that the argument 'candidate' is not used other than to
> copy its value into the local variable 'rq'? How about removing that local

OK, will simply use 'rq' as parameter.

> variable and renaming the 'candidate' argument into 'rq'? Anyway, with or
> without that change:
> 
> Reviewed-by: Bart Van Assche

OK, thanks!

-- 
Ming
Re: [PATCH V3 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
On Wed, Aug 30, 2017 at 05:11:00PM +, Bart Van Assche wrote:
> On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately duirng this period, run queue from other contexts
>                 ^^
>                 during?

OK.

> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> [ ... ]
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 735e432294ab..4d7bea8c2594 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	struct request_queue *q = hctx->queue;
> >  	struct elevator_queue *e = q->elevator;
> >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> > -	bool do_sched_dispatch = true;
> >  	LIST_HEAD(rq_list);
> > 
> >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> 
> Shouldn't blk_mq_sched_dispatch_requests() set BLK_MQ_S_DISPATCH_BUSY just after
> the following statement because this statement makes the dispatch list empty?

Actually that is what I did in V1. I changed to this way because
setting the BUSY flag here will increase the race window a bit, for
example, if one request is added to ->dispatch just after it is
flushed(), the check on the BUSY bit won't catch this case. Then we
can avoid to check both the bit and list_empty_careful(&hctx->dispatch)
in blk_mq_sched_dispatch_requests(), so code becomes simpler and more
readable if we set the flag simply from the beginning.
> > 	list_splice_init(&hctx->dispatch, &rq_list);
> > @@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	 */
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> > -		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
> > -	} else if (!has_sched_dispatch && !q->queue_depth) {
> > +		blk_mq_dispatch_rq_list(q, &rq_list);
> > +
> > +		/*
> > +		 * We may clear DISPATCH_BUSY just after it
> > +		 * is set from another context, the only cost
> > +		 * is that one request is dequeued a bit early,
> > +		 * we can survive that. Given the window is
> > +		 * too small, no need to worry about performance
>		   ^^^
> The word "too" seems extraneous to me in this sentence.

Maybe 'extremely' is better, or just remove it?

> > bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> > @@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
> >  	 */
> >  	spin_lock(&hctx->lock);
> >  	list_add(&rq->queuelist, &hctx->dispatch);
> > +	set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> >  	spin_unlock(&hctx->lock);
> >  	return true;
> > }
> 
> Is it necessary to make blk_mq_sched_bypass_insert() set BLK_MQ_S_DISPATCH_BUSY?
> My understanding is that only code that makes the dispatch list empty should
> set BLK_MQ_S_DISPATCH_BUSY. However, blk_mq_sched_bypass_insert() adds an element
> to the dispatch list so that guarantees that that list is not empty.

I believe my above comment has explained it already.
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f063dd0f197f..6af56a71c1cd 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
> > 
> >  		spin_lock(&hctx->lock);
> >  		list_splice_init(list, &hctx->dispatch);
> > +		/*
> > +		 * DISPATCH_BUSY won't be cleared until all requests
> > +		 * in hctx->dispatch are dispatched successfully
> > +		 */
> > +		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
> >  		spin_unlock(&hctx->lock);
> 
> Same comment here - since this code adds one or more requests to the dispatch list,
> is it really needed to set the DISPATCH_BUSY flag?

See same comment above, :-)

-- 
Ming
Re: [PATCH V3 05/14] blk-mq-sched: improve dispatching from sw queue
On Wed, Aug 30, 2017 at 04:34:47PM +, Bart Van Assche wrote: > On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > > index 50c6485cb04f..7b7a366a97f3 100644 > > --- a/include/linux/blk-mq.h > > +++ b/include/linux/blk-mq.h > > @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx { > > > > struct sbitmap ctx_map; > > > > + struct blk_mq_ctx *dispatch_from; > > + > > struct blk_mq_ctx **ctxs; > > unsigned int nr_ctx; > > Hello Ming, > > Are you relying here on the fact that the per-CPU queues are never > reallocated, even if CPU hot-plugging occurs? Sorry but that seems fragile > to me. I would like to see 'dispatch_from' be converted into an integer. It > is easy to check whether an integer software queue index is out of range > but it's not that easy to check whether a pointer became invalid. If CPU hotplug happens, the instance of 'struct blk_mq_ctx' for that CPU is still there and its index won't change after setup, because its lifetime is the same as that of 'struct request_queue'; blk_mq_hctx_notify_dead() just flushes the sw queue when the CPU is going away. So we don't need to pay special attention to CPU hotplug here. Please let me know if this addresses your concern. -- Ming
Re: [PATCH V3 02/14] sbitmap: introduce __sbitmap_for_each_set()
On Wed, Aug 30, 2017 at 03:55:13PM +, Bart Van Assche wrote: > On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > > /** > > * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap. > > + * @start: Where to start the iteration > > Thanks for having changed the name of this argument ... > > > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn > > fn, > > - void *data) > > +static inline void __sbitmap_for_each_set(struct sbitmap *sb, > > + unsigned int start, > > + sb_for_each_fn fn, void *data) > > { > > - unsigned int i; > > + unsigned int index = SB_NR_TO_INDEX(sb, start); > > + unsigned int nr = SB_NR_TO_BIT(sb, start); > > + unsigned int scanned = 0; > > ... but I'm still missing a check here whether or not index >= sb->map_nr. The reason I didn't do that is that the only user always passes a correct 'start'. OK, will add it. > > > - for (i = 0; i < sb->map_nr; i++) { > > - struct sbitmap_word *word = &sb->map[i]; > > - unsigned int off, nr; > > + while (1) { > > + struct sbitmap_word *word = &sb->map[index]; > > + unsigned int depth = min_t(unsigned int, word->depth - nr, > > + sb->depth - scanned); > > > > + scanned += depth; > > if (!word->word) > > - continue; > > + goto next; > > > > - nr = 0; > > - off = i << sb->shift; > > + depth += nr; > > + start = index << sb->shift; > > The above statement reuses the argument 'start' for a new purpose. This is > confusing - please don't do this. Why not keep the name 'off'? That will > keep the changes minimal. OK, I will rename the parameter 'start' to 'off', so that we can save one local variable. -- Ming
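The wrap-around iteration under review can be sketched flat, without sbitmap's word-splitting. The helper below is a hypothetical simplification, not the kernel's __sbitmap_for_each_set(): it visits every set bit of a 'depth'-bit map starting at 'start', wraps around, and stops once 'depth' bits have been scanned or the callback asks to stop. One byte per bit keeps the wrap logic visible:

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*sb_for_each_fn)(unsigned int nr, void *data);

static void for_each_set_from(const unsigned char *bits, unsigned int depth,
                              unsigned int start, sb_for_each_fn fn, void *data)
{
    /* Range-check 'start' up front, as the review requests. */
    unsigned int nr = start % depth;

    for (unsigned int scanned = 0; scanned < depth; scanned++) {
        if (bits[nr] && !fn(nr, data))   /* callback returns false to stop */
            return;
        nr = (nr + 1) % depth;           /* wrap around the end of the map */
    }
}

/* Example callback that records the visit order through the data pointer. */
struct visit_log { unsigned int order[16]; unsigned int n; };

static bool log_bit(unsigned int nr, void *data)
{
    struct visit_log *log = data;
    log->order[log->n++] = nr;
    return true;
}
```

Starting at bit 3 of a map whose set bits are 0, 2, and 4, the visit order is 4, then 0, then 2, which is exactly the "start where we left off" behavior the sw-queue dispatch patch wants.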
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Wed, Aug 30, 2017 at 03:06:16PM -0700, Shaohua Li wrote: > On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote: > > On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote: > > > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote: > > > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote: > > > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote: > > > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote: > > > > > > > From: Shaohua Li > > > > > > > > > > > > > > Currently loop disables merge. While it makes sense for buffer IO > > > > > > > mode, > > > > > > > directio mode can benefit from request merge. Without merge, loop > > > > > > > could > > > > > > > send small size IO to underlayer disk and harm performance. > > > > > > > > > > > > Hi Shaohua, > > > > > > > > > > > > IMO no matter if merge is used, loop always sends page by page > > > > > > to VFS in both dio or buffer I/O. > > > > > > > > > > Why do you think so? > > > > > > > > do_blockdev_direct_IO() still handles page by page from iov_iter, and > > > > with bigger request, I guess it might be the plug merge working. > > > > > > This is not true. directio sends big size bio directly, not because of > > > plug > > > merge. Please at least check the code before you complain. > > > > I complain nothing, just try to understand the idea behind, > > never mind, :-) > > > > > > > > > > > > > > > > Also if merge is enabled on loop, that means merge is run > > > > > > on both loop and low level block driver, and not sure if we > > > > > > can benefit from that. > > > > > > > > > > why does merge still happen in low level block driver? > > > > > > > > Because scheduler is still working on low level disk. My question > > > > is that why the scheduler in low level disk doesn't work now > > > > if scheduler on loop can merge? 
> > > > > > The low level disk can still do merge, but since this is directio, the > > > upper > > > layer already dispatches request as big as possible. There is very little > > > chance the requests can be merged again. > > > > That is true, but these requests need to enter scheduler queue and > > be tried to merge again, even though it is less possible to succeed. > > Double merge may take extra CPU utilization. > > > > Looks it doesn't answer my question. > > > > Without this patch, the requests dispatched to loop won't be merged, > > so they may be small and their sectors may be continuous, my question > > is why dio bios converted from these small loop requests can't be > > merged in block layer when queuing these dio bios to low level device? > > loop thread doesn't have plug there. Even we have plug there, it's still a bad > idea to do the merge in low level layer. If we run direct_IO for every 4k, the > overhead is much much higher than bio merge. The direct_IO will call into fs > code, take different mutexes, metadata update for write and so on. OK, that looks making sense now. -- Ming
Re: [PATCH] kernfs: checking for IS_ERR() instead of NULL
On Wed, Aug 30, 2017 at 05:04:56PM +0300, Dan Carpenter wrote: > The kernfs_get_inode() returns NULL on error, it never returns error > pointers. > > Fixes: aa8188253474 ("kernfs: add exportfs operations") > Signed-off-by: Dan Carpenter Acked-by: Tejun Heo Greg, can you please route this patch? Thanks. -- tejun
Re: [RFC] block/loop: make loop cgroup aware
Hello, Shaohua. On Tue, Aug 29, 2017 at 10:07:04PM -0700, Shaohua Li wrote: > > The association is already coming from the page. We just need to make > > sure that going through loop driver doesn't get in the way of the > > membership getting propagated to the underlying device. > > I think there is confusion. App writes files in upper layer fs on loop. memcg Ah, yes, for some reason, I was still thinking about direct ios. > establish membership for the pages of these files. Then writeback does write, > loop then write these pages to under layer fs. The write is done in loop Yes. > thread. The write will allocate new page cache for under layer fs files. The > issue is we must forward memcg info from upper layer files page cache to under > layer files page cache. Ah, I see. We need to assign the ownership of the page cache pages to the original dirtier. Yeah, that's a different problem. For now, let's concentrate on the dio case. > > So, this looks like the loop driver is explicitly forwarding the > > association from the original issuer to the aio command and then from > > the aio command to the ios to the underlying device. I'm wondering > > whether the right way to do this is making aio forward the association > > by default, instead of the loop driver doing it explicitly. That way, > > all aio users can benefit from the forwarding instead of just loop. > > That's possible. The downside doing this in aio is we must audit all fs to > make > sure all bios have the association. I'd like to avoid doing this if there is no > other loop-like cgroup usage. But wouldn't that mean that we break cgroup membership for aios that users issue? Can't we do this in the generic aio layer? Thanks. -- tejun
About btrfs on top of wb_throttle
Hi, While playing with btrfs on top of wb_throttle, an interesting thing is that writeback from btrfs always falls into %rwb->wb_background even if there are no other writers. The peculiarity of btrfs is that it manages disks itself, so it creates a private bdi stored in sb->s_bdi, which is different from %queue->backing_dev_info. So running balance_dirty_pages() during btrfs's buffered writes doesn't take any effect on %queue->backing_dev_info->wb, and writeback gets put into the wb_background bucket all the time. I haven't measured performance numbers with a real test, but in theory this problem will make buffered writers spend more time in balance_dirty_pages()'s for(;;) loop. Chris, Jens, thoughts? Thanks, -liubo
Re: [PATCH] block/loop: fix use after free
On Wed, Aug 30, 2017 at 02:51:05PM -0700, Shaohua Li wrote: > lo_rw_aio->call_read_iter-> > 1 aops->direct_IO > 2 iov_iter_revert > lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could > be freed before 2, which accesses bvec. please ignore this one, I accidentally sent it out. The correct fix is in another patch. > This conflicts with my directio performance improvement patches, which > I'll resend. > > Signed-off-by: Shaohua Li > --- > drivers/block/loop.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index ef83349..153ab3c 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -490,6 +490,7 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, > bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > iov_iter_bvec(&iter, ITER_BVEC | rw, bvec, > bio_segments(bio), blk_rq_bytes(cmd->rq)); > + bio_inc_remaining(bio); > /* >* This bio may be started from the middle of the 'bvec' >* because of bio splitting, so offset from the bvec must > @@ -507,6 +508,7 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, > else > ret = call_read_iter(file, &cmd->iocb, &iter); > > + bio_endio(bio); > if (ret != -EIOCBQUEUED) > cmd->iocb.ki_complete(&cmd->iocb, ret, 0); > return 0; > -- > 2.9.5 >
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Wed, Aug 30, 2017 at 02:43:40PM +0800, Ming Lei wrote: > On Tue, Aug 29, 2017 at 09:43:20PM -0700, Shaohua Li wrote: > > On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote: > > > On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote: > > > > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote: > > > > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote: > > > > > > From: Shaohua Li > > > > > > > > > > > > Currently loop disables merge. While it makes sense for buffer IO > > > > > > mode, > > > > > > directio mode can benefit from request merge. Without merge, loop > > > > > > could > > > > > > send small size IO to underlayer disk and harm performance. > > > > > > > > > > Hi Shaohua, > > > > > > > > > > IMO no matter if merge is used, loop always sends page by page > > > > > to VFS in both dio or buffer I/O. > > > > > > > > Why do you think so? > > > > > > do_blockdev_direct_IO() still handles page by page from iov_iter, and > > > with bigger request, I guess it might be the plug merge working. > > > > This is not true. directio sends big size bio directly, not because of plug > > merge. Please at least check the code before you complain. > > I complain nothing, just try to understand the idea behind, > never mind, :-) > > > > > > > > > > > > Also if merge is enabled on loop, that means merge is run > > > > > on both loop and low level block driver, and not sure if we > > > > > can benefit from that. > > > > > > > > why does merge still happen in low level block driver? > > > > > > Because scheduler is still working on low level disk. My question > > > is that why the scheduler in low level disk doesn't work now > > > if scheduler on loop can merge? > > > > The low level disk can still do merge, but since this is directio, the upper > > layer already dispatches request as big as possible. There is very little > > chance the requests can be merged again. 
> > That is true, but these requests need to enter scheduler queue and > be tried to merge again, even though it is less possible to succeed. > Double merge may take extra CPU utilization. > > Looks it doesn't answer my question. > > Without this patch, the requests dispatched to loop won't be merged, > so they may be small and their sectors may be continuous, my question > is why dio bios converted from these small loop requests can't be > merged in block layer when queuing these dio bios to low level device? loop thread doesn't have plug there. Even we have plug there, it's still a bad idea to do the merge in low level layer. If we run direct_IO for every 4k, the overhead is much much higher than bio merge. The direct_IO will call into fs code, take different mutexes, metadata update for write and so on.
[PATCH] block/loop: fix use after free
lo_rw_aio->call_read_iter-> 1 aops->direct_IO 2 iov_iter_revert lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could be freed before 2, which accesses bvec. This conflicts with my directio performance improvement patches, which I'll resend. Signed-off-by: Shaohua Li --- drivers/block/loop.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index ef83349..153ab3c 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -490,6 +490,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); iov_iter_bvec(&iter, ITER_BVEC | rw, bvec, bio_segments(bio), blk_rq_bytes(cmd->rq)); + bio_inc_remaining(bio); /* * This bio may be started from the middle of the 'bvec' * because of bio splitting, so offset from the bvec must @@ -507,6 +508,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, else ret = call_read_iter(file, &cmd->iocb, &iter); + bio_endio(bio); if (ret != -EIOCBQUEUED) cmd->iocb.ki_complete(&cmd->iocb, ret, 0); return 0; -- 2.9.5
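The shape of this fix is a general pattern: take an extra completion reference before kicking off work that may finish asynchronously, so the object cannot be freed while the submitter is still touching it. The userspace model below uses only hypothetical names (a plain counter stands in for bio->__bi_remaining, and a flag stands in for the bio actually being freed); the completion runs immediately, which is the worst case for the original bug:

```c
#include <assert.h>
#include <stdbool.h>

struct fake_bio {
    int remaining;   /* what bio_inc_remaining() bumps in the patch */
    bool freed;      /* stand-in for the bio actually being freed */
};

static void bio_ref_get(struct fake_bio *b) { b->remaining++; }

static void bio_ref_put(struct fake_bio *b)
{
    if (--b->remaining == 0)
        b->freed = true;   /* last reference dropped: bio goes away */
}

/* The completion callback may fire at any point after submission. */
static void submit(struct fake_bio *b, void (*complete)(struct fake_bio *))
{
    bio_ref_get(b);     /* the added bio_inc_remaining() */
    complete(b);        /* async completion racing with the submitter */
    assert(!b->freed);  /* iov_iter_revert() can still read the bvec */
    bio_ref_put(b);     /* the added bio_endio() */
}
```

Without the get/put pair, the immediate completion would drop the last reference inside submit() and the later bvec access would be a use after free.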
Re: [GIT PULL] nvme update for Linux 4.14, take 2
>> That would mean that I need to open-code the tagset iteration in nvme, which does not feel like something a driver should do. > How about renaming blk_mq_reinit_tagset() into blk_mq_tagset_iter() and making the argument list of blk_mq_tagset_iter() more similar to that of blk_mq_queue_tag_busy_iter(), such that callers of blk_mq_tagset_iter() can pass a pointer to any structure through the @priv argument? That would make this function more general and maybe also more useful to other block drivers. We could do that. But it feels like going head over heels just to keep a change titled "blk-mq: Make blk_mq_reinit_tagset() calls easier to read", whose motivation I'm not exactly sure I share. Also, I kinda liked the symmetry of the init/exit/reinit_request calling convention. But, if you absolutely think it's necessary to keep the change, we can add an "all tags" iterator and use that to implement reinit_request in nvme. Christoph? Jens? Verdict before we go forward here?
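The iterator-with-@priv shape being proposed is easy to sketch; the signature below is hypothetical, not the eventual blk_mq_tagset_iter() API. The walk hands each tag plus an opaque pointer to a caller-supplied callback, and a nonzero return from the callback aborts the walk and is propagated:

```c
#include <assert.h>
#include <stddef.h>

typedef int (*tag_iter_fn)(void *priv, int tag);

static int tagset_iter(const int *tags, size_t nr_tags,
                       tag_iter_fn fn, void *priv)
{
    for (size_t i = 0; i < nr_tags; i++) {
        int ret = fn(priv, tags[i]);
        if (ret)
            return ret;   /* callback aborted the walk */
    }
    return 0;
}

/* Example callback: per-caller state travels through @priv. */
static int sum_tags(void *priv, int tag)
{
    *(int *)priv += tag;
    return 0;   /* keep iterating */
}
```

The point of the @priv argument is exactly what the callback shows: any structure can ride along without the iterator knowing its type, which is what would let nvme implement reinit_request on top of a generic walk.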
Re: [PATCH 5/6] platform/x86: make device_attribute const
On Mon, Aug 21, 2017 at 2:43 PM, Bhumika Goyal wrote: > Make these const as they are only passed as an argument to the > function device_create_file and device_remove_file and the corresponding > arguments are of type const. > Done using Coccinelle > Split on per driver basis. > Signed-off-by: Bhumika Goyal > --- > drivers/platform/x86/classmate-laptop.c | 6 +++--- > drivers/platform/x86/intel-rst.c| 4 ++-- > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/drivers/platform/x86/classmate-laptop.c > b/drivers/platform/x86/classmate-laptop.c > index 55cf10b..d3715e2 100644 > --- a/drivers/platform/x86/classmate-laptop.c > +++ b/drivers/platform/x86/classmate-laptop.c > @@ -254,7 +254,7 @@ static ssize_t cmpc_accel_sensitivity_store_v4(struct > device *dev, > return strnlen(buf, count); > } > > -static struct device_attribute cmpc_accel_sensitivity_attr_v4 = { > +static const struct device_attribute cmpc_accel_sensitivity_attr_v4 = { > .attr = { .name = "sensitivity", .mode = 0660 }, > .show = cmpc_accel_sensitivity_show_v4, > .store = cmpc_accel_sensitivity_store_v4 > @@ -303,7 +303,7 @@ static ssize_t cmpc_accel_g_select_store_v4(struct device > *dev, > return strnlen(buf, count); > } > > -static struct device_attribute cmpc_accel_g_select_attr_v4 = { > +static const struct device_attribute cmpc_accel_g_select_attr_v4 = { > .attr = { .name = "g_select", .mode = 0660 }, > .show = cmpc_accel_g_select_show_v4, > .store = cmpc_accel_g_select_store_v4 > @@ -599,7 +599,7 @@ static ssize_t cmpc_accel_sensitivity_store(struct device > *dev, > return strnlen(buf, count); > } > > -static struct device_attribute cmpc_accel_sensitivity_attr = { > +static const struct device_attribute cmpc_accel_sensitivity_attr = { > .attr = { .name = "sensitivity", .mode = 0660 }, > .show = cmpc_accel_sensitivity_show, > .store = cmpc_accel_sensitivity_store > diff --git a/drivers/platform/x86/intel-rst.c > b/drivers/platform/x86/intel-rst.c > index 7344d84..760a9bf 100644 > 
--- a/drivers/platform/x86/intel-rst.c > +++ b/drivers/platform/x86/intel-rst.c > @@ -65,7 +65,7 @@ static ssize_t irst_store_wakeup_events(struct device *dev, > return count; > } > > -static struct device_attribute irst_wakeup_attr = { > +static const struct device_attribute irst_wakeup_attr = { > .attr = { .name = "wakeup_events", .mode = 0600 }, > .show = irst_show_wakeup_events, > .store = irst_store_wakeup_events > @@ -111,7 +111,7 @@ static ssize_t irst_store_wakeup_time(struct device *dev, > return count; > } > > -static struct device_attribute irst_timeout_attr = { > +static const struct device_attribute irst_timeout_attr = { > .attr = { .name = "wakeup_time", .mode = 0600 }, > .show = irst_show_wakeup_time, > .store = irst_store_wakeup_time > -- > 1.9.1 > -- With Best Regards, Andy Shevchenko
[PATCH 3/3] block/loop: suppress discard IO error message
We don't know whether fallocate really supports FALLOC_FL_PUNCH_HOLE until fallocate is called. If it isn't supported, loop will return -EOPNOTSUPP and we see a lot of error messages printed by blk_update_request. Failure of a discard IO isn't a problem, so just return 0 in loop, which will suppress the IO error messages. Signed-off-by: Shaohua Li --- drivers/block/loop.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index a30aa45..15f51e3 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -441,6 +441,9 @@ static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos) ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq)); if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP)) ret = -EIO; + + if (req_op(rq) == REQ_OP_DISCARD && ret == -EOPNOTSUPP) + ret = 0; out: return ret; } -- 2.9.5
[PATCH 0/3]block/loop: handle discard/zeroout error
Fix some problems seen when setting up a loop device with a block device as the backing file and creating/mounting ext4 in the loop device. BTW: blkdev_issue_zeroout retries if we immediately find the device doesn't support zeroout, but it doesn't retry if submit_bio_wait returns -EOPNOTSUPP. Is this correct behavior? Thanks, Shaohua Shaohua Li (3): block/loop: don't hijack error number block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES block/loop: suppress discard IO error message drivers/block/loop.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) -- 2.9.5
[PATCH 1/3] block/loop: don't hijack error number
If the bio returns -EOPNOTSUPP, we shouldn't hijack it and return -EIO Signed-off-by: Shaohua Li --- drivers/block/loop.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index ef83349..054dccc 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -464,7 +464,7 @@ static void lo_complete_rq(struct request *rq) zero_fill_bio(bio); } - blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK); + blk_mq_end_request(rq, errno_to_blk_status(cmd->ret)); } static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2) @@ -1725,7 +1725,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd) failed: /* complete non-aio request */ if (!cmd->use_aio || ret) { - cmd->ret = ret ? -EIO : 0; + cmd->ret = ret; blk_mq_complete_request(cmd->rq); } } -- 2.9.5
[PATCH 2/3] block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES
REQ_OP_WRITE_ZEROES really means zeroing the data. And in blkdev_fallocate, FALLOC_FL_ZERO_RANGE will retry while FALLOC_FL_PUNCH_HOLE will not, even when the loop request doesn't have BLKDEV_ZERO_NOFALLBACK set. Signed-off-by: Shaohua Li --- drivers/block/loop.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 054dccc..a30aa45 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -430,6 +430,9 @@ static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos) int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE; int ret; + if (req_op(rq) == REQ_OP_WRITE_ZEROES) + mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE; + if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) { ret = -EOPNOTSUPP; goto out; -- 2.9.5
[PATCH 5/5] bfq: Use icq_to_bic() consistently
Some code uses icq_to_bic() to convert an io_cq pointer to a bfq_io_cq pointer while other code uses a direct cast. Convert the code that uses a direct cast such that it uses icq_to_bic(). Signed-off-by: Bart Van Assche Cc: Paolo Valente --- block/bfq-iosched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index d375b29ad085..7d1b85e393b2 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -239,7 +239,7 @@ static int T_slow[2]; static int T_fast[2]; static int device_speed_thresh[2]; -#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0]) +#define RQ_BIC(rq) icq_to_bic((rq)->elv.priv[0]) #define RQ_BFQQ(rq)((rq)->elv.priv[1]) struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync) -- 2.14.1
[PATCH 4/5] bfq: Suppress compiler warnings about comparisons
This patch avoids that the following warnings are reported when building with W=1: block/bfq-iosched.c: In function 'bfq_back_seek_max_store': block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION' STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0); ^~ block/bfq-iosched.c: In function 'bfq_slice_idle_store': block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION' STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2); ^~ block/bfq-iosched.c: In function 'bfq_slice_idle_us_store': block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION' USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0, ^~~ Signed-off-by: Bart Van Assche Cc: Paolo Valente --- block/bfq-iosched.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index cf92f16eb5f2..d375b29ad085 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -4851,16 +4851,16 @@ static ssize_t \ __FUNC(struct elevator_queue *e, const char *page, size_t count) \ { \ struct bfq_data *bfqd = e->elevator_data; \ - unsigned long __data; \ + unsigned long __data, __min = (MIN), __max = (MAX); \ int ret;\ \ ret = bfq_var_store(&__data, (page)); \ if (ret)\ return ret; \ - if (__data < (MIN)) \ - __data = (MIN); \ - else if (__data > (MAX))\ - __data = (MAX); \ + if (__data < __min) \ + __data = __min; \ + else if (__data > __max)\ + __data = __max; \ if (__CONV == 1)\ *(__PTR) = msecs_to_jiffies(__data);\ else if (__CONV == 2) \ @@ -4883,16 +4883,16 @@ 
STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2); static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\ { \ struct bfq_data *bfqd = e->elevator_data; \ - unsigned long __data; \ + unsigned long __data, __min = (MIN), __max = (MAX); \ int ret;\ \ ret = bfq_var_store(&__data, (page)); \ if (ret)\ return ret; \ - if (__data < (MIN)) \ - __data = (MIN); \ - else if (__data > (MAX))\ - __data = (MAX); \ + if (__data < __min) \ + __data = __min; \ + else if (__data > __max)\ + __data = __max; \ *(__PTR) = (u64)__data * NSEC_PER_USEC; \ return count; \ } -- 2.14.1
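The trick the patch uses generalizes: copying a macro's MIN/MAX arguments into locals of the clamped value's own type turns the comparison into one between like-typed expressions, so gcc's -Wtype-limits no longer flags "comparison of unsigned expression < 0 is always false" when MIN is the literal 0. A minimal standalone clamp macro in the same style (the name is made up for this sketch):

```c
#include <assert.h>

#define CLAMP_ULONG(var, MIN, MAX)                         \
    do {                                                   \
        /* snapshot MIN/MAX as unsigned long, like the     \
         * patch's __min/__max locals */                   \
        unsigned long __min = (MIN), __max = (MAX);        \
        if ((var) < __min)                                 \
            (var) = __min;                                 \
        else if ((var) > __max)                            \
            (var) = __max;                                 \
    } while (0)
```

With the literal 0 passed directly, `(var) < 0` on an unsigned variable is what triggers the warning; `(var) < __min` with `__min == 0` is the same runtime behavior but compares two unsigned longs, which the warning pass leaves alone.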
[PATCH 3/5] bfq: Check kstrtoul() return value
Make sysfs writes fail for invalid numbers instead of storing uninitialized data copied from the stack. This patch removes all uninitialized_var() occurrences from the BFQ source code. Signed-off-by: Bart Van Assche Cc: Paolo Valente --- block/bfq-iosched.c | 52 +--- 1 file changed, 37 insertions(+), 15 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 8c11c2e827a5..cf92f16eb5f2 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -4802,13 +4802,15 @@ static ssize_t bfq_var_show(unsigned int var, char *page) return sprintf(page, "%u\n", var); } -static void bfq_var_store(unsigned long *var, const char *page) +static int bfq_var_store(unsigned long *var, const char *page) { unsigned long new_val; int ret = kstrtoul(page, 10, &new_val); - if (ret == 0) - *var = new_val; + if (ret) + return ret; + *var = new_val; + return 0; } #define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ @@ -4849,8 +4851,12 @@ static ssize_t \ __FUNC(struct elevator_queue *e, const char *page, size_t count) \ { \ struct bfq_data *bfqd = e->elevator_data; \ - unsigned long uninitialized_var(__data);\ - bfq_var_store(&__data, (page)); \ + unsigned long __data; \ + int ret;\ + \ + ret = bfq_var_store(&__data, (page)); \ + if (ret)\ + return ret; \ if (__data < (MIN)) \ __data = (MIN); \ else if (__data > (MAX))\ @@ -4877,8 +4883,12 @@ STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2); static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\ { \ struct bfq_data *bfqd = e->elevator_data; \ - unsigned long uninitialized_var(__data);\ - bfq_var_store(&__data, (page)); \ + unsigned long __data; \ + int ret;\ + \ + ret = bfq_var_store(&__data, (page)); \ + if (ret)\ + return ret; \ if (__data < (MIN)) \ __data = (MIN); \ else if (__data > (MAX))\ @@ -4894,9 +4904,12 @@ static ssize_t bfq_max_budget_store(struct elevator_queue *e, const char *page, size_t count) { struct bfq_data *bfqd = e->elevator_data; - unsigned long 
uninitialized_var(__data); + unsigned long __data; + int ret; - bfq_var_store(&__data, (page)); + ret = bfq_var_store(&__data, (page)); + if (ret) + return ret; if (__data == 0) bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd); @@ -4919,9 +4932,12 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, const char *page, size_t count) { struct bfq_data *bfqd = e->elevator_data; - unsigned long uninitialized_var(__data); + unsigned long __data; + int ret; - bfq_var_store(&__data, (page)); + ret = bfq_var_store(&__data, (page)); + if (ret) + return ret; if (__data < 1) __data = 1; @@ -4939,9 +4955,12 @@ static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e, const char *page, size_t count) { struct bfq_data *bfqd = e->elevator_data; - unsigned long uninitialized_var(__data); + unsigned long __data; + int ret; - bfq_var_store(&__data, (page)); + ret = bfq_var_store(&__data, (page)); + if (ret) + return ret; if (__data > 1) __data = 1; @@ -4958,9 +4977,12 @@ static ssize_t bfq_low_latency_store(struct elevator_queue *e,
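The behavioral change of the fixed bfq_var_store() can be shown with a userspace analogue. kstrtoul has no userspace twin, so strtoul plus manual checks stands in here (the validation details are this sketch's assumptions, not BFQ code); the point is that a parse failure is now returned to the caller and *var is left untouched, instead of the store silently "succeeding" with garbage:

```c
#include <errno.h>
#include <stdlib.h>

static int var_store(unsigned long *var, const char *page)
{
    char *end;
    unsigned long new_val;

    errno = 0;
    new_val = strtoul(page, &end, 10);
    /* reject overflow, empty input, and trailing junk
     * (sysfs pages may end with a newline, so allow that) */
    if (errno || end == page || (*end != '\0' && *end != '\n'))
        return -EINVAL;
    *var = new_val;
    return 0;
}
```

A sysfs store built on this can simply propagate the nonzero return, which is exactly what the patched STORE_FUNCTION macros do with `if (ret) return ret;`.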
[PATCH 2/5] bfq: Declare local functions static
Signed-off-by: Bart Van Assche Cc: Paolo Valente --- block/bfq-cgroup.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 78b2e0db4fb2..ceefb9a706d6 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -206,7 +206,7 @@ static void bfqg_get(struct bfq_group *bfqg) bfqg->ref++; } -void bfqg_put(struct bfq_group *bfqg) +static void bfqg_put(struct bfq_group *bfqg) { bfqg->ref--; @@ -385,7 +385,7 @@ static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg) return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq)); } -struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp) +static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp) { struct bfq_group_data *bgd; @@ -395,7 +395,7 @@ struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp) return &bgd->pd; } -void bfq_cpd_init(struct blkcg_policy_data *cpd) +static void bfq_cpd_init(struct blkcg_policy_data *cpd) { struct bfq_group_data *d = cpd_to_bfqgd(cpd); @@ -403,12 +403,12 @@ void bfq_cpd_init(struct blkcg_policy_data *cpd) CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL; } -void bfq_cpd_free(struct blkcg_policy_data *cpd) +static void bfq_cpd_free(struct blkcg_policy_data *cpd) { kfree(cpd_to_bfqgd(cpd)); } -struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) +static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) { struct bfq_group *bfqg; @@ -426,7 +426,7 @@ struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) return &bfqg->pd; } -void bfq_pd_init(struct blkg_policy_data *pd) +static void bfq_pd_init(struct blkg_policy_data *pd) { struct blkcg_gq *blkg = pd_to_blkg(pd); struct bfq_group *bfqg = blkg_to_bfqg(blkg); @@ -445,7 +445,7 @@ void bfq_pd_init(struct blkg_policy_data *pd) bfqg->rq_pos_tree = RB_ROOT; } -void bfq_pd_free(struct blkg_policy_data *pd) +static void bfq_pd_free(struct blkg_policy_data *pd) { struct bfq_group *bfqg = pd_to_bfqg(pd); @@ -453,7 +453,7 @@ void bfq_pd_free(struct blkg_policy_data *pd) 
bfqg_put(bfqg); } -void bfq_pd_reset_stats(struct blkg_policy_data *pd) +static void bfq_pd_reset_stats(struct blkg_policy_data *pd) { struct bfq_group *bfqg = pd_to_bfqg(pd); @@ -740,7 +740,7 @@ static void bfq_reparent_active_entities(struct bfq_data *bfqd, * blkio already grabs the queue_lock for us, so no need to use * RCU-based magic */ -void bfq_pd_offline(struct blkg_policy_data *pd) +static void bfq_pd_offline(struct blkg_policy_data *pd) { struct bfq_service_tree *st; struct bfq_group *bfqg = pd_to_bfqg(pd); -- 2.14.1
[PATCH 1/5] bfq: Annotate fall-through in a switch statement
This patch prevents gcc 7 from issuing a warning about fall-through when building with W=1. Signed-off-by: Bart Van Assche Cc: Paolo Valente --- block/bfq-iosched.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 6a7a26b6cec1..8c11c2e827a5 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -3780,6 +3780,7 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic) default: dev_err(bfqq->bfqd->queue->backing_dev_info->dev, "bfq: bad prio class %d\n", ioprio_class); + /* fall through */ case IOPRIO_CLASS_NONE: /* * No prio set, inherit CPU scheduling settings. -- 2.14.1
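A toy version of the annotated switch (the constants and flag values are hypothetical): the explicit "fall through" comment on the default label tells gcc 7's -Wimplicit-fallthrough that the missing break before the next case is intentional, silencing the W=1 warning without changing behavior:

```c
#include <assert.h>

enum { CLASS_NONE, CLASS_RT, CLASS_BAD = 99 };

static int classify(int prio_class)
{
    int flags = 0;

    switch (prio_class) {
    default:
        flags |= 1;         /* like the dev_err(): report the bad class... */
        /* fall through */  /* ...then deliberately treat it as CLASS_NONE */
    case CLASS_NONE:
        flags |= 2;
        break;
    case CLASS_RT:
        flags |= 4;
        break;
    }
    return flags;
}
```

An unknown class gets both the "reported" and the "treated as none" flags, exactly the error-then-default behavior the annotated kernel switch preserves.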
[PATCH 0/5] Five BFQ patches for kernel v4.14
Hello Jens, This series consists of five small patches. Three patches address compiler warnings, one patch is a fix for the BFQ sysfs interface and a fifth patch makes the BFQ code slightly more consistent. Please consider these patches for kernel v4.14. Thanks, Bart. Bart Van Assche (5): bfq: Annotate fall-through in a switch statement bfq: Declare local functions static bfq: Check kstrtoul() return value bfq: Suppress compiler warnings about comparisons bfq: Use icq_to_bic() consistently block/bfq-cgroup.c | 18 +++--- block/bfq-iosched.c | 71 +++-- 2 files changed, 56 insertions(+), 33 deletions(-) -- 2.14.1
Re: [PATCH V3 13/14] blk-mq-sched: refactor blk_mq_sched_try_merge()
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > -bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, > - struct request **merged_request) > +static bool __blk_mq_try_merge(struct request_queue *q, > + struct bio *bio, struct request **merged_request, > + struct request *candidate, enum elv_merge type) > { > - struct request *rq; > + struct request *rq = candidate; It seems weird to me that the argument 'candidate' is not used other than to copy its value into the local variable 'rq'. How about removing that local variable and renaming the 'candidate' argument to 'rq'? Anyway, with or without that change: Reviewed-by: Bart Van Assche
Re: [PATCH V3 06/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > During dispatching, we moved all requests from hctx->dispatch to > one temporary list, then dispatch them one by one from this list. > Unfortunately duirng this period, run queue from other contexts ^^ during? > may think the queue is idle, then start to dequeue from sw/scheduler > queue and still try to dispatch because ->dispatch is empty. This way > hurts sequential I/O performance because requests are dequeued when > lld queue is busy. > [ ... ] > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 735e432294ab..4d7bea8c2594 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx > *hctx) > struct request_queue *q = hctx->queue; > struct elevator_queue *e = q->elevator; > const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request; > - bool do_sched_dispatch = true; > LIST_HEAD(rq_list); > > /* RCU or SRCU read lock is needed before checking quiesced flag */ Shouldn't blk_mq_sched_dispatch_requests() set BLK_MQ_S_DISPATCH_BUSY just after the following statement because this statement makes the dispatch list empty? list_splice_init(&hctx->dispatch, &rq_list); > @@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx > *hctx) >*/ > if (!list_empty(&rq_list)) { > blk_mq_sched_mark_restart_hctx(hctx); > - do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list); > - } else if (!has_sched_dispatch && !q->queue_depth) { > + blk_mq_dispatch_rq_list(q, &rq_list); > + > + /* > + * We may clear DISPATCH_BUSY just after it > + * is set from another context, the only cost > + * is that one request is dequeued a bit early, > + * we can survive that. Given the window is > + * too small, no need to worry about performance ^^^ The word "too" seems extraneous to me in this sentence. 
> bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, > @@ -330,6 +353,7 @@ static bool blk_mq_sched_bypass_insert(struct > blk_mq_hw_ctx *hctx, >*/ > spin_lock(&hctx->lock); > list_add(&rq->queuelist, &hctx->dispatch); > + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > spin_unlock(&hctx->lock); > return true; > } Is it necessary to make blk_mq_sched_bypass_insert() set BLK_MQ_S_DISPATCH_BUSY? My understanding is that only code that makes the dispatch list empty should set BLK_MQ_S_DISPATCH_BUSY. However, blk_mq_sched_bypass_insert() adds an element to the dispatch list so that guarantees that that list is not empty. > diff --git a/block/blk-mq.c b/block/blk-mq.c > index f063dd0f197f..6af56a71c1cd 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1140,6 +1140,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, > struct list_head *list) > > spin_lock(&hctx->lock); > list_splice_init(list, &hctx->dispatch); > + /* > + * DISPATCH_BUSY won't be cleared until all requests > + * in hctx->dispatch are dispatched successfully > + */ > + set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state); > spin_unlock(&hctx->lock); Same comment here - since this code adds one or more requests to the dispatch list, is it really needed to set the DISPATCH_BUSY flag? Bart.
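The invariant under debate above can be pictured with a small userspace sketch: a DISPATCH_BUSY-style flag that mirrors "requests are parked on the dispatch list" — set when requests are spliced onto the list, cleared only once the list has been fully drained. Plain ints stand in for the kernel's atomic set_bit()/clear_bit(), and all names here are illustrative rather than the real blk-mq structures:

```c
#include <assert.h>

struct hctx {
	int dispatch_len;	/* number of parked requests */
	int busy;		/* the DISPATCH_BUSY-style flag */
};

/* Splice n requests onto the dispatch list: the list is now non-empty,
 * so the busy flag is raised (this is the discipline the patch uses). */
void park_requests(struct hctx *h, int n)
{
	h->dispatch_len += n;
	h->busy = 1;
}

/* Drain up to 'budget' requests; the flag is cleared only when the
 * dispatch list has been emptied, never while requests remain parked. */
int drain_dispatch(struct hctx *h, int budget)
{
	int done = budget < h->dispatch_len ? budget : h->dispatch_len;

	h->dispatch_len -= done;
	if (h->dispatch_len == 0)
		h->busy = 0;
	return done;
}
```

Bart's question is about the other direction: whether code that only *adds* to a non-empty list (like park_requests() here) needs to touch the flag at all, given that an adder guarantees the list is non-empty.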
Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch
On Wed, Aug 30, 2017 at 09:51:31AM -0600, Jens Axboe wrote: > On 08/30/2017 09:39 AM, Ming Lei wrote: > > On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote: > >> On 08/30/2017 09:19 AM, Ming Lei wrote: > >>> It is more reasonable to add requests to ->dispatch in way > >>> of FIFO style, instead of LIFO style. > >>> > >>> Also in this way, we can allow to insert request at the front > >>> of hw queue, which function is needed to fix one bug > >>> in blk-mq's implementation of blk_execute_rq() > >>> > >>> Reported-by: Oleksandr Natalenko > >>> Tested-by: Oleksandr Natalenko > >>> Signed-off-by: Ming Lei > >>> --- > >>> block/blk-mq-sched.c | 2 +- > >>> block/blk-mq.c | 2 +- > >>> 2 files changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >>> index 4ab69435708c..8d97df40fc28 100644 > >>> --- a/block/blk-mq-sched.c > >>> +++ b/block/blk-mq-sched.c > >>> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct > >>> blk_mq_hw_ctx *hctx, > >>>* the dispatch list. > >>>*/ > >>> spin_lock(&hctx->lock); > >>> - list_add(&rq->queuelist, &hctx->dispatch); > >>> + list_add_tail(&rq->queuelist, &hctx->dispatch); > >>> spin_unlock(&hctx->lock); > >>> return true; > >>> } > >>> diff --git a/block/blk-mq.c b/block/blk-mq.c > >>> index 4603b115e234..fed3d0c16266 100644 > >>> --- a/block/blk-mq.c > >>> +++ b/block/blk-mq.c > >>> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue > >>> *q, struct list_head *list) > >>> blk_mq_put_driver_tag(rq); > >>> > >>> spin_lock(&hctx->lock); > >>> - list_splice_init(list, &hctx->dispatch); > >>> + list_splice_tail_init(list, &hctx->dispatch); > >>> spin_unlock(&hctx->lock); > >> > >> I'm not convinced this is safe, there's actually a reason why the > >> request is added to the front and not the back. 
We do have > >> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of > > > > reorder_tags_to_front() is for reordering the requests in current list, > > this patch is for splicing list into hctx->dispatch, so I can't see > > it isn't safe, or could you explain it a bit? > > If we can get the ordering right, then down the line we won't need to > have the tags reordering at all. It's an ugly hack that I'd love to see > go away. If reorder_tags_to_front() isn't removed, this patch is still safe. But blk_execute_rq_nowait() need to add one request in the front of hw queue, that can be a contradiction compared with maintaining a perfect order for removing reorder_tags_to_front(). So could you share your opinion on the 2nd patch for fixing blk_execute_rq_nowait()? > > >> that than make this change. > >> > >> What's your reasoning here? Your changelog doesn't really explain why > > > > Firstly the 2nd patch need to add one rq(such as RQF_PM) to the > > front of the hw queue, the simple way is to add it to the front > > of hctx->dispatch. Without this change, the 2nd patch can't work > > at all. > > > > Secondly this way is still reasonable: > > > > - one rq is added to hctx->dispatch because queue is busy > > - another rq is added to hctx->dispatch too because of same reason > > > > so it is reasonable to to add list into hctx->dispatch in FIFO style. > > Not disagreeing with the logic. But it also begs the question of why we > don't apply the same treatment to when we splice leftovers to the > dispatch list, currently we front splice that. > > All I'm saying is that you need to tread very carefully with this, and > throw it through some careful testing to ensure that we don't introduce > conditions that now livelock. NVMe is the easy test case, that will Yes, ->dispatch is far away from NVMe, but friends of SCSI-MQ. > generally always work since we never run out of tags. 
The problematic > test case is usually things like SATA with 31 tags, and especially SATA > with flushes that don't queue. One good test case is the one where you > end up having all tags (or almost all) consumed by flushes, and still > ensuring that we're making forward progress. Understood! We can even make the test more aggressive. I just set up one virtio-scsi device by changing both 'can_queue' and 'cmd_per_lun' to 1, which means the hw queue depth is 1, and the hw queue number is set to 1. Then I ran 'dbench -t 30 -s -F 64' on ext4 on top of this virtio-scsi device. The dbench run (64 jobs, sync write, fsync) just works fine with this patch applied. -- Ming
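The LIFO-vs-FIFO distinction this thread turns on reduces to list_add() versus list_add_tail(). The sketch below re-implements just enough of the kernel's circular doubly-linked list in userspace to show which request surfaces first at the head of the dispatch list; the names mirror the kernel API, but this is not the kernel header, and struct rq is a toy stand-in for struct request:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void __list_add(struct list_head *n, struct list_head *prev,
		       struct list_head *next)
{
	next->prev = n; n->next = next; n->prev = prev; prev->next = n;
}

/* insert at head: last-in, first-out */
static void list_add(struct list_head *n, struct list_head *h)
{ __list_add(n, h, h->next); }

/* insert at tail: first-in, first-out */
static void list_add_tail(struct list_head *n, struct list_head *h)
{ __list_add(n, h->prev, h); }

struct rq { int id; struct list_head queuelist; };

static int head_id(struct list_head *h)
{
	return ((struct rq *)((char *)h->next -
			      offsetof(struct rq, queuelist)))->id;
}

int first_dispatched_lifo(void)
{
	struct list_head dispatch; struct rq a = { 1 }, b = { 2 };

	INIT_LIST_HEAD(&dispatch);
	list_add(&a.queuelist, &dispatch);
	list_add(&b.queuelist, &dispatch);
	return head_id(&dispatch);	/* the request inserted last */
}

int first_dispatched_fifo(void)
{
	struct list_head dispatch; struct rq a = { 1 }, b = { 2 };

	INIT_LIST_HEAD(&dispatch);
	list_add_tail(&a.queuelist, &dispatch);
	list_add_tail(&b.queuelist, &dispatch);
	return head_id(&dispatch);	/* the request inserted first */
}
```

With list_add() the most recently requeued request is dispatched first (the LIFO behavior Ming's patch removes); with list_add_tail() the oldest parked request goes first.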
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On Wed, 2017-08-30 at 19:47 +0300, Sagi Grimberg wrote: > > > If we get to choose, my preference would be to restore the old behavior > > > because currently existing nvme transports keep their internal ctrl > > > representation in the tagset->driver_data as they implement > > > init/exit_request. Bouncing in nvme core would require to change that > > > and always keep struct nvme_ctrl as the tagset->driver_data and convert > > > it on every handler... > > > > > > Everything is doable but it seems like an unneeded hassle to me... > > > > Sorry but I'm not convinced that it would be necessary to change what > > tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from > > the block layer core to the NVMe core, to rename it and to pass a pointer > > to the nvme_ctrl data structure to that function instead of only block layer > > information? > > That would mean that I need to open-code the tagset iteration in nvme > which does not feel like something a driver should do. How about renaming blk_mq_reinit_tagset() into blk_mq_tagset_iter() and to make the argument list of blk_mq_tagset_iter() more similar to that of blk_mq_queue_tag_busy_iter() such that callers of blk_mq_tagset_iter() can pass a pointer to any structure through the @priv argument? That would make this function more general and maybe also more useful to other block drivers. Bart.
Re: [GIT PULL] nvme update for Linux 4.14, take 2
If we get to choose, my preference would be to restore the old behavior because currently existing nvme transports keep their internal ctrl representation in the tagset->driver_data as they implement init/exit_request. Bouncing in nvme core would require to change that and always keep struct nvme_ctrl as the tagset->driver_data and convert it on every handler... Everything is doable but it seems like an unneeded hassle to me... Sorry but I'm not convinced that it would be necessary to change what tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from the block layer core to the NVMe core, to rename it and to pass a pointer to the nvme_ctrl data structure to that function instead of only block layer information? That would mean that I need to open-code the tagset iteration in nvme which does not feel like something a driver should do.
Re: [PATCH V3 05/14] blk-mq-sched: improve dispatching from sw queue
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index 50c6485cb04f..7b7a366a97f3 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx { > > struct sbitmap ctx_map; > > + struct blk_mq_ctx *dispatch_from; > + > struct blk_mq_ctx **ctxs; > unsigned intnr_ctx; Hello Ming, Are you relying here on the fact that the per-CPU queues are never reallocated, even if CPU hot-plugging occurs? Sorry but that seems fragile to me. I would like to see 'dispatch_from' be converted into an integer. It is easy to check whether an integer software queue index is out of range but it's not that easy to check whether a pointer became invalid. Thanks, Bart.
Re: BFQ + dm-mpath
> Il giorno 25 ago 2017, alle ore 01:16, Bart Van Assche > ha scritto: > > Hello Paolo, > Hi Bart > Has BFQ ever been tested in combination with dm-mpath? Unfortunately no. > A few seconds > after I had started a run of the srp-tests software with BFQ a kernel > oops appeared on the console. The command I ran is: > > $ while true; do ~bart/software/infiniband/srp-test/run_tests -d -r 30 -e > bfq; done > > And this is what appeared on the console: > > [89251.977201] BUG: unable to handle kernel NULL pointer dereference at > 0018 > [89251.977201] BUG: unable to handle kernel NULL pointer dereference at > 0018 > [89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq] > [89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq] > [89251.995241] PGD 0 > [89251.995241] PGD 0 > [89251.995243] P4D 0 > [89251.995243] P4D 0 > [89251.999194] > [89251.999194] > [89252.006423] Oops: [#1] PREEMPT SMP > [89252.006423] Oops: [#1] PREEMPT SMP > [89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp > target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd > bfq dm_service_time rdma_cm > iw_cm bridge stp llc ip6table_filter ip > 6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal > intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas > crct10dif_pclmul ioatdma > crc32_pclmul mei_me joydev ghash_clmulni_intel > mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca > cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi > mfd_core ib_ipoib ib_cm > ib_uverbs ib_umad mlx4_ib ib_core dm_mul > tipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs > ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid > mgag200 i2c_algo_bit > drm_kms_helper > [89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp > target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd > bfq dm_service_time rdma_cm > iw_cm bridge stp llc 
ip6table_filter ip > 6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal > intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas > crct10dif_pclmul ioatdma > crc32_pclmul mei_me joydev ghash_clmulni_intel > mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca > cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi > mfd_core ib_ipoib ib_cm > ib_uverbs ib_umad mlx4_ib ib_core dm_mul > tipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs > ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid > mgag200 i2c_algo_bit > drm_kms_helper > [89252.103941] syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci > mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common > pps_core megaraid_sas libphy > [last unloaded: scsi_transport_srp] > [89252.103941] syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci > mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common > pps_core megaraid_sas libphy > [last unloaded: scsi_transport_srp] > [89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: GW > 4.13.0-rc6-dbg+ #2 > [89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: GW > 4.13.0-rc6-dbg+ #2 > [89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 > 09/11/2012 > [89252.139884] Hardware name: Dell Inc. 
PowerEdge R720/0VWT90, BIOS 1.3.6 > 09/11/2012 > [89252.150243] Workqueue: kblockd blk_mq_run_work_fn > [89252.150243] Workqueue: kblockd blk_mq_run_work_fn > [89252.157449] task: 880911d80040 task.stack: c90007c64000 > [89252.157449] task: 880911d80040 task.stack: c90007c64000 > [89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq] > [89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq] > [89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082 > [89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082 > [89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: > 0001 > [89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: > 0001 > [89252.191990] RDX: 8807b9988700 RSI: RDI: > 880288554a68 > [89252.191990] RDX: 8807b9988700 RSI: RDI: > 880288554a68 > [89252.201956] RBP: c90007c67b68 R08: 825820c0 R09: > > [89252.201956] RBP: c90007c67b68 R08: 825820c0 R09: > > [89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: > 880288554dc8 > [89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: > 880288554dc8 > [89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: > 8807bbe03528 > [89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: > 8807bbe03528 > [89252.231715] FS: () GS:88093f7
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On Wed, 2017-08-30 at 19:05 +0300, Sagi Grimberg wrote: > If we get to choose, my preference would be to restore the old behavior > because currently existing nvme transports keep their internal ctrl > representation in the tagset->driver_data as they implement > init/exit_request. Bouncing in nvme core would require to change that > and always keep struct nvme_ctrl as the tagset->driver_data and convert > it on every handler... > > Everything is doable but it seems like an unneeded hassle to me... Sorry but I'm not convinced that it would be necessary to change what tagset->driver_data points at. How about moving blk_mq_reinit_tagset() from the block layer core to the NVMe core, to rename it and to pass a pointer to the nvme_ctrl data structure to that function instead of only block layer information? Bart.
Re: [GIT PULL] nvme update for Linux 4.14, take 2
Hello Sagi, Sorry but I doubt that that patch makes it "impossible" to move controller reset flow to the NVMe core. There are already several function pointers in the nvme_ctrl_ops data structure and there is one such data structure per transport. Had you already considered to add a function pointer to that structure? I have, that's the trampoline function that I was referring to, it feels a bit funny to have an nvme core function that would look like: int nvme_reinit_request() { return ctrl->ops->reinit_request() } I can easily do that, but doesn't it defeat the purpose of blk_mq_ops? I don't think so. Request reinitialization is an NVMe concept that is not used by any other block driver, so why should the pointer to the reinitialization function exist in blk_mq_ops? The point of blk-mq is to provide all the functionality that drivers need, even if it is for just a single driver. Functionality that can be removed from drivers is good. The smaller/simpler we can make the driver, the better off we are, even if that means adding a bit of complexity to the core. Obviously this is a case-by-case decision. For this particular case, I'm happy with either solution. If we get to choose, my preference would be to restore the old behavior because currently existing nvme transports keep their internal ctrl representation in the tagset->driver_data as they implement init/exit_request. Bouncing in nvme core would require to change that and always keep struct nvme_ctrl as the tagset->driver_data and convert it on every handler... Everything is doable but it seems like an unneeded hassle to me...
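The "trampoline" pattern Sagi's pseudo-code alludes to — a core entry point that bounces into a per-transport function pointer — can be sketched in full, compilable form as follows. The names (ctrl_ops, reinit_request, the rdma transport) mirror the thread's vocabulary but are illustrative stand-ins, not the real NVMe structures:

```c
#include <assert.h>

struct ctrl;

/* Per-transport operations table, one instance per transport. */
struct ctrl_ops {
	int (*reinit_request)(struct ctrl *c);
};

struct ctrl {
	const struct ctrl_ops *ops;
	int reinit_count;	/* bookkeeping for the demo */
};

/* A transport-specific implementation (here: a pretend rdma transport). */
static int rdma_reinit_request(struct ctrl *c)
{
	return ++c->reinit_count;
}

static const struct ctrl_ops rdma_ops = {
	.reinit_request = rdma_reinit_request,
};

/* The "trampoline" in the core: one line of indirection per call. */
int core_reinit_request(struct ctrl *c)
{
	return c->ops->reinit_request(c);
}
```

The debate is only about where this indirection lives: in blk_mq_ops (making it block-layer API surface used by a single driver) or in an NVMe-core ops table like the one above.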
Re: [PATCH V3 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > This function is introduced for dequeuing request > from sw queue so that we can dispatch it in > scheduler's way. > > More importantly, some SCSI devices may set > q->queue_depth, which is a per-request_queue limit, > and applied on pending I/O from all hctxs. This > function is introduced for avoiding to dequeue too > many requests from sw queue when ->dispatch isn't > flushed completely. Reviewed-by: Bart Van Assche
Re: [PATCH V3 02/14] sbitmap: introduce __sbitmap_for_each_set()
On Sun, 2017-08-27 at 00:33 +0800, Ming Lei wrote: > /** > * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap. > + * @start: Where to start the iteration Thanks for having changed the name of this argument ... > -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn > fn, > - void *data) > +static inline void __sbitmap_for_each_set(struct sbitmap *sb, > + unsigned int start, > + sb_for_each_fn fn, void *data) > { > - unsigned int i; > + unsigned int index = SB_NR_TO_INDEX(sb, start); > + unsigned int nr = SB_NR_TO_BIT(sb, start); > + unsigned int scanned = 0; ... but I'm still missing a check here whether or not index >= sb->map_nr. > - for (i = 0; i < sb->map_nr; i++) { > - struct sbitmap_word *word = &sb->map[i]; > - unsigned int off, nr; > + while (1) { > + struct sbitmap_word *word = &sb->map[index]; > + unsigned int depth = min_t(unsigned int, word->depth - nr, > +sb->depth - scanned); > > + scanned += depth; > if (!word->word) > - continue; > + goto next; > > - nr = 0; > - off = i << sb->shift; > + depth += nr; > + start = index << sb->shift; The above statement reuses the argument 'start' for a new purpose. This is confusing - please don't do this. Why not keep the name 'off'? That will keep the changes minimal. Bart.
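Stripped of the sbitmap word/shift layout, the wrap-around iteration that __sbitmap_for_each_set() introduces looks like the sketch below: start scanning at 'start', wrap at the end, and stop once every slot has been visited exactly once. It also includes the bounds guard on 'start' that the review asks for; the function and parameter names are illustrative, not the sbitmap API:

```c
#include <assert.h>

/* Visit every set slot of 'set' (length nbits) exactly once, beginning at
 * 'start' and wrapping around; record visited set-slot indices in 'order'
 * and return how many were found. */
int visit_from(const int *set, int nbits, unsigned int start,
	       int *order, int order_len)
{
	unsigned int i = start % (unsigned int)nbits; /* guard bad 'start' */
	int scanned = 0, found = 0;

	while (scanned < nbits) {
		if (set[i] && found < order_len)
			order[found++] = (int)i;
		scanned++;
		if (++i == (unsigned int)nbits)
			i = 0;	/* wrap around to the first slot */
	}
	return found;
}
```

Starting at slot 2 of {1, 0, 1, 1}, the visit order is 2, 3, then (after the wrap) 0 — the same fairness property the patch wants for resuming dispatch from the last-used software queue.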
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On 08/30/2017 09:46 AM, Bart Van Assche wrote: > On Wed, 2017-08-30 at 18:33 +0300, Sagi Grimberg wrote: I just realized that patch: -- commit d352ae205d8b05f3f7558d10f474d8436581b3e2 Author: Bart Van Assche Date: Thu Aug 17 16:23:03 2017 -0700 blk-mq: Make blk_mq_reinit_tagset() calls easier to read Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze. -- Makes it impossible for me to move controller reset flow to nvme-core without adding a trampoline (as the reinit_request is transport specific)... >>> >>> Hello Sagi, >>> >>> Sorry but I doubt that that patch makes it "impossible" to move controller >>> reset flow to the NVMe core. There are already several function pointers in >>> the nvme_ctrl_ops data structure and there is one such data structure per >>> transport. Had you already considered to add a function pointer to that >>> structure? >> >> I have, that's the trampoline function that I was referring to, it feels >> a bit funny to have aa nvme core function that would look like: >> >> int nvme_reinit_request() >> { >> return ctrl->ops->reinit_request() >> } >> >> I can easily do that, but doesn't it defeat the purpose of blk_mq_ops? > > I don't think so. Request reinitialization is an NVMe concept that is > not used by any other block driver, so why should the pointer to the > reinitialization function exist in blk_mq_ops? The point of blk-mq is to provide all the functionality that drivers need, even if it is for just a single driver. Functionality that can be removed from drivers is good. The smaller/simpler we can make the driver, the better off we are, even if that means adding a bit of complexity to the core. Obviously this is a case-by-case decision. 
For this particular case, I'm happy with either solution. -- Jens Axboe
Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch
On 08/30/2017 09:39 AM, Ming Lei wrote: > On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote: >> On 08/30/2017 09:19 AM, Ming Lei wrote: >>> It is more reasonable to add requests to ->dispatch in way >>> of FIFO style, instead of LIFO style. >>> >>> Also in this way, we can allow to insert request at the front >>> of hw queue, which function is needed to fix one bug >>> in blk-mq's implementation of blk_execute_rq() >>> >>> Reported-by: Oleksandr Natalenko >>> Tested-by: Oleksandr Natalenko >>> Signed-off-by: Ming Lei >>> --- >>> block/blk-mq-sched.c | 2 +- >>> block/blk-mq.c | 2 +- >>> 2 files changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >>> index 4ab69435708c..8d97df40fc28 100644 >>> --- a/block/blk-mq-sched.c >>> +++ b/block/blk-mq-sched.c >>> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct >>> blk_mq_hw_ctx *hctx, >>> * the dispatch list. >>> */ >>> spin_lock(&hctx->lock); >>> - list_add(&rq->queuelist, &hctx->dispatch); >>> + list_add_tail(&rq->queuelist, &hctx->dispatch); >>> spin_unlock(&hctx->lock); >>> return true; >>> } >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 4603b115e234..fed3d0c16266 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, >>> struct list_head *list) >>> blk_mq_put_driver_tag(rq); >>> >>> spin_lock(&hctx->lock); >>> - list_splice_init(list, &hctx->dispatch); >>> + list_splice_tail_init(list, &hctx->dispatch); >>> spin_unlock(&hctx->lock); >> >> I'm not convinced this is safe, there's actually a reason why the >> request is added to the front and not the back. We do have >> reorder_tags_to_front() as a safe guard, but I'd much rather get rid of > > reorder_tags_to_front() is for reordering the requests in current list, > this patch is for splicing list into hctx->dispatch, so I can't see > it isn't safe, or could you explain it a bit? 
If we can get the ordering right, then down the line we won't need to have the tags reordering at all. It's an ugly hack that I'd love to see go away. >> that than make this change. >> >> What's your reasoning here? Your changelog doesn't really explain why > > Firstly the 2nd patch need to add one rq(such as RQF_PM) to the > front of the hw queue, the simple way is to add it to the front > of hctx->dispatch. Without this change, the 2nd patch can't work > at all. > > Secondly this way is still reasonable: > > - one rq is added to hctx->dispatch because queue is busy > - another rq is added to hctx->dispatch too because of same reason > > so it is reasonable to to add list into hctx->dispatch in FIFO style. Not disagreeing with the logic. But it also begs the question of why we don't apply the same treatment to when we splice leftovers to the dispatch list, currently we front splice that. All I'm saying is that you need to tread very carefully with this, and throw it through some careful testing to ensure that we don't introduce conditions that now livelock. NVMe is the easy test case, that will generally always work since we never run out of tags. The problematic test case is usually things like SATA with 31 tags, and especially SATA with flushes that don't queue. One good test case is the one where you end up having all tags (or almost all) consumed by flushes, and still ensuring that we're making forward progress. -- Jens Axboe
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On Wed, 2017-08-30 at 18:33 +0300, Sagi Grimberg wrote: > > > I just realized that patch: > > > -- > > > commit d352ae205d8b05f3f7558d10f474d8436581b3e2 > > > Author: Bart Van Assche > > > Date: Thu Aug 17 16:23:03 2017 -0700 > > > > > > blk-mq: Make blk_mq_reinit_tagset() calls easier to read > > > > > > Since blk_mq_ops.reinit_request is only called from inside > > > blk_mq_reinit_tagset(), make this function pointer an argument of > > > blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. > > > This patch does not change any functionality but makes > > > blk_mq_reinit_tagset() calls easier to read and to analyze. > > > -- > > > > > > Makes it impossible for me to move controller reset flow to > > > nvme-core without adding a trampoline (as the reinit_request > > > is transport specific)... > > > > Hello Sagi, > > > > Sorry but I doubt that that patch makes it "impossible" to move controller > > reset flow to the NVMe core. There are already several function pointers in > > the nvme_ctrl_ops data structure and there is one such data structure per > > transport. Had you already considered to add a function pointer to that > > structure? > > I have, that's the trampoline function that I was referring to, it feels > a bit funny to have aa nvme core function that would look like: > > int nvme_reinit_request() > { > return ctrl->ops->reinit_request() > } > > I can easily do that, but doesn't it defeat the purpose of blk_mq_ops? I don't think so. Request reinitialization is an NVMe concept that is not used by any other block driver, so why should the pointer to the reinitialization function exist in blk_mq_ops? Bart.
Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch
On Wed, Aug 30, 2017 at 09:22:42AM -0600, Jens Axboe wrote: > On 08/30/2017 09:19 AM, Ming Lei wrote: > > It is more reasonable to add requests to ->dispatch in way > > of FIFO style, instead of LIFO style. > > > > Also in this way, we can allow to insert request at the front > > of hw queue, which function is needed to fix one bug > > in blk-mq's implementation of blk_execute_rq() > > > > Reported-by: Oleksandr Natalenko > > Tested-by: Oleksandr Natalenko > > Signed-off-by: Ming Lei > > --- > > block/blk-mq-sched.c | 2 +- > > block/blk-mq.c | 2 +- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > > index 4ab69435708c..8d97df40fc28 100644 > > --- a/block/blk-mq-sched.c > > +++ b/block/blk-mq-sched.c > > @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct > > blk_mq_hw_ctx *hctx, > > * the dispatch list. > > */ > > spin_lock(&hctx->lock); > > - list_add(&rq->queuelist, &hctx->dispatch); > > + list_add_tail(&rq->queuelist, &hctx->dispatch); > > spin_unlock(&hctx->lock); > > return true; > > } > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 4603b115e234..fed3d0c16266 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, > > struct list_head *list) > > blk_mq_put_driver_tag(rq); > > > > spin_lock(&hctx->lock); > > - list_splice_init(list, &hctx->dispatch); > > + list_splice_tail_init(list, &hctx->dispatch); > > spin_unlock(&hctx->lock); > > I'm not convinced this is safe, there's actually a reason why the > request is added to the front and not the back. We do have > reorder_tags_to_front() as a safe guard, but I'd much rather get rid of reorder_tags_to_front() is for reordering the requests in current list, this patch is for splicing list into hctx->dispatch, so I can't see it isn't safe, or could you explain it a bit? > that than make this change. > > What's your reasoning here? 
Your changelog doesn't really explain why Firstly the 2nd patch needs to add one rq (such as RQF_PM) to the front of the hw queue, and the simple way is to add it to the front of hctx->dispatch. Without this change, the 2nd patch can't work at all. Secondly this way is still reasonable: - one rq is added to hctx->dispatch because the queue is busy - another rq is added to hctx->dispatch too for the same reason so it is reasonable to add the list to hctx->dispatch in FIFO style. Finally my patchset for 'improving SCSI-MQ perf' will change to not dequeue rq from the sw/scheduler queue if ->dispatch isn't flushed. I believe that is the reasonable and correct thing to do, and with that change, there won't be any difference between the two styles. -- Ming
Re: [GIT PULL] nvme update for Linux 4.14, take 2
I just realized that patch: -- commit d352ae205d8b05f3f7558d10f474d8436581b3e2 Author: Bart Van Assche Date: Thu Aug 17 16:23:03 2017 -0700 blk-mq: Make blk_mq_reinit_tagset() calls easier to read Since blk_mq_ops.reinit_request is only called from inside blk_mq_reinit_tagset(), make this function pointer an argument of blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. This patch does not change any functionality but makes blk_mq_reinit_tagset() calls easier to read and to analyze. -- Makes it impossible for me to move controller reset flow to nvme-core without adding a trampoline (as the reinit_request is transport specific)... Hello Sagi, Sorry but I doubt that that patch makes it "impossible" to move controller reset flow to the NVMe core. There are already several function pointers in the nvme_ctrl_ops data structure and there is one such data structure per transport. Had you already considered to add a function pointer to that structure? I have, that's the trampoline function that I was referring to, it feels a bit funny to have an nvme core function that would look like: int nvme_reinit_request() { return ctrl->ops->reinit_request() } I can easily do that, but doesn't it defeat the purpose of blk_mq_ops?
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On Wed, 2017-08-30 at 18:10 +0300, Sagi Grimberg wrote: > I just realized that patch: > -- > commit d352ae205d8b05f3f7558d10f474d8436581b3e2 > Author: Bart Van Assche > Date: Thu Aug 17 16:23:03 2017 -0700 > > blk-mq: Make blk_mq_reinit_tagset() calls easier to read > > Since blk_mq_ops.reinit_request is only called from inside > blk_mq_reinit_tagset(), make this function pointer an argument of > blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops. > This patch does not change any functionality but makes > blk_mq_reinit_tagset() calls easier to read and to analyze. > -- > > Makes it impossible for me to move controller reset flow to > nvme-core without adding a trampoline (as the reinit_request > is transport specific)... Hello Sagi, Sorry but I doubt that that patch makes it "impossible" to move controller reset flow to the NVMe core. There are already several function pointers in the nvme_ctrl_ops data structure and there is one such data structure per transport. Had you already considered to add a function pointer to that structure? Bart.
Re: [PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch
On 08/30/2017 09:19 AM, Ming Lei wrote:
> It is more reasonable to add requests to ->dispatch in way
> of FIFO style, instead of LIFO style.
>
> Also in this way, we can allow to insert request at the front
> of hw queue, which function is needed to fix one bug
> in blk-mq's implementation of blk_execute_rq()
>
> Reported-by: Oleksandr Natalenko
> Tested-by: Oleksandr Natalenko
> Signed-off-by: Ming Lei
> ---
>  block/blk-mq-sched.c | 2 +-
>  block/blk-mq.c       | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..8d97df40fc28 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>  	 * the dispatch list.
>  	 */
>  	spin_lock(&hctx->lock);
> -	list_add(&rq->queuelist, &hctx->dispatch);
> +	list_add_tail(&rq->queuelist, &hctx->dispatch);
>  	spin_unlock(&hctx->lock);
>  	return true;
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4603b115e234..fed3d0c16266 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
>  	blk_mq_put_driver_tag(rq);
>
>  	spin_lock(&hctx->lock);
> -	list_splice_init(list, &hctx->dispatch);
> +	list_splice_tail_init(list, &hctx->dispatch);
>  	spin_unlock(&hctx->lock);

I'm not convinced this is safe, there's actually a reason why the request is added to the front and not the back. We do have reorder_tags_to_front() as a safe guard, but I'd much rather get rid of that than make this change. What's your reasoning here? Your changelog doesn't really explain why this fixes anything, it's very vague.

--
Jens Axboe
[PATCH 2/2] blk-mq: align to legacy's implementation of blk_execute_rq
In the legacy path, when a request is executed via blk_execute_rq(), the request is added to the front of q->queue_head directly, and the I/O scheduler's queue is bypassed because neither merging nor sorting is needed.

When a SCSI device is put into the quiesce state, such as during system suspend, we need to add the RQF_PM request to the front of the queue. This patch fixes an I/O hang after system resume by taking an implementation similar to the legacy path.

Tested-by: Oleksandr Natalenko
Reported-by: Oleksandr Natalenko
Signed-off-by: Ming Lei
---
 block/blk-core.c     |  2 +-
 block/blk-exec.c     |  2 +-
 block/blk-flush.c    |  2 +-
 block/blk-mq-sched.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h |  2 ++
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index dbecbf4a64e0..fb75bc646ebc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2330,7 +2330,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 	if (q->mq_ops) {
 		if (blk_queue_io_stat(q))
 			blk_account_io_start(rq, true);
-		blk_mq_sched_insert_request(rq, false, true, false, false);
+		blk_mq_sched_insert_request_bypass(rq, false, true, false, false);
 		return BLK_STS_OK;
 	}
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 5c0f3dc446dc..4565aa6bb624 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 	 * be reused after dying flag is set
 	 */
 	if (q->mq_ops) {
-		blk_mq_sched_insert_request(rq, at_head, true, false, false);
+		blk_mq_sched_insert_request_bypass(rq, at_head, true, false, false);
 		return;
 	}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index ed5fe322abba..51e89e5c525a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
 		if (q->mq_ops)
-			blk_mq_sched_insert_request(rq, false, true, false, false);
+			blk_mq_sched_insert_request_bypass(rq, false, true, false, false);
 		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 8d97df40fc28..b40dd063d61f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -354,6 +354,64 @@ static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
 	blk_mq_add_to_requeue_list(rq, false, true);
 }
 
+static void blk_mq_flush_hctx(struct blk_mq_hw_ctx *hctx,
+			      struct elevator_queue *e,
+			      const bool has_sched_dispatch,
+			      struct list_head *rqs)
+{
+	LIST_HEAD(list);
+
+	if (!has_sched_dispatch)
+		blk_mq_flush_busy_ctxs(hctx, &list);
+	else {
+		while (true) {
+			struct request *rq;
+
+			rq = e->type->ops.mq.dispatch_request(hctx);
+			if (!rq)
+				break;
+			list_add_tail(&rq->queuelist, &list);
+		}
+	}
+
+	list_splice_tail(&list, rqs);
+}
+
+void blk_mq_sched_insert_request_bypass(struct request *rq, bool at_head,
+					bool run_queue, bool async,
+					bool can_block)
+{
+	struct request_queue *q = rq->q;
+	struct elevator_queue *e = q->elevator;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	LIST_HEAD(list);
+	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
+
+	if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
+		blk_mq_sched_insert_flush(hctx, rq, can_block);
+		return;
+	}
+
+	if (at_head)
+		list_add_tail(&rq->queuelist, &list);
+	else {
+		blk_mq_flush_hctx(hctx, e, has_sched_dispatch, &list);
+		list_add_tail(&rq->queuelist, &list);
+		run_queue = true;
+	}
+
+	spin_lock(&hctx->lock);
+	if (at_head)
+		list_splice(&list, &hctx->dispatch);
+	else
+		list_splice_tail(&list, &hctx->dispatch);
+	spin_unlock(&hctx->lock);
+
+	if (run_queue)
+		blk_mq_run_hw_queue(hctx, async);
+}
+
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 				 bool run_queue, bool async, bool can_block)
 {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..4d01697a627f 100644
---
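The ordering logic of the bypass insert can be modeled with plain arrays: an at_head request (e.g. the RQF_PM request during resume) goes straight to the front of the dispatch list, while an ordinary bypass insert first drains whatever the scheduler holds so the new request lands behind everything queued before it. This is a toy sketch with our own names, not the kernel implementation:

```c
#include <assert.h>

#define CAP 16

struct q { int rq[CAP]; int len; };

static void push_front(struct q *q, int rq)
{
    for (int i = q->len; i > 0; i--)
        q->rq[i] = q->rq[i - 1];
    q->rq[0] = rq;
    q->len++;
}

static void push_back(struct q *q, int rq)
{
    q->rq[q->len++] = rq;
}

/* sched holds not-yet-dispatched requests; dispatch is hctx->dispatch */
static void insert_bypass(struct q *dispatch, struct q *sched,
                          int rq, int at_head)
{
    if (at_head) {
        push_front(dispatch, rq);       /* jumps the whole queue */
        return;
    }
    /* drain the scheduler first so rq keeps FIFO order overall */
    for (int i = 0; i < sched->len; i++)
        push_back(dispatch, sched->rq[i]);
    sched->len = 0;
    push_back(dispatch, rq);
}
```

This mirrors why patch 1/2 (FIFO-style ->dispatch) is a prerequisite: front insertion only means "dispatch first" if later arrivals go to the tail.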
[PATCH 0/2] blk-mq: fix I/O hang during system resume
Hi,

These two patches fix an I/O hang of SCSI-MQ during system resume.

The cause is that when a SCSI device is put into SCSI's quiesce state, normal I/O requests can't be dispatched to the LLD any more; only requests with RQF_PREEMPT are allowed to be sent to the drive. In the current blk-mq implementation, if there are requests in ->dispatch, no new request can be dispatched to the driver.

These two patches fix the issue reported by Oleksandr.

Thanks,
Ming

Ming Lei (2):
  blk-mq: add requests in the tail of hctx->dispatch
  blk-mq: align to legacy's implementation of blk_execute_rq

 block/blk-core.c     |  2 +-
 block/blk-exec.c     |  2 +-
 block/blk-flush.c    |  2 +-
 block/blk-mq-sched.c | 60 +++-
 block/blk-mq-sched.h |  2 ++
 block/blk-mq.c       |  2 +-
 6 files changed, 65 insertions(+), 5 deletions(-)

--
2.9.5
[PATCH 1/2] blk-mq: add requests in the tail of hctx->dispatch
It is more reasonable to add requests to ->dispatch in FIFO order instead of LIFO order.

This also allows us to insert a request at the front of the hw queue, which is needed to fix one bug in blk-mq's implementation of blk_execute_rq().

Reported-by: Oleksandr Natalenko
Tested-by: Oleksandr Natalenko
Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 2 +-
 block/blk-mq.c       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..8d97df40fc28 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * the dispatch list.
 	 */
 	spin_lock(&hctx->lock);
-	list_add(&rq->queuelist, &hctx->dispatch);
+	list_add_tail(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 	return true;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..fed3d0c16266 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 	blk_mq_put_driver_tag(rq);
 
 	spin_lock(&hctx->lock);
-	list_splice_init(list, &hctx->dispatch);
+	list_splice_tail_init(list, &hctx->dispatch);
 	spin_unlock(&hctx->lock);

--
2.9.5
[PATCH 0/2] blk-mq: fix I/O hang during system resume
Hi,

These two patches fix a SCSI I/O hang during system resume.

The cause is that when a SCSI device is put into SCSI's quiesce state, normal I/O requests can't be dispatched to the LLD any more; only requests with RQF_PREEMPT are allowed to be sent to the drive. In the current blk-mq implementation, if there are requests in ->dispatch, no new request can be dispatched to the driver.

These two patches fix the issue reported by Oleksandr.

Thanks,
Ming

Ming Lei (2):
  blk-mq: add requests in the tail of hctx->dispatch
  blk-mq: align to legacy's implementation of blk_execute_rq

 block/blk-core.c     |  2 +-
 block/blk-exec.c     |  2 +-
 block/blk-flush.c    |  2 +-
 block/blk-mq-sched.c | 60 +++-
 block/blk-mq-sched.h |  2 ++
 block/blk-mq.c       |  2 +-
 6 files changed, 65 insertions(+), 5 deletions(-)

--
2.9.5
Re: [GIT PULL] last minute NVMe fixes for 4.13
On 08/30/2017 09:07 AM, Christoph Hellwig wrote: > Hi Jens, > > three more fixes for 4.13 below: > > - fix the incorrect bit for the doorbell buffer features (Changpeng Liu) > - always use a 4k MR page size for RDMA, to not get in trouble with >offset in non-4k page size systems (no-op for x86) (Max Gurtovoy) > - and a fix for the new nvme host memory buffer support to keep the >descriptor list DMA mapped when the buffer is enabled (me) Pulled, thanks. -- Jens Axboe
Re: [GIT PULL] nvme update for Linux 4.14, take 2
Hi Jens,

below is the current set of NVMe updates for Linux 4.14, now against your postmerge branch, and with three more patches. The biggest bit comes from Sagi and refactors the RDMA driver to prepare for more code sharing in the setup and teardown path. But we have various features and bug fixes from a lot of people as well.

Pulled, thanks.

I just realized that patch:

--
commit d352ae205d8b05f3f7558d10f474d8436581b3e2
Author: Bart Van Assche
Date:   Thu Aug 17 16:23:03 2017 -0700

    blk-mq: Make blk_mq_reinit_tagset() calls easier to read

    Since blk_mq_ops.reinit_request is only called from inside
    blk_mq_reinit_tagset(), make this function pointer an argument of
    blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
    This patch does not change any functionality but makes
    blk_mq_reinit_tagset() calls easier to read and to analyze.
--

Makes it impossible for me to move controller reset flow to nvme-core without adding a trampoline (as the reinit_request is transport specific)...

I'd really like to avoid adding more churn to the code refactoring than it already has. Given that this patch is not fixing anything, should we revert it?
[GIT PULL] last minute NVMe fixes for 4.13
Hi Jens,

three more fixes for 4.13 below:

- fix the incorrect bit for the doorbell buffer features (Changpeng Liu)
- always use a 4k MR page size for RDMA, to not get in trouble with offset in non-4k page size systems (no-op for x86) (Max Gurtovoy)
- and a fix for the new nvme host memory buffer support to keep the descriptor list DMA mapped when the buffer is enabled (me)

The following changes since commit e9d8a0fdeacd843c85dcef480cdb2ab76bcdb6e4:

  nvme-pci: set cqe_seen on polled completions (2017-08-18 09:19:39 +0200)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.13

for you to fetch changes up to 223694b9ae8bfba99f3528d49d07a740af6ff95a:

  nvme: fix the definition of the doorbell buffer config support bit (2017-08-30 14:46:32 +0200)

Changpeng Liu (1):
      nvme: fix the definition of the doorbell buffer config support bit

Christoph Hellwig (1):
      nvme-pci: use dma memory for the host memory buffer descriptors

Max Gurtovoy (1):
      nvme-rdma: default MR page size to 4k

 drivers/nvme/host/pci.c  | 22 +++++++++++---------
 drivers/nvme/host/rdma.c |  8 ++++++--
 include/linux/nvme.h     |  2 +-
 3 files changed, 18 insertions(+), 14 deletions(-)
Re: [PATCH] block, bfq: fix error handle in bfq_init
> Il giorno 23 ago 2017, alle ore 17:25, weiping zhang ha scritto:
>
> On Sat, Aug 19, 2017 at 12:37:20AM +0800, weiping zhang wrote:
>> if elv_register fails, bfq_pool should be freed.
>>
>> Signed-off-by: weiping zhang
>> ---
>>  block/bfq-iosched.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 436b6ca..bdb072f 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -5048,10 +5048,12 @@ static int __init bfq_init(void)
>>
>>  	ret = elv_register(&iosched_bfq_mq);
>>  	if (ret)
>> -		goto err_pol_unreg;
>> +		goto slab_kill;
>>
>>  	return 0;
>>
>> +slab_kill:
>> +	bfq_slab_kill();
>>  err_pol_unreg:
>>  #ifdef CONFIG_BFQ_GROUP_IOSCHED
>>  	blkcg_policy_unregister(&blkcg_policy_bfq);
>> --
>> 2.9.4
>>
> Hi Paolo, Jens,
>
> could you give some comments for this patch?

Sorry for not making it in time (vacation), and thanks for fixing this bug.

Paolo

> Thanks
Re: I/O hangs after resuming from suspend-to-ram
Hi,

On Wed, Aug 30, 2017 at 12:58:21PM +0200, oleksa...@natalenko.name wrote:
> Hi.
>
> So, current summary:
>
> 1) first patch + debug patch: I can reproduce the issue in the wild, but not
> with pm_test set;

That is interesting, since I always test via pm_test.

> 2) first patch + debug patch + second patch: I cannot reproduce the issue at
> all, neither with "none", nor with "mq-deadline".
>
> Thus, "blk-mq: align to legacy path for implementing blk_execute_rq" +
> "blk-mq: add requests in the tail of hctx->dispatch" look like a proper fix.
> Thanks for that!

That is great, will prepare a formal one for merge.

> > I setup dm-crypt on /dev/md0 directly, could you show me how to setup
> > lvm on raid10?
>
> If still relevant, here is a layout:

Looks not relevant, just because I use pm_test.

> ===
> sda               8:0    0    8G  0 disk
> |-sda1            8:1    0  128M  0 part   /boot/EFI
> `-sda2            8:2    0  7.9G  0 part
>   `-md0           9:0    0  7.9G  0 raid10
>     `-system-root 253:0  0  7.9G  0 lvm    /
> sdb               8:16   0    8G  0 disk
> |-sdb1            8:17   0  128M  0 part
> `-sdb2            8:18   0  7.9G  0 part
>   `-md0           9:0    0  7.9G  0 raid10
>     `-system-root 253:0  0  7.9G  0 lvm    /
> ===
>
> Anything else you'd like /me to test?

You have done enough, thanks a lot for your report and test!

--
Ming
Re: [RFC] block: deprecate choosing elevator via boot param
On 08/14/2017 03:27 AM, Oleksandr Natalenko wrote: > Setting I/O scheduler via kernel command line is not flexible enough > anymore. Different schedulers might be desirable for different types > of devices (SSDs and HDDs, for instance). Moreover, setting elevator > while using blk-mq framework does not work in this way already. > > This commit enables warning if user specifies "elevator" boot param. > Removing this option at all might be considered in some future. Not a huge fan of doing it like this, it'll just add a bunch of noise to the dmesg output that people will ignore. It doesn't bring us closer to being able to remove the option. I think we should just let it die over time. This will happen naturally since the blk-mq scheduling framework does not support it. Once we have everything converted, the option will end up doing nothing. Then we can eventually kill it. -- Jens Axboe
[PATCH] kernfs: checking for IS_ERR() instead of NULL
kernfs_get_inode() returns NULL on error; it never returns error pointers.

Fixes: aa8188253474 ("kernfs: add exportfs operations")
Signed-off-by: Dan Carpenter

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 7c452f4d83e9..95a7c88baed9 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -99,8 +99,8 @@ static struct inode *kernfs_fh_get_inode(struct super_block *sb,
 		return ERR_PTR(-ESTALE);
 	inode = kernfs_get_inode(sb, kn);
 	kernfs_put(kn);
-	if (IS_ERR(inode))
-		return ERR_CAST(inode);
+	if (!inode)
+		return ERR_PTR(-ESTALE);
 	if (generation && inode->i_generation != generation) {
 		/* we didn't find the right inode.. */
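The bug class here is mixing up the kernel's two error conventions: some functions encode an errno in the returned pointer (checked with IS_ERR()), others return plain NULL on failure. A user-space mock of the encoding (the helpers mimic the spirit of include/linux/err.h; exact kernel definitions differ) shows why an IS_ERR() check can never catch a NULL return:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* encode a negative errno in a pointer value near the top of the address space */
static void *ERR_PTR(long err)
{
    return (void *)err;
}

static long PTR_ERR(const void *p)
{
    return (long)(intptr_t)p;
}

/* true only for pointers in the last MAX_ERRNO bytes of the address space */
static int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}
```

Since NULL is numerically 0, IS_ERR(NULL) is false — exactly why the original `if (IS_ERR(inode))` silently let a NULL inode from kernfs_get_inode() slip through, and why the fix checks `if (!inode)` instead.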
Re: [PATCH V6 04/12] mmc: core: Turn off CQE before sending commands
On 25 August 2017 at 14:43, Adrian Hunter wrote: > CQE needs to be off for the host controller to accept non-CQ commands. Turn > off the CQE before sending commands, and ensure it is off in any reset or > power management paths, or re-tuning. > > Signed-off-by: Adrian Hunter > Reviewed-by: Linus Walleij > Signed-off-by: Ulf Hansson This one was already applied. Kind regards Uffe > --- > drivers/mmc/core/core.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c > index 5dd1c00d95f5..66c9cf49ad2f 100644 > --- a/drivers/mmc/core/core.c > +++ b/drivers/mmc/core/core.c > @@ -260,6 +260,9 @@ static void __mmc_start_request(struct mmc_host *host, > struct mmc_request *mrq) > > trace_mmc_request_start(host, mrq); > > + if (host->cqe_on) > + host->cqe_ops->cqe_off(host); > + > host->ops->request(host, mrq); > } > > @@ -979,6 +982,9 @@ int mmc_execute_tuning(struct mmc_card *card) > if (!host->ops->execute_tuning) > return 0; > > + if (host->cqe_on) > + host->cqe_ops->cqe_off(host); > + > if (mmc_card_mmc(card)) > opcode = MMC_SEND_TUNING_BLOCK_HS200; > else > @@ -1018,6 +1024,9 @@ void mmc_set_bus_width(struct mmc_host *host, unsigned > int width) > */ > void mmc_set_initial_state(struct mmc_host *host) > { > + if (host->cqe_on) > + host->cqe_ops->cqe_off(host); > + > mmc_retune_disable(host); > > if (mmc_host_is_spi(host)) > -- > 1.9.1 >
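The patch adds the same guard in all three paths: if the CQE is on, switch it off before issuing a non-CQ command. A minimal mock of the pattern (our names and fields, not the real mmc host API):

```c
#include <assert.h>

struct host {
    int cqe_on;      /* mirrors host->cqe_on */
    int off_calls;   /* bookkeeping: how often we actually switched off */
};

/* stands in for host->cqe_ops->cqe_off(): idempotent from the caller's view */
static void cqe_off(struct host *h)
{
    h->cqe_on = 0;
    h->off_calls++;
}

/* every non-CQ issue path checks the flag first, as the patch does in
 * __mmc_start_request(), mmc_execute_tuning() and mmc_set_initial_state() */
static void start_request(struct host *h)
{
    if (h->cqe_on)
        cqe_off(h);
    /* ...now the controller accepts the non-CQ command... */
}
```

The point of guarding on the flag rather than calling cqe_off unconditionally is that the off transition happens at most once until something turns the CQE back on.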
Re: [PATCH V6 03/12] mmc: host: Add CQE interface
On 25 August 2017 at 14:43, Adrian Hunter wrote: > Add CQE host operations, capabilities, and host members. > > Signed-off-by: Adrian Hunter I have now replaced the previous version with this one, applied for next! Thanks! Kind regards Uffe > --- > include/linux/mmc/core.h | 6 ++ > include/linux/mmc/host.h | 53 > > 2 files changed, 59 insertions(+) > > diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h > index 178f699ac172..927519385482 100644 > --- a/include/linux/mmc/core.h > +++ b/include/linux/mmc/core.h > @@ -156,6 +156,12 @@ struct mmc_request { > struct completion completion; > struct completion cmd_completion; > void(*done)(struct mmc_request *);/* completion > function */ > + /* > +* Notify uppers layers (e.g. mmc block driver) that recovery is > needed > +* due to an error associated with the mmc_request. Currently used > only > +* by CQE. > +*/ > + void(*recovery_notifier)(struct mmc_request *); > struct mmc_host *host; > > /* Allow other commands during this ongoing data transfer or busy > wait */ > diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h > index e92629518f68..f3f2d07feb2a 100644 > --- a/include/linux/mmc/host.h > +++ b/include/linux/mmc/host.h > @@ -162,6 +162,50 @@ struct mmc_host_ops { > unsigned int direction, int blk_size); > }; > > +struct mmc_cqe_ops { > + /* Allocate resources, and make the CQE operational */ > + int (*cqe_enable)(struct mmc_host *host, struct mmc_card *card); > + /* Free resources, and make the CQE non-operational */ > + void(*cqe_disable)(struct mmc_host *host); > + /* > +* Issue a read, write or DCMD request to the CQE. Also deal with the > +* effect of ->cqe_off(). > +*/ > + int (*cqe_request)(struct mmc_host *host, struct mmc_request > *mrq); > + /* Free resources (e.g. DMA mapping) associated with the request */ > + void(*cqe_post_req)(struct mmc_host *host, struct mmc_request > *mrq); > + /* > +* Prepare the CQE and host controller to accept non-CQ commands. 
> There > +* is no corresponding ->cqe_on(), instead ->cqe_request() is required > +* to deal with that. > +*/ > + void(*cqe_off)(struct mmc_host *host); > + /* > +* Wait for all CQE tasks to complete. Return an error if recovery > +* becomes necessary. > +*/ > + int (*cqe_wait_for_idle)(struct mmc_host *host); > + /* > +* Notify CQE that a request has timed out. Return false if the > request > +* completed or true if a timeout happened in which case indicate if > +* recovery is needed. > +*/ > + bool(*cqe_timeout)(struct mmc_host *host, struct mmc_request *mrq, > + bool *recovery_needed); > + /* > +* Stop all CQE activity and prepare the CQE and host controller to > +* accept recovery commands. > +*/ > + void(*cqe_recovery_start)(struct mmc_host *host); > + /* > +* Clear the queue and call mmc_cqe_request_done() on all requests. > +* Requests that errored will have the error set on the mmc_request > +* (data->error or cmd->error for DCMD). Requests that did not error > +* will have zero data bytes transferred. 
> +*/ > + void(*cqe_recovery_finish)(struct mmc_host *host); > +}; > + > struct mmc_async_req { > /* active mmc request */ > struct mmc_request *mrq; > @@ -303,6 +347,8 @@ struct mmc_host { > #define MMC_CAP2_HS400_ES (1 << 20) /* Host supports enhanced > strobe */ > #define MMC_CAP2_NO_SD (1 << 21) /* Do not send SD commands > during initialization */ > #define MMC_CAP2_NO_MMC(1 << 22) /* Do not send (e)MMC > commands during initialization */ > +#define MMC_CAP2_CQE (1 << 23) /* Has eMMC command queue > engine */ > +#define MMC_CAP2_CQE_DCMD (1 << 24) /* CQE can issue a direct > command */ > > mmc_pm_flag_t pm_caps;/* supported pm features */ > > @@ -386,6 +432,13 @@ struct mmc_host { > int dsr_req;/* DSR value is valid */ > u32 dsr;/* optional driver stage (DSR) value > */ > > + /* Command Queue Engine (CQE) support */ > + const struct mmc_cqe_ops *cqe_ops; > + void*cqe_private; > + int cqe_qdepth; > + boolcqe_enabled; > + boolcqe_on; > + > unsigned long private[0] cacheline_aligned; > }; > > -- > 1.9.1 >
Re: [PATCH V6 01/12] mmc: core: Move mmc_start_areq() declaration
On 25 August 2017 at 14:43, Adrian Hunter wrote: > mmc_start_areq() is an internal mmc core API. Move the declaration > accordingly. > > Signed-off-by: Adrian Hunter Thanks, applied for next! Kind regards Uffe > --- > drivers/mmc/core/core.h | 6 ++ > include/linux/mmc/core.h | 4 > 2 files changed, 6 insertions(+), 4 deletions(-) > > diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h > index 55f543fd37c4..ca861091a776 100644 > --- a/drivers/mmc/core/core.h > +++ b/drivers/mmc/core/core.h > @@ -107,6 +107,12 @@ static inline void mmc_unregister_pm_notifier(struct > mmc_host *host) { } > void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq); > bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq); > > +struct mmc_async_req; > + > +struct mmc_async_req *mmc_start_areq(struct mmc_host *host, > +struct mmc_async_req *areq, > +enum mmc_blk_status *ret_stat); > + > int mmc_erase(struct mmc_card *card, unsigned int from, unsigned int nr, > unsigned int arg); > int mmc_can_erase(struct mmc_card *card); > diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h > index bf1788a224e6..178f699ac172 100644 > --- a/include/linux/mmc/core.h > +++ b/include/linux/mmc/core.h > @@ -165,11 +165,7 @@ struct mmc_request { > }; > > struct mmc_card; > -struct mmc_async_req; > > -struct mmc_async_req *mmc_start_areq(struct mmc_host *host, > - struct mmc_async_req *areq, > - enum mmc_blk_status *ret_stat); > void mmc_wait_for_req(struct mmc_host *host, struct mmc_request *mrq); > int mmc_wait_for_cmd(struct mmc_host *host, struct mmc_command *cmd, > int retries); > -- > 1.9.1 >
Re: [PATCH V6 02/12] mmc: block: Fix block status codes
On 25 August 2017 at 14:43, Adrian Hunter wrote: > Commit 2a842acab109 ("block: introduce new block status code type") changed > the error type but not in patches merged through the mmc tree, like > commit 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver > op"). Fix one error code that is incorrect and also use BLK_STS_OK in > preference to 0. > > Fixes: 17ece345a042 ("Merge tag 'mmc-v4.13' of > git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc") > Signed-off-by: Adrian Hunter Thanks, applied for fixes! Kind regards Uffe > --- > drivers/mmc/core/block.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c > index 0eebc2f726c3..2f00d11adfa9 100644 > --- a/drivers/mmc/core/block.c > +++ b/drivers/mmc/core/block.c > @@ -1223,7 +1223,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, > struct request *req) > break; > } > mq_rq->drv_op_result = ret; > - blk_end_request_all(req, ret); > + blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK); > } > > static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request > *req) > @@ -1728,9 +1728,9 @@ static bool mmc_blk_rw_cmd_err(struct mmc_blk_data *md, > struct mmc_card *card, > if (err) > req_pending = old_req_pending; > else > - req_pending = blk_end_request(req, 0, blocks << 9); > + req_pending = blk_end_request(req, BLK_STS_OK, blocks > << 9); > } else { > - req_pending = blk_end_request(req, 0, brq->data.bytes_xfered); > + req_pending = blk_end_request(req, BLK_STS_OK, > brq->data.bytes_xfered); > } > return req_pending; > } > -- > 1.9.1 >
Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
Hello Peter,

On (08/30/17 10:47), Peter Zijlstra wrote:
[..]
> On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
> >
> > So the overhead looks to be spread out over all sorts, which makes it
> > harder to find and fix.
> >
> > stack unwinding is done lots and is fairly expensive, I've not yet
> > checked if crossrelease does too much of that.
>
> Aah, we do an unconditional stack unwind for every __lock_acquire() now.
> It keeps a trace in the xhlocks[].
>
> Does the below cure most of that overhead?

thanks. mostly yes. the kernel is not so dramatically slower now. it still seems to be a bit slower, which is expected I suppose, but overall it's much better

	real	1m35.209s
	user	4m12.467s
	sys	0m49.457s

// approx 10 seconds slower.

	-ss
Re: I/O hangs after resuming from suspend-to-ram
Hi.

So, current summary:

1) first patch + debug patch: I can reproduce the issue in the wild, but not with pm_test set;
2) first patch + debug patch + second patch: I cannot reproduce the issue at all, neither with "none", nor with "mq-deadline".

Thus, "blk-mq: align to legacy path for implementing blk_execute_rq" + "blk-mq: add requests in the tail of hctx->dispatch" look like a proper fix. Thanks for that!

> I setup dm-crypt on /dev/md0 directly, could you show me how to setup
> lvm on raid10?

If still relevant, here is a layout:

===
sda               8:0    0    8G  0 disk
|-sda1            8:1    0  128M  0 part   /boot/EFI
`-sda2            8:2    0  7.9G  0 part
  `-md0           9:0    0  7.9G  0 raid10
    `-system-root 253:0  0  7.9G  0 lvm    /
sdb               8:16   0    8G  0 disk
|-sdb1            8:17   0  128M  0 part
`-sdb2            8:18   0  7.9G  0 part
  `-md0           9:0    0  7.9G  0 raid10
    `-system-root 253:0  0  7.9G  0 lvm    /
===

Anything else you'd like /me to test?

30.08.2017 10:06, Ming Lei написав:
…SNIP…
> Please try the following patch and previous patch together and see if
> they make a difference:
…SNIP…
RE: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
> -Original Message- > From: Peter Zijlstra [mailto:pet...@infradead.org] > Sent: Wednesday, August 30, 2017 5:48 PM > To: Sergey Senozhatsky > Cc: Byungchul Park; Bart Van Assche; linux-ker...@vger.kernel.org; linux- > bl...@vger.kernel.org; martin.peter...@oracle.com; ax...@kernel.dk; linux- > s...@vger.kernel.org; s...@canb.auug.org.au; linux-n...@vger.kernel.org; > kernel-t...@lge.com > Subject: Re: possible circular locking dependency detected [was: linux- > next: Tree for Aug 22] > > On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote: > > > > So the overhead looks to be spread out over all sorts, which makes it > > harder to find and fix. > > > > stack unwinding is done lots and is fairly expensive, I've not yet > > checked if crossrelease does too much of that. > > Aah, we do an unconditional stack unwind for every __lock_acquire() now. > It keeps a trace in the xhlocks[]. Yeah.. I also think this is most significant.. > > Does the below cure most of that overhead? > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 44c8d0d17170..7b872036b72e 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -4872,7 +4872,7 @@ static void add_xhlock(struct held_lock *hlock) > xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES; > xhlock->trace.entries = xhlock->trace_entries; > xhlock->trace.skip = 3; > - save_stack_trace(&xhlock->trace); > + /* save_stack_trace(&xhlock->trace); */ > } > > static inline int same_context_xhlock(struct hist_lock *xhlock)
Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
On Wed, Aug 30, 2017 at 10:42:07AM +0200, Peter Zijlstra wrote:
>
> So the overhead looks to be spread out over all sorts, which makes it
> harder to find and fix.
>
> stack unwinding is done lots and is fairly expensive, I've not yet
> checked if crossrelease does too much of that.

Aah, we do an unconditional stack unwind for every __lock_acquire() now. It keeps a trace in the xhlocks[].

Does the below cure most of that overhead?

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 44c8d0d17170..7b872036b72e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4872,7 +4872,7 @@ static void add_xhlock(struct held_lock *hlock)
 	xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
 	xhlock->trace.entries = xhlock->trace_entries;
 	xhlock->trace.skip = 3;
-	save_stack_trace(&xhlock->trace);
+	/* save_stack_trace(&xhlock->trace); */
 }
 
 static inline int same_context_xhlock(struct hist_lock *xhlock)
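The cost being removed is easy to model: every __lock_acquire() appends an entry to a cross-release history ring, and the per-entry stack capture dominates. In the toy below (our names, not the lockdep internals), `unwinds` stands in for calls to save_stack_trace(); skipping the capture, as Peter's patch does, keeps the history ring but drops the expensive part:

```c
#include <assert.h>

#define HIST 8

struct xhlock { int lock_id; int have_trace; };

static struct xhlock xhlocks[HIST];
static int xh_idx;
static int unwinds;   /* proxy for the cost of save_stack_trace() */

/* one history record per lock acquire; the trace capture is optional */
static void add_xhlock(int lock_id, int capture_trace)
{
    struct xhlock *xh = &xhlocks[xh_idx++ % HIST];
    xh->lock_id = lock_id;
    xh->have_trace = capture_trace;
    if (capture_trace)
        unwinds++;    /* the unconditional unwind being commented out */
}
```

The trade-off Sergey's numbers confirm: with the capture on, the unwind cost scales with every acquire in the system; with it off, history entries still record which locks were taken, just without stack traces attached.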
Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
On Wed, Aug 30, 2017 at 03:15:11PM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> On (08/30/17 14:43), Byungchul Park wrote:
> [..]
> > > notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
> > > is irritatingly slow)
> >
> > To Ingo,
> >
> > I cannot decide if we have to roll back CONFIG_LOCKDEP_CROSSRELEASE
> > dependency on CONFIG_PROVE_LOCKING in Kconfig. With them enabled,
> > lockdep detection becomes strong but has performance impact. But,
> > it's anyway a debug option so IMHO we don't have to take care of the
> > performance impact. Please let me know your decision.
>
> well, I expected it :)
>
> I've been running lockdep enabled kernels for years, and was OK with
> the performance. but now it's just too much and I'm looking at disabling
> lockdep.
>
> a more relevant test -- compilation of a relatively small project
>
> LOCKDEP -CROSSRELEASE -COMPLETIONS    LOCKDEP +CROSSRELEASE +COMPLETIONS
>
> 	real	1m23.722s             real	2m9.969s
> 	user	4m11.300s             user	4m15.458s
> 	sys	0m49.386s             sys	2m3.594s
>
> you don't want to know how much time now it takes to recompile the
> kernel ;)

Right,.. so when I look at perf annotate for __lock_acquire and lock_release (the two most expensive lockdep functions in a kernel profile) I don't actually see much cross-release stuff.

So the overhead looks to be spread out over all sorts, which makes it harder to find and fix.

stack unwinding is done lots and is fairly expensive, I've not yet checked if crossrelease does too much of that.

The below saved about 50% of my __lock_acquire() time, not sure it made a significant difference overall though.

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 44c8d0d17170..f8db1ead1c48 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3386,7 +3386,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 		if (!class)
 			return 0;
 	}
-	atomic_inc((atomic_t *)&class->ops);
+	/* atomic_inc((atomic_t *)&class->ops); */
 	if (very_verbose(class)) {
 		printk("\nacquire class [%p] %s", class->key, class->name);
 		if (class->name_version > 1)
Re: I/O hangs after resuming from suspend-to-ram
Hi,

On Wed, Aug 30, 2017 at 08:15:02AM +0200, oleksa...@natalenko.name wrote:
> Hello.
>
> Addressing your questions below.
>
> > Can't reproduce even with putting dm-crypt on raid10 after applying
> > my patch.
>
> Just a side note, that dm-crypt is not necessary here — I am able to
> trigger the hang with RAID10 and LVM only.
>
> > BTW, could you share us which blk-mq scheduler you are using on sata?
>
> Okay, now it makes sense. Previously, without your patch I was able to
> reproduce the issue with "mq-deadline", "bfq" and "none". Now, I cannot
> trigger it with "none" set, but "mq-deadline" still hangs for me. Does
> this mean each scheduler should be modified separately?

No, it shouldn't.

> > Could you apply the following debug patch and provide the dmesg log
> > after running the commands below?
>
> Is it still relevant since I confirm the issue to be fixed with "none"?

Please try the following patch and the previous patch together and see
if they make a difference:

From bc5626b4b65c7ff26567e75f42584c2c43ffe7c1 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Wed, 30 Aug 2017 15:53:01 +0800
Subject: [PATCH] blk-mq: add requests in the tail of hctx->dispatch

So that we can allow inserting requests at the head of hctx->dispatch.

Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 2 +-
 block/blk-mq.c       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a026fb47..b40dd063d61f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -272,7 +272,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	 * the dispatch list.
 	 */
 	spin_lock(&hctx->lock);
-	list_add(&rq->queuelist, &hctx->dispatch);
+	list_add_tail(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 	return true;
 }

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..fed3d0c16266 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1067,7 +1067,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		blk_mq_put_driver_tag(rq);

 		spin_lock(&hctx->lock);
-		list_splice_init(list, &hctx->dispatch);
+		list_splice_tail_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);

 		/*
-- 
2.9.5

-- 
Ming
[PATCH] partitions/ibm: call ->getgeo directly
Trying to get rid of as many ioctl_by_bdev callers as possible to
reduce set_fs instances.

Signed-off-by: Christoph Hellwig
---
 block/partitions/ibm.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/block/partitions/ibm.c b/block/partitions/ibm.c
index 14b081af8d61..3ef8bfa3db9d 100644
--- a/block/partitions/ibm.c
+++ b/block/partitions/ibm.c
@@ -282,7 +282,6 @@ static int find_cms1_partitions(struct parsed_partitions *state,
 	return 1;
 }
 
-
 /*
  * This is the main function, called by check.c
  */
@@ -292,7 +291,7 @@ int ibm_partition(struct parsed_partitions *state)
 	int blocksize, res;
 	loff_t i_size, offset, size;
 	dasd_information2_t *info;
-	struct hd_geometry *geo;
+	struct hd_geometry geo = { 0, };
 	char type[5] = {0,};
 	char name[7] = {0,};
 	sector_t labelsect;
@@ -308,30 +307,30 @@ int ibm_partition(struct parsed_partitions *state)
 	info = kmalloc(sizeof(dasd_information2_t), GFP_KERNEL);
 	if (info == NULL)
 		goto out_exit;
-	geo = kmalloc(sizeof(struct hd_geometry), GFP_KERNEL);
-	if (geo == NULL)
-		goto out_nogeo;
 	label = kmalloc(sizeof(union label_t), GFP_KERNEL);
 	if (label == NULL)
 		goto out_nolab;
 
-	if (ioctl_by_bdev(bdev, HDIO_GETGEO, (unsigned long)geo) != 0)
+	if (!bdev->bd_disk->fops->getgeo)
+		goto out_freeall;
+	geo.start = get_start_sect(bdev);
+	if (bdev->bd_disk->fops->getgeo(bdev, &geo) != 0)
 		goto out_freeall;
 	if (ioctl_by_bdev(bdev, BIODASDINFO2, (unsigned long)info) != 0) {
 		kfree(info);
 		info = NULL;
 	}
 
-	if (find_label(state, info, geo, blocksize, &labelsect, name, type,
+	if (find_label(state, info, &geo, blocksize, &labelsect, name, type,
 		       label)) {
 		if (!strncmp(type, "VOL1", 4)) {
-			res = find_vol1_partitions(state, geo, blocksize, name,
+			res = find_vol1_partitions(state, &geo, blocksize, name,
 						   label);
 		} else if (!strncmp(type, "LNX1", 4)) {
-			res = find_lnx1_partitions(state, geo, blocksize, name,
+			res = find_lnx1_partitions(state, &geo, blocksize, name,
 						   label, labelsect, i_size,
 						   info);
 		} else if (!strncmp(type, "CMS1", 4)) {
-			res = find_cms1_partitions(state, geo, blocksize, name,
+			res = find_cms1_partitions(state, &geo, blocksize, name,
 						   label, labelsect);
 		}
 	} else if (info) {
@@ -356,8 +355,6 @@ int ibm_partition(struct parsed_partitions *state)
 out_freeall:
 	kfree(label);
 out_nolab:
-	kfree(geo);
-out_nogeo:
 	kfree(info);
 out_exit:
 	return res;
-- 
2.11.0
Re: [PATCH V6 00/12] mmc: Add Command Queue support
On 25/08/17 15:43, Adrian Hunter wrote:
> Here is V6 of the hardware command queue patches without the software
> command queue patches. Patches "mmc: host: Add CQE interface" and
> "mmc: core: Turn off CQE before sending commands" have been applied to
> Ulf's next branch. "mmc: host: Add CQE interface" needs to be dropped
> in favour of the new version.

Any comments?