O scheduler as an extra scheduler

Paolo Valente Tue, 14 Mar 2017 07:19:27 -0700

> Il giorno 06 mar 2017, alle ore 20:40, Bart Van Assche 
> <[email protected]> ha scritto:
>


Hi Bart,
thanks for such an accurate review.  I'm addressing the issues you
raised, and I'll get back in touch as soon as I have finished.

Paolo

> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> BFQ is a proportional-share I/O scheduler, whose general structure,
>> plus a lot of code, are borrowed from CFQ.
>> [ ... ]
> 
> This description is very useful. However, since it is identical to the
> description this patch adds to Documentation/block/bfq-iosched.txt I
> propose to leave it out from the patch description.
> 
> What seems missing to me is an overview of the limitations of BFQ. Does
> BFQ e.g. support multiple hardware queues?
> 
>> +3. What are BFQ's tunable?
>> +==========================
>> +[ ... ]
> 
> A thorough knowledge of BFQ is required to tune it properly. Users don't
> want to tune I/O schedulers. Has it been considered to invent algorithms
> to tune these parameters automatically?
> 
>> + * Licensed under GPL-2.
> 
> The COPYING file at the top of the tree mentions that GPL-v2 licensing
> should be specified as follows close to the start of each source file:
> 
>    This program is free software; you can redistribute it and/or modify
>    it under the terms of the GNU General Public License as published by
>    the Free Software Foundation; either version 2 of the License, or
>    (at your option) any later version.
> 
>    This program is distributed in the hope that it will be useful,
>    but WITHOUT ANY WARRANTY; without even the implied warranty of
>    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>    GNU General Public License for more details.
> 
>    You should have received a copy of the GNU General Public License
>    along with this program; if not, write to the Free Software
>    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
>    02110-1301 USA
> 
>> + * BFQ is a proportional-share I/O scheduler, with some extra
>> + * low-latency capabilities. BFQ also supports full hierarchical
>> + * scheduling through cgroups. Next paragraphs provide an introduction
>> + * on BFQ inner workings. Details on BFQ benefits and usage can be
>> + * found in Documentation/block/bfq-iosched.txt.
> 
> That reference should be sufficient - please do not duplicate
> Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.
> 
>> +/**
>> + * struct bfq_service_tree - per ioprio_class service tree.
>> + *
>> + * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
>> + * ioprio_class has its own independent scheduler, and so its own
>> + * bfq_service_tree.  All the fields are protected by the queue lock
>> + * of the containing bfqd.
>> + */
>> +struct bfq_service_tree {
>> +    /* tree for active entities (i.e., those backlogged) */
>> +    struct rb_root active;
>> +    /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
>> +    struct rb_root idle;
>> +
>> +    struct bfq_entity *first_idle;  /* idle entity with minimum F_i */
>> +    struct bfq_entity *last_idle;   /* idle entity with maximum F_i */
>> +
>> +    u64 vtime; /* scheduler virtual time */
>> +    /* scheduler weight sum; active and idle entities contribute to it */
>> +    unsigned long wsum;
>> +};
> 
> Inline comments next to structure members are ugly and make the
> structure definition hard to read. Please follow the instructions in
> Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.
> 
>> +    u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
>> +    u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */
> 
> For all times and timestamps, please document the time unit (e.g. s, ms,
> us, ns, jiffies, ...).
> 
>> +enum bfq_device_speed {
>> +    BFQ_BFQD_FAST,
>> +    BFQ_BFQD_SLOW,
>> +};
> 
> What is the meaning of "fast" and "slow" devices in this context?
> Anyway, since the first patch that uses this enum is patch 6, please
> defer introduction of this enum until patch 6.
> 
>> +
>> +/**
>> + * struct bfq_data - per-device data structure.
>> + *
>> + * All the fields are protected by @lock.
>> + */
>> +struct bfq_data {
>> +    /* device request queue */
>> +    struct request_queue *queue;
>> +    [ ... ]
>> +
>> +    /* on-disk position of the last served request */
>> +    sector_t last_position;
> 
> What is the relevance of last_position if there are multiple hardware
> queues? Will the BFQ algorithm fail to realize its guarantees in that case?
> 
> What is the relevance of this structure member for block devices that
> have multiple spindles, e.g. arrays of hard disks?
> 
>> +enum bfqq_state_flags {
>> +    BFQ_BFQQ_FLAG_busy = 0,         /* has requests or is in service */
>> +    BFQ_BFQQ_FLAG_wait_request,     /* waiting for a request */
>> +    BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
>> +                                         * waiting for a request
>> +                                         * without idling the device
>> +                                         */
>> +    BFQ_BFQQ_FLAG_fifo_expire,      /* FIFO checked in this slice */
>> +    BFQ_BFQQ_FLAG_idle_window,      /* slice idling enabled */
>> +    BFQ_BFQQ_FLAG_sync,             /* synchronous queue */
>> +    BFQ_BFQQ_FLAG_budget_new,       /* no completion with this budget */
>> +    BFQ_BFQQ_FLAG_IO_bound,         /*
>> +                                     * bfqq has timed-out at least once
>> +                                     * having consumed at most 2/10 of
>> +                                     * its budget
>> +                                     */
>> +};
> 
> The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about
> e.g. using the prefix "BFQQF_" instead?
> 
>> +#define BFQ_BFQQ_FNS(name)                                          \
>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)            \
>> +{                                                                   \
>> +    (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);                   \
>> +}                                                                   \
>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)           \
>> +{                                                                   \
>> +    (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);                  \
>> +}                                                                   \
>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq)            \
>> +{                                                                   \
>> +    return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;      \
>> +}
> 
> Are the bodies of the above functions duplicates of __set_bit(),
> __clear_bit() and test_bit()?
> 
>> +/* Expiration reasons. */
>> +enum bfqq_expiration {
>> +    BFQ_BFQQ_TOO_IDLE = 0,          /*
>> +                                     * queue has been idling for
>> +                                     * too long
>> +                                     */
>> +    BFQ_BFQQ_BUDGET_TIMEOUT,        /* budget took too long to be used */
>> +    BFQ_BFQQ_BUDGET_EXHAUSTED,      /* budget consumed */
>> +    BFQ_BFQQ_NO_MORE_REQUESTS,      /* the queue has no more requests */
>> +    BFQ_BFQQ_PREEMPTED              /* preemption in progress */
>> +};
> 
> The prefix of these constants refers twice to "BFQ" and does not make it
> clear that these constants are about expiration. How about using the
> "BFQQE_" prefix instead?
> 
>> +/* Maximum backwards seek, in KiB. */
>> +static const int bfq_back_max = 16 * 1024;
> 
> Where does this constant come from? Should it depend on geometry data
> like e.g. the number of sectors in a cylinder?
> 
>> +#define for_each_entity(entity)     \
>> +    for (; entity ; entity = NULL)
> 
> Why has this confusing #define been introduced? Shouldn't all
> occurrences of this macro be changed into the equivalent "if (entity)"?
> We don't want silly macros like this in the Linux kernel.
> 
>> +#define for_each_entity_safe(entity, parent) \
>> +    for (parent = NULL; entity ; entity = parent)
> 
> Same question here - why has this macro been introduced and how has its
> name been chosen? Since this macro is used only once and since no value
> is assigned to 'parent' in the code controlled by this construct, please
> remove this macro and use something that is less confusing than a "for"
> loop for something that is not a loop.
> 
>> +/**
>> + * bfq_weight_to_ioprio - calc an ioprio from a weight.
>> + * @weight: the weight value to convert.
>> + *
>> + * To preserve as much as possible the old only-ioprio user interface,
>> + * 0 is used as an escape ioprio value for weights (numerically) equal or
>> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
>> + */
>> +static unsigned short bfq_weight_to_ioprio(int weight)
>> +{
>> +    return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
>> +            0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
>> +}
> 
> Please consider using max() or max_t() to make this function less verbose.
> 
>> +
>> +/**
>> + * bfq_active_extract - remove an entity from the active tree.
>> + * @st: the service_tree containing the tree.
>> + * @entity: the entity being removed.
>> + */
>> +static void bfq_active_extract(struct bfq_service_tree *st,
>> +                           struct bfq_entity *entity)
>> +{
>> +    struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
>> +    struct rb_node *node;
>> +
>> +    node = bfq_find_deepest(&entity->rb_node);
>> +    bfq_extract(&st->active, entity);
>> +
>> +    if (node)
>> +            bfq_update_active_tree(node);
>> +
>> +    if (bfqq)
>> +            list_del(&bfqq->bfqq_list);
>> +}
> 
> Which locks protect the data structures manipulated by this and other
> functions? Have you considered to use lockdep_assert_held() to document
> these assumptions?
> 
>> +    case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
> 
> Please don't use parentheses if no confusion is possible. Additionally,
> checkpatch should have requested you to insert a space before and after
> the logical or operator.
> 
>> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
>> +                                   struct bfq_queue *bfqq)
>> +{
>> +    if (bfqq) {
>> +            bfq_mark_bfqq_budget_new(bfqq);
>> +            bfq_clear_bfqq_fifo_expire(bfqq);
>> +
>> +            bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
> 
> Checkpatch should have asked you to insert spaces around the
> multiplication operator.
> 
>> +/*
>> + * bfq_default_budget - return the default budget for @bfqq on @bfqd.
>> + * @bfqd: the device descriptor.
>> + * @bfqq: the queue to consider.
>> + *
>> + * We use 3/4 of the @bfqd maximum budget as the default value
>> + * for the max_budget field of the queues.  This lets the feedback
>> + * mechanism to start from some middle ground, then the behavior
>> + * of the process will drive the heuristics towards high values, if
>> + * it behaves as a greedy sequential reader, or towards small values
>> + * if it shows a more intermittent behavior.
>> + */
>> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
>> +                                    struct bfq_queue *bfqq)
>> +{
>> +    unsigned long budget;
>> +
>> +    /*
>> +     * When we need an estimate of the peak rate we need to avoid
>> +     * to give budgets that are too short due to previous measurements.
>> +     * So, in the first 10 assignments use a ``safe'' budget value.
>> +     */
>> +    if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
>> +            budget = bfq_default_max_budget;
>> +    else
>> +            budget = bfqd->bfq_max_budget;
>> +
>> +    return budget - budget / 4;
>> +}
> 
> Where does the magic constant "194" come from?
> 
> 
>> +    } else
>> +            /*
>> +             * Async queues get always the maximum possible
>> +             * budget, as for them we do not care about latency
>> +             * (in addition, their ability to dispatch is limited
>> +             * by the charging factor).
>> +             */
>> +            budget = bfqd->bfq_max_budget;
>> +
> 
> Please balance braces. Checkpatch should have warned about the use of "}
> else" instead of "} else {".
> 
>> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
>> +{
>> +    unsigned long max_budget;
>> +
>> +    /*
>> +     * The max_budget calculated when autotuning is equal to the
>> +     * amount of sectors transferred in timeout at the
>> +     * estimated peak rate.
>> +     */
>> +    max_budget = (unsigned long)(peak_rate * 1000 *
>> +                                 timeout >> BFQ_RATE_SHIFT);
>> +
>> +    return max_budget;
>> +}
> 
> Where does the constant 1000 come from? What are the units of peak_rate
> and timeout? What is the maximum value of peak_rate? Can the
> multiplication overflow?
> 
>> +/*
>> + * In addition to updating the peak rate, checks whether the process
>> + * is "slow", and returns 1 if so. This slow flag is used, in addition
>> + * to the budget timeout, to reduce the amount of service provided to
>> + * seeky processes, and hence reduce their chances to lower the
>> + * throughput. See the code for more details.
>> + */
>> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue 
>> *bfqq,
>> +                             bool compensate)
>> +{
>> +    u64 bw, usecs, expected, timeout;
>> +    ktime_t delta;
>> +    int update = 0;
>> +
>> +    if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> +            return false;
>> +
>> +    if (compensate)
>> +            delta = bfqd->last_idling_start;
>> +    else
>> +            delta = ktime_get();
>> +    delta = ktime_sub(delta, bfqd->last_budget_start);
>> +    usecs = ktime_to_us(delta);
>> +
>> +    /* Don't trust short/unrealistic values. */
>> +    if (usecs < 100 || usecs >= LONG_MAX)
>> +            return false;
> 
> If usecs >= LONG_MAX that indicates a kernel bug. Please consider
> triggering a kernel warning in that case.
> 
>> +/*
>> + * Budget timeout is not implemented through a dedicated timer, but
>> + * just checked on request arrivals and completions, as well as on
>> + * idle timer expirations.
>> + */
>> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
>> +{
>> +    if (bfq_bfqq_budget_new(bfqq) ||
>> +        time_before(jiffies, bfqq->budget_timeout))
>> +            return false;
>> +    return true;
>> +}
> 
> Have you considered to use time_is_after_jiffies() instead of
> time_before(jiffies, ...)?
> 
>> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +                      struct bfq_io_cq *bic, pid_t pid, int is_sync)
>> +{
>> +    RB_CLEAR_NODE(&bfqq->entity.rb_node);
>> +    INIT_LIST_HEAD(&bfqq->fifo);
>> +
>> +    bfqq->ref = 0;
>> +    bfqq->bfqd = bfqd;
>> +
>> +    if (bic)
>> +            bfq_set_next_ioprio_data(bfqq, bic);
>> +
>> +    if (is_sync) {
>> +            if (!bfq_class_idle(bfqq))
>> +                    bfq_mark_bfqq_idle_window(bfqq);
>> +            bfq_mark_bfqq_sync(bfqq);
>> +    } else
>> +            bfq_clear_bfqq_sync(bfqq);
>> +
>> +    bfqq->ttime.last_end_request = ktime_get_ns() - (1ULL<<32);
>> +
>> +    bfq_mark_bfqq_IO_bound(bfqq);
>> +
>> +    bfqq->pid = pid;
>> +
>> +    /* Tentative initial value to trade off between thr and lat */
>> +    bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
>> +    bfqq->budget_timeout = bfq_smallest_from_now();
>> +    bfqq->pid = pid;
>> +
>> +    /* first request is almost certainly seeky */
>> +    bfqq->seek_history = 1;
>> +}
> 
> What is the meaning of the 1ULL << 32 constant?
> 
>> +static int __init bfq_init(void)
>> +{
>> +    int ret;
>> +    char msg[50] = "BFQ I/O-scheduler: v0";
> 
> Please leave out "[50]" and use "static const char" instead of "char".
> 
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 01139f5..786fdcd 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *name)
>> 
>>      if (!e) {
>>              /*
>> -             * For blk-mq devices, we default to using mq-deadline,
>> -             * if available, for single queue devices. If deadline
>> -             * isn't available OR we have multiple queues, default
>> -             * to "none".
>> +             * For blk-mq devices, we default to using bfq, if
>> +             * available, for single queue devices. If bfq isn't
>> +             * available, we try mq-deadline. If neither is
>> +             * available, OR we have multiple queues, default to
>> +             * "none".
>>               */
>>              if (q->mq_ops) {
>> +                    if (q->nr_hw_queues == 1) {
>> +                            e = elevator_get("bfq", false);
>> +                            if (!e)
>> +                                    e = elevator_get("mq-deadline", false);
>> +                    }
>>                      if (q->nr_hw_queues == 1)
>> -                            e = elevator_get("mq-deadline", false);
>> +                            e = elevator_get("bfq", false);
>>                      if (!e)
>>                              return 0;
>>              } else
>> 
> 
> As Jens wrote, it's way too early to make BFQ the default scheduler.
> 
> Bart.

Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

Reply via email to