Re: [patch 1/3] mm: memcontrol: lockless page counters

Michal Hocko Tue, 30 Sep 2014 04:07:13 -0700

On Wed 24-09-14 11:43:08, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
> 
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it and remove the old one.  The translation
> from and to bytes then only happens when interfacing with userspace.
> 
> Aside from the locking costs, this gets rid of the icky unsigned long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.


Please describe the usual use pattern of the API. It is much easier to
read it here than pulling it out from the source.

Also I would expect some testing results. Especially on a larger machine
(I guess you rely on Dave here, right?).

Thanks for splitting the original patch and extracting the counter
to a separate file. I think we should add F: mm/page_counter.c under
MEMCG maintenance section.

More comments inline (I only got to page_counter for now and will check
the res_counter replacement in another go)

> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
>  Documentation/cgroups/memory.txt |   4 +-
>  include/linux/memcontrol.h       |   5 +-
>  include/linux/page_counter.h     |  49 +++
>  include/net/sock.h               |  26 +-
>  init/Kconfig                     |   5 +-
>  mm/Makefile                      |   1 +
>  mm/memcontrol.c                  | 635 
> ++++++++++++++++++---------------------
>  mm/page_counter.c                | 191 ++++++++++++
>  net/ipv4/tcp_memcontrol.c        |  87 +++---
>  9 files changed, 598 insertions(+), 405 deletions(-)
>  create mode 100644 include/linux/page_counter.h
>  create mode 100644 mm/page_counter.c
> 
[...]
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> new file mode 100644
> index 000000000000..d92d18949474
> --- /dev/null
> +++ b/include/linux/page_counter.h
> @@ -0,0 +1,49 @@
> +#ifndef _LINUX_PAGE_COUNTER_H
> +#define _LINUX_PAGE_COUNTER_H
> +
> +#include <linux/atomic.h>
> +
> +struct page_counter {
> +     atomic_long_t count;
> +     unsigned long limit;
> +     struct page_counter *parent;
> +
> +     /* legacy */
> +     unsigned long watermark;

The name suggest this is a restriction not a highest usage mark.
max_count would be less confusing.

> +     unsigned long failcnt;
> +};
> +
> +#if BITS_PER_LONG == 32
> +#define PAGE_COUNTER_MAX LONG_MAX
> +#else
> +#define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE)
> +#endif

It is not clear to me why you need a separate definitions here. LONG_MAX
seems to be good for both 32b and 64b.

[...]
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> new file mode 100644
> index 000000000000..51c45921b8d1
> --- /dev/null
> +++ b/mm/page_counter.c
> @@ -0,0 +1,191 @@
> +/*
> + * Lockless hierarchical page accounting & limiting
> + *
> + * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner
> + */
> +#include <linux/page_counter.h>
> +#include <linux/atomic.h>
> +
> +/**
> + * page_counter_cancel - take pages out of the local counter
> + * @counter: counter
> + * @nr_pages: number of pages to cancel
> + *
> + * Returns whether there are remaining pages in the counter.
> + */
> +int page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
> +{
> +     long new;
> +
> +     new = atomic_long_sub_return(nr_pages, &counter->count);
> +

This really deserves a comment IMO. Something like the following:
        /*
         * Charges and uncharges are always ordered properly from memory
         * ordering point of view. The only case where underflow can happen
         * is a mismatched uncharge. Report it and fix it up now rather
         * than blow up later.
         */
> +     if (WARN_ON_ONCE(new < 0))
> +             atomic_long_add(nr_pages, &counter->count);

anyway this doesn't look correct because you can have false positives:
[counter->count = 1]
        CPU0                                    CPU1
new = atomic_long_sub_return(THP)
                                        new = atomic_long_sub_return(1)
(new < 0)                               (new < 0)
  atomic_long_add(THP)                    atomic_add(1)

So we will end up with counter->count = 1 rather than 0. I think you
need to use a lock in the slow path. Something like

        if (new < 0) {
                unsigned long flags;

                /*
                 * Multiple uncharger might race together and we do not
                 * want to let any of them revert the uncharge just
                 * because a faulty uncharge and the fixup are not
                 * atomic.
                 */
                atomic_lond_add(nr_pages, &counter->count);

                spin_lock_irqsave(&counter->lock, flags);
                new = atomic_long_sub_return(nr_pages, &counter->count);
                if (WARN_ON(new < 0))
                        atomic_long_add(nr_pages, &counter->count);
                spin_unlock_irqrestore(&counter->lock, flags);
        }

> +
> +     return new > 0;
> +}
> +
> +/**
> + * page_counter_charge - hierarchically charge pages
> + * @counter: counter
> + * @nr_pages: number of pages to charge
> + *
> + * NOTE: This may exceed the configured counter limits.

The name is rather awkward. It sounds like a standard way to charge the
counter. I would rather stick to _nofail suffix and the following
addition to the doc.
"
Can be called only from contexts where the charge failure cannot be
handled. This should be rare and used with extreme caution.
"

> + */
> +void page_counter_charge(struct page_counter *counter, unsigned long 
> nr_pages)
> +{
> +     struct page_counter *c;
> +
> +     for (c = counter; c; c = c->parent) {
> +             long new;
> +
> +             new = atomic_long_add_return(nr_pages, &c->count);
> +             /*
> +              * This is racy, but with the per-cpu caches on top
> +              * it's just a ballpark metric anyway; and with lazy
> +              * cache reclaim, the majority of workloads peg the
> +              * watermark to the group limit soon after launch.
> +              */
> +             if (new > c->watermark)
> +                     c->watermark = new;
> +     }
> +}
> +
> +/**
> + * page_counter_try_charge - try to hierarchically charge pages
> + * @counter: counter
> + * @nr_pages: number of pages to charge
> + * @fail: points first counter to hit its limit, if any
> + *
> + * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
> + * its ancestors has hit its limit.
> + */
> +int page_counter_try_charge(struct page_counter *counter,
> +                         unsigned long nr_pages,
> +                         struct page_counter **fail)
> +{
> +     struct page_counter *c;
> +
> +     for (c = counter; c; c = c->parent) {
> +             long new;
> +             /*
> +              * Charge speculatively to avoid an expensive CAS.  If
> +              * a bigger charge fails, it might falsely lock out a
> +              * racing smaller charge and send it into reclaim
> +              * eraly, but the error is limited to the difference
> +              * between the two sizes, which is less than 2M/4M in
> +              * case of a THP locking out a regular page charge.
> +              */

If this ever turns out to be a problem then we can check the size of the
overflow and retry if it is > nr_online_cpus or something like that. Not
worth bothering now I guess but definitely good to have this documented.
I would even like to have it in the changelog for users bisecting an
excessive reclaim because it is easier to find that in the changelog
than in the code.

> +             new = atomic_long_add_return(nr_pages, &c->count);
> +             if (new > c->limit) {
> +                     atomic_long_sub(nr_pages, &c->count);
> +                     /*
> +                      * This is racy, but the failcnt is only a
> +                      * ballpark metric anyway.
> +                      */
> +                     c->failcnt++;
> +                     *fail = c;
> +                     goto failed;
> +             }
> +             /*
> +              * This is racy, but with the per-cpu caches on top
> +              * it's just a ballpark metric anyway; and with lazy
> +              * cache reclaim, the majority of workloads peg the
> +              * watermark to the group limit soon after launch.
> +              */
> +             if (new > c->watermark)
> +                     c->watermark = new;
> +     }
> +     return 0;
> +
> +failed:
> +     for (c = counter; c != *fail; c = c->parent)
> +             page_counter_cancel(c, nr_pages);
> +
> +     return -ENOMEM;
> +}
[...]
> +/**
> + * page_counter_limit - limit the number of pages allowed
> + * @counter: counter
> + * @limit: limit to set
> + *
> + * Returns 0 on success, -EBUSY if the current number of pages on the
> + * counter already exceeds the specified limit.
> + *
> + * The caller must serialize invocations on the same counter.
> + */
> +int page_counter_limit(struct page_counter *counter, unsigned long limit)
> +{
> +     for (;;) {
> +             unsigned long old;
> +             long count;
> +
> +             count = atomic_long_read(&counter->count);
> +
> +             old = xchg(&counter->limit, limit);
> +
> +             if (atomic_long_read(&counter->count) != count) {
> +                     counter->limit = old;
> +                     continue;
> +             }
> +
> +             if (count > limit) {
> +                     counter->limit = old;
> +                     return -EBUSY;
> +             }

Ordering doesn't make much sense to me here. Say you really want to set
limit < count. You are effectively pushing all concurrent charges to
the reclaim even though you would revert your change and return with
EBUSY later on.
Wouldn't (count > limit) check make more sense right after the first
atomic_long_read?
Also the second count check should be sufficient to check > count and
retry only when the count has increased.
Finally continuous flow of charges can keep this loop running for quite
some time and trigger lockup detector. cond_resched before continue
would handle that. Something like the following:

        for (;;) {
                unsigned long old;
                long count;

                count = atomic_long_read(&counter->count);
                if (count > limit)
                        return -EBUSY;

                old = xchg(&counter->limit, limit);

                /* Recheck for concurrent charges */
                if (atomic_long_read(&counter->count) > count) {
                        counter->limit = old;
                        cond_resched();
                        continue;
                }

                return 0;
        }

> +
> +             return 0;
> +     }
> +}
> +
> +/**
> + * page_counter_memparse - memparse() for page counter limits
> + * @buf: string to parse
> + * @nr_pages: returns the result in number of pages
> + *
> + * Returns -EINVAL, or 0 and @nr_pages on success.  @nr_pages will be
> + * limited to %PAGE_COUNTER_MAX.
> + */
> +int page_counter_memparse(const char *buf, unsigned long *nr_pages)
> +{
> +     char unlimited[] = "-1";
> +     char *end;
> +     u64 bytes;
> +
> +     if (!strncmp(buf, unlimited, sizeof(unlimited))) {
> +             *nr_pages = PAGE_COUNTER_MAX;
> +             return 0;
> +     }
> +
> +     bytes = memparse(buf, &end);
> +     if (*end != '\0')
> +             return -EINVAL;

res_counter used to round up to the next page boundary and there is no
reason to not do the same here. 

        bytes = PAGE_ALIGN(bytes);

> +
> +     *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX);
> +
> +     return 0;
> +}
[...]
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/3] mm: memcontrol: lockless page counters

Reply via email to