Re: [PATCH 4/4] mm/page_reporting: Fix possible user allocation failure
On 4/3/21 3:55 AM, Alexander Duyck wrote:
> On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang wrote:
>>
>> We encountered user memory allocation failures (OOM) on our
>> 512MiB tiny instances; it didn't happen after turning off
>> the page reporting.
>>
>> After some debugging, it turns out 32*4MB=128MB (order-10)
>> free pages were isolated during the reporting window, resulting
>> in no free memory being available.
>>
>> Actually this might also happen on large instances when
>> there is little free memory.
>>
>> This patch introduces a rule to limit reporting capacity
>> according to current free memory, and reduces it accordingly
>> for the higher orders which could break this rule.
>>
>> For example, with 100MiB free, the sgl capacity for different
>> orders is:
>>   order-9 : 32
>>   order-10: 16
>>
>> Reported-by: Helin Guo
>> Tested-by: Helin Guo
>> Signed-off-by: Xunlei Pang
>
> I'm curious how much of this would be solved by just making it so that
> we reduce the capacity by half if we increase the order? So
> specifically if we did something such as:
>   capacity = (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER) >> order;
>
> We just have to make sure the capacity is greater than zero before
> entering the processing loop.
>
> An alternative that occurred to me while I reviewed this is to look at
> just adding a reserve. That would be something like:
>   reserve = PAGE_REPORTING_CAPACITY - capacity;
>
> Basically the reserve would take up some space at the start of the
> list so that you wouldn't need to actually change the capacity
> directly. It would just be a matter of making certain we deducted it
> and updated the offsets of the scatterlist as necessary.
>
>> ---
>>  mm/page_reporting.c | 89 ++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 72 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index 6ffedb8..2ec0ec0 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
>> @@ -129,8 +129,8 @@ void __page_reporting_notify(void)
>>   */
>>  static int
>>  page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
>> -		     unsigned int order, unsigned int mt,
>> -		     struct scatterlist *sgl, unsigned int *offset)
>> +		     unsigned int order, unsigned int mt, struct scatterlist *sgl,
>> +		     const unsigned int capacity, unsigned int *offset)
>>  {
>>  	struct free_area *area = &zone->free_area[order];
>>  	struct list_head *list = &area->free_list[mt];
>> @@ -161,10 +161,10 @@ void __page_reporting_notify(void)
>>  	 * list processed. This should result in us reporting all pages on
>>  	 * an idle system in about 30 seconds.
>>  	 *
>> -	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
>> -	 * should always be a power of 2.
>> +	 * The division here should be cheap since capacity should
>> +	 * always be a power of 2.
>>  	 */
>> -	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
>> +	budget = DIV_ROUND_UP(area->nr_free, capacity * 16);
>
> So the comment here is no longer valid once capacity became a
> variable. An alternative to look at, if we were to assume the shift
> approach I mentioned above, would be to then shift the budget based on
> the reduced capacity.
>
>>  	/* loop through free list adding unreported pages to sg list */
>>  	list_for_each_entry_safe(page, next, list, lru) {
>> @@ -196,7 +196,7 @@ void __page_reporting_notify(void)
>>  			--(*offset);
>>  			sg_set_page(&sgl[*offset], page, page_len, 0);
>>
>> -			nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
>> +			nr_pages = (capacity - *offset) << order;
>>  			if (zone->reported_pages + nr_pages >= threshold) {
>>  				err = 1;
>>  				break;
>
> Rather than adding a capacity value it might work better to add a
> "reserve" value so that we are just padding the start of the
> scatterlist rather than having to reset it every time we change the
> total capacity of the scatterlist. The advantage to that is that you
> could drop all the changes where you are having to reset the list and
> change the capacity.
>
> Instead you would just need to up
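The capacity-halving expression quoted in the review can be checked with a small userspace sketch; the macro values mirror the kernel's (32-entry scatterlist, minimum reporting order 9), and the function name `sg_capacity` is made up for illustration:

```c
#define PAGE_REPORTING_CAPACITY  32
#define PAGE_REPORTING_MIN_ORDER 9

/* Halve the scatterlist capacity each time the order rises by one,
 * exactly the shift expression suggested in the review above. */
static unsigned int sg_capacity(unsigned int order)
{
	return (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER) >> order;
}
```

For order 9 this yields 32 and for order 10 it yields 16, matching the "order-9: 32, order-10: 16" example from the changelog; as the review notes, a caller must still check the result is non-zero before entering the processing loop.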
Re: [PATCH 2/4] mm/page_reporting: Introduce free page reporting factor
On 4/3/21 2:56 AM, Alexander Duyck wrote:
> On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang wrote:
>>
>> Add a new "/sys/kernel/mm/page_reporting/reporting_factor"
>> within [0, 100], and stop page reporting when it reaches
>> the configured threshold. The default is 100, which means no
>> limitation is imposed. A percentile is adopted to reflect
>> the fact that reporting works on a per-zone basis.
>>
>> We can control the total number of reported pages via
>> this knob to avoid EPT violations which may affect the
>> performance of the business; imagine how much a guest memory
>> allocation burst or host long-tail memory reclaim can hurt.
>
> I'm not a fan of the concept, as I don't think it really does what it
> was meant to do. The way page reporting was meant to work is that when
> we have enough free pages we will cycle through memory a few pages at
> a time, reporting what is unused to the hypervisor. It was meant to be
> a scan more than something that would just stop once it touched a
> certain part of the memory.
>
> If you are wanting to truly reserve some amount of memory so that it
> is always left held by the guest, then it might make more sense to
> make the value a fixed amount of memory rather than trying to do it
> as a percentage.
>
> Also we may need to look at adding some sort of
> linearization/defragmentation logic for the reported pages. One issue
> is that there are several things that will add pages to the end of the
> free page lists. One of the reasons why I was processing the entire
> list when I was processing reported pages was because the page freeing
> functions will normally cause pages to be interleaved with the
> reported pages on the end of the list. So if you are wanting to
> reserve some pages as being non-reported, we may need to add something
> to sort the lists periodically.

Yes, agreed. I also noticed this problem while trying to make the
counter accurate, and I'm going to figure out a way to handle it. For
example, adding a new migratetype for the reported free_list may be a
good choice; this also helps reduce the zone lock latency during the
report procedure on a large VM, which can be in the milliseconds.

>
>> This knob can help make customized control policies according
>> to VM priority; it is also useful for testing, gray-release, etc.
>
> As far as the knob itself, it would make sense to combine this with
> patch 3 since they are just different versions of the same control.
>
>> ---
>>  mm/page_reporting.c | 60 +++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index ba195ea..86c6479 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
>> @@ -11,6 +11,8 @@
>>  #include "page_reporting.h"
>>  #include "internal.h"
>>
>> +static int reporting_factor = 100;
>> +
>>  #define PAGE_REPORTING_DELAY	(2 * HZ)
>>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>>
>> @@ -134,6 +136,7 @@ void __page_reporting_notify(void)
>>  	struct list_head *list = &area->free_list[mt];
>>  	unsigned int page_len = PAGE_SIZE << order;
>>  	struct page *page, *next;
>> +	unsigned long threshold;
>>  	long budget;
>>  	int err = 0;
>>
>> @@ -144,6 +147,7 @@ void __page_reporting_notify(void)
>>  	if (list_empty(list))
>>  		return err;
>>
>> +	threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
>
> So at 0 you are setting this threshold to 0, however based on the code
> below you are still pulling at least one page.
>
>>  	spin_lock_irq(&zone->lock);
>>
>>  	/*
>> @@ -181,6 +185,8 @@ void __page_reporting_notify(void)
>>
>>  		/* Attempt to pull page from list and place in scatterlist */
>>  		if (*offset) {
>> +			unsigned long nr_pages;
>> +
>>  			if (!__isolate_free_page(page, order)) {
>>  				next = page;
>>  				break;
>> @@ -190,6 +196,12 @@ void __page_reporting_notify(void)
>>  			--(*offset);
>>  			sg_set_page(&sgl[*offset], page, page_len, 0);
>>
>> +			nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
>> +			if (zone->reported_pages + nr_pages >= threshold) {
>> +				err = 1;
>> +
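The threshold arithmetic being reviewed is simple enough to model in userspace; this sketch (function name is illustrative, not from the patch) shows the behavior Alexander points out — with a factor of 0 the threshold is 0, yet the cycle as posted still isolates one page before the comparison fires:

```c
/* Model of the per-zone reporting threshold: a percentage of the
 * zone's managed pages, as computed in the quoted hunk above. */
static unsigned long report_threshold(unsigned long managed_pages,
				      unsigned int factor)
{
	return managed_pages * factor / 100;
}
```

With 512MiB of 4KiB pages (131072 pages) and factor 50, the threshold is 65536 pages; with factor 0 it is 0, so the `>=` check in the loop only trips after the first page has already been pulled.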
Re: [PATCH 0/4] mm/page_reporting: Some knobs and fixes
On 3/26/21 5:44 PM, Xunlei Pang wrote:
> Add the following knobs in PATCH 1~3:
>  /sys/kernel/mm/page_reporting/reported_kbytes
>  /sys/kernel/mm/page_reporting/refault_kbytes
>  /sys/kernel/mm/page_reporting/reporting_factor
>
> Fix unexpected user OOM in PATCH 4.
>
> Xunlei Pang (4):
>   mm/page_reporting: Introduce free page reported counters
>   mm/page_reporting: Introduce free page reporting factor
>   mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
>   mm/page_reporting: Fix possible user allocation failure
>
>  Documentation/admin-guide/kernel-parameters.txt |   3 +
>  include/linux/mmzone.h                          |   3 +
>  mm/page_alloc.c                                 |   6 +-
>  mm/page_reporting.c                             | 268 ++++++++++++++--
>  4 files changed, 260 insertions(+), 20 deletions(-)

Hi guys,

It looks like "Alexander Duyck" is not available, so I Cc'ed more
people. Any comments? Thanks!
[PATCH 4/4] mm/page_reporting: Fix possible user allocation failure
We encountered user memory allocation failures (OOM) on our 512MiB tiny
instances; it didn't happen after turning off the page reporting.

After some debugging, it turns out 32*4MB=128MB (order-10) free pages
were isolated during the reporting window, resulting in no free memory
being available.

Actually this might also happen on large instances when there is little
free memory.

This patch introduces a rule to limit reporting capacity according to
current free memory, and reduces it accordingly for the higher orders
which could break this rule.

For example, with 100MiB free, the sgl capacity for different orders is:
  order-9 : 32
  order-10: 16

Reported-by: Helin Guo
Tested-by: Helin Guo
Signed-off-by: Xunlei Pang
---
 mm/page_reporting.c | 89 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 17 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 6ffedb8..2ec0ec0 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -129,8 +129,8 @@ void __page_reporting_notify(void)
  */
 static int
 page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
-		     unsigned int order, unsigned int mt,
-		     struct scatterlist *sgl, unsigned int *offset)
+		     unsigned int order, unsigned int mt, struct scatterlist *sgl,
+		     const unsigned int capacity, unsigned int *offset)
 {
 	struct free_area *area = &zone->free_area[order];
 	struct list_head *list = &area->free_list[mt];
@@ -161,10 +161,10 @@ void __page_reporting_notify(void)
 	 * list processed. This should result in us reporting all pages on
 	 * an idle system in about 30 seconds.
 	 *
-	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
-	 * should always be a power of 2.
+	 * The division here should be cheap since capacity should
+	 * always be a power of 2.
 	 */
-	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+	budget = DIV_ROUND_UP(area->nr_free, capacity * 16);
 
 	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
@@ -196,7 +196,7 @@ void __page_reporting_notify(void)
 			--(*offset);
 			sg_set_page(&sgl[*offset], page, page_len, 0);
 
-			nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
+			nr_pages = (capacity - *offset) << order;
 			if (zone->reported_pages + nr_pages >= threshold) {
 				err = 1;
 				break;
@@ -217,10 +217,10 @@ void __page_reporting_notify(void)
 		spin_unlock_irq(&zone->lock);
 
 		/* begin processing pages in local list */
-		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+		err = prdev->report(prdev, sgl, capacity);
 
 		/* reset offset since the full list was reported */
-		*offset = PAGE_REPORTING_CAPACITY;
+		*offset = capacity;
 
 		/* update budget to reflect call to report function */
 		budget--;
@@ -229,7 +229,7 @@ void __page_reporting_notify(void)
 		spin_lock_irq(&zone->lock);
 
 		/* flush reported pages from the sg list */
-		page_reporting_drain(prdev, sgl, zone, PAGE_REPORTING_CAPACITY, !err);
+		page_reporting_drain(prdev, sgl, zone, capacity, !err);
 
 		/*
 		 * Reset next to first entry, the old next isn't valid
@@ -251,12 +251,39 @@ void __page_reporting_notify(void)
 	return err;
 }
 
+/*
+ * For a guest with little free memory, we should tune the reporting
+ * capacity correctly to avoid reporting too much at once, otherwise
+ * user allocations may fail and OOM during the reporting window
+ * between __isolate_free_page() and page_reporting_drain().
+ *
+ * Calculate from which order we begin to reduce the scatterlist
+ * capacity, in order not to isolate too many pages and fail the user
+ * allocation.
+ */
+static unsigned int calculate_zone_order_threshold(struct zone *z)
+{
+	unsigned int order;
+	long pages_threshold;
+
+	pages_threshold = zone_page_state(z, NR_FREE_PAGES) - low_wmark_pages(z);
+	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+		if ((PAGE_REPORTING_CAPACITY << order) > pages_threshold)
+			break;
+	}
+
+	return order;
+}
+
 static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
			    struct scatterlist *sgl, struct zone *zone)
 {
-	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned int order, mt, leftover, offset;
 	unsigned long watermark, threshold;
+	unsigned int capacity = PAGE_REPORTING_CAPACITY;
+
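The order-threshold logic above can be replayed in userspace; this sketch mirrors the patch's loop with the same macro values (and assumes 4KiB pages for the worked example, with `MAX_ORDER` taken as 11 as on typical x86 configs):

```c
#define PAGE_REPORTING_CAPACITY  32
#define PAGE_REPORTING_MIN_ORDER 9
#define MAX_ORDER                11

/* Userspace model of calculate_zone_order_threshold(): find the first
 * order at which isolating a full scatterlist of pages would exceed
 * the free pages available above the low watermark. */
static unsigned int order_threshold(long free_pages, long low_wmark)
{
	long pages_threshold = free_pages - low_wmark;
	unsigned int order;

	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
		if (((long)PAGE_REPORTING_CAPACITY << order) > pages_threshold)
			break;
	}
	return order;
}
```

With 100MiB free (25600 4KiB pages) and a zero watermark, a full order-9 batch (32 << 9 = 16384 pages) still fits but an order-10 one (32768) does not, so capacity starts shrinking at order 10 — consistent with the changelog's "order-9: 32, order-10: 16".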
[PATCH 1/4] mm/page_reporting: Introduce free page reported counters
It's useful to know how much memory has actually been reported, so add
a new zone::reported_pages to record that.

Add "/sys/kernel/mm/page_reporting/reported_kbytes" for the actual
amount of memory that has been reported.

Add "/sys/kernel/mm/page_reporting/refault_kbytes" for the accumulated
amount of memory that has refaulted in after being reported out.

Signed-off-by: Xunlei Pang
---
 include/linux/mmzone.h |   3 ++
 mm/page_alloc.c        |   4 +-
 mm/page_reporting.c    | 112 +++++++++++++++++++++++++++++++++++++++--
 mm/page_reporting.h    |   5 +++
 4 files changed, 119 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946ce..ebd169f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -530,6 +530,9 @@ struct zone {
 	atomic_long_t		managed_pages;
 	unsigned long		spanned_pages;
 	unsigned long		present_pages;
+#ifdef CONFIG_PAGE_REPORTING
+	unsigned long		reported_pages;
+#endif
 #ifdef CONFIG_CMA
 	unsigned long		cma_pages;
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee..c2c5688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -930,8 +930,10 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
 	/* clear reported state and update reported page count */
-	if (page_reported(page))
+	if (page_reported(page)) {
 		__ClearPageReported(page);
+		page_reporting_update_refault(zone, 1 << order);
+	}
 
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index c50d93f..ba195ea 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/percpu_counter.h>
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/page_reporting.h>
@@ -19,6 +20,22 @@ enum {
 	PAGE_REPORTING_ACTIVE
 };
 
+#ifdef CONFIG_SYSFS
+static struct percpu_counter refault_pages;
+
+void page_reporting_update_refault(struct zone *zone, unsigned int pages)
+{
+	zone->reported_pages -= pages;
+	percpu_counter_add_batch(&refault_pages, pages, INT_MAX / 2);
+}
+#else
+void page_reporting_update_refault(struct zone *zone, unsigned int pages)
+{
+	zone->reported_pages -= pages;
+}
+#endif
+
 /* request page reporting */
 static void
 __page_reporting_request(struct page_reporting_dev_info *prdev)
@@ -66,7 +83,8 @@ void __page_reporting_notify(void)
 
 static void
 page_reporting_drain(struct page_reporting_dev_info *prdev,
-		     struct scatterlist *sgl, unsigned int nents, bool reported)
+		     struct scatterlist *sgl, struct zone *zone,
+		     unsigned int nents, bool reported)
 {
 	struct scatterlist *sg = sgl;
 
@@ -92,8 +110,10 @@ void __page_reporting_notify(void)
 		 * report on the new larger page when we make our way
 		 * up to that higher order.
 		 */
-		if (PageBuddy(page) && buddy_order(page) == order)
+		if (PageBuddy(page) && buddy_order(page) == order) {
 			__SetPageReported(page);
+			zone->reported_pages += (1 << order);
+		}
 	} while ((sg = sg_next(sg)));
 
 	/* reinitialize scatterlist now that it is empty */
@@ -197,7 +217,7 @@ void __page_reporting_notify(void)
 	spin_lock_irq(&zone->lock);
 
 	/* flush reported pages from the sg list */
-	page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+	page_reporting_drain(prdev, sgl, zone, PAGE_REPORTING_CAPACITY, !err);
 
 	/*
 	 * Reset next to first entry, the old next isn't valid
@@ -260,7 +280,7 @@ void __page_reporting_notify(void)
 	/* flush any remaining pages out from the last report */
 	spin_lock_irq(&zone->lock);
-	page_reporting_drain(prdev, sgl, leftover, !err);
+	page_reporting_drain(prdev, sgl, zone, leftover, !err);
 	spin_unlock_irq(&zone->lock);
 }
 
@@ -362,3 +382,87 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
 	mutex_unlock(&page_reporting_mutex);
 }
 EXPORT_SYMBOL_GPL(page_reporting_unregister);
+
+#ifdef CONFIG_SYSFS
+#define REPORTING_ATTR(_name) \
+	static struct kobj_attribute _name##_attr = \
+		__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static unsigned long get_reported_kbytes(void)
+{
+	struct zone *z;
+	unsigned long nr_reported = 0;
+
+	for_each_populated_zone(z)
+		nr_reported += z->reported_pages;
+
+	return nr_reported << (PAGE_SHIFT - 10);
+}
+
+static ssize_t reported_kbytes_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+
[PATCH 3/4] mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
Currently the default behaviour (value 100) is to do full reporting.
Although we can control it after boot via the "reporting_factor" sysfs
knob, that could still be too late because some amount of memory has
already been reported before this knob is operated.

Sometimes we really want it safely off by default and to turn it on as
needed at runtime, so the "page_reporting_factor=" boot parameter
provides that guarantee and meets different default setting
requirements.

There's also a real-world problem that I noticed on tiny instances: it
always reports some memory at the boot stage, before the application
starts and uses up memory, which retriggers EPT faults after boot. The
following data (right after boot) indicates that 172032KiB of pages
were unnecessarily reported and refaulted in:

 $ cat /sys/kernel/mm/page_reporting/refault_kbytes
 172032
 $ cat /sys/kernel/mm/page_reporting/reported_kbytes
 0

Thus it's reasonable to turn page reporting off by default and enable
it at runtime as needed.

Signed-off-by: Xunlei Pang
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 mm/page_reporting.c                             | 13 +++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..46e296c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3524,6 +3524,9 @@
 			off: turn off poisoning (default)
 			on: turn on poisoning
 
+	page_reporting_factor= [KNL] Guest Free Page Reporting percentile.
+			[0, 100]: 0 - off, do not report; 100 - default, full reporting.
+
 	panic=		[KNL] Kernel behaviour on panic: delay <timeout>
 			timeout > 0: seconds before rebooting
 			timeout = 0: wait forever
 
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 86c6479..6ffedb8 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -524,3 +524,16 @@ static int __init page_reporting_init(void)
 }
 
 module_init(page_reporting_init);
+
+static int __init setup_reporting_factor(char *str)
+{
+	int v;
+
+	if (kstrtoint(str, 10, &v))
+		return -EINVAL;
+
+	if (v >= 0 && v <= 100)
+		reporting_factor = v;
+
+	return 0;
+}
+__setup("page_reporting_factor=", setup_reporting_factor);
-- 
1.8.3.1
[PATCH 0/4] mm/page_reporting: Some knobs and fixes
Add the following knobs in PATCH 1~3:
 /sys/kernel/mm/page_reporting/reported_kbytes
 /sys/kernel/mm/page_reporting/refault_kbytes
 /sys/kernel/mm/page_reporting/reporting_factor

Fix unexpected user OOM in PATCH 4.

Xunlei Pang (4):
  mm/page_reporting: Introduce free page reported counters
  mm/page_reporting: Introduce free page reporting factor
  mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
  mm/page_reporting: Fix possible user allocation failure

 Documentation/admin-guide/kernel-parameters.txt |   3 +
 include/linux/mmzone.h                          |   3 +
 mm/page_alloc.c                                 |   6 +-
 mm/page_reporting.c                             | 268 ++++++++++++++--
 4 files changed, 260 insertions(+), 20 deletions(-)

-- 
1.8.3.1
[PATCH 2/4] mm/page_reporting: Introduce free page reporting factor
Add a new "/sys/kernel/mm/page_reporting/reporting_factor" within
[0, 100], and stop page reporting when it reaches the configured
threshold. The default is 100, which means no limitation is imposed.
A percentile is adopted to reflect the fact that reporting works on a
per-zone basis.

We can control the total number of reported pages via this knob to
avoid EPT violations which may affect the performance of the business;
imagine how much a guest memory allocation burst or host long-tail
memory reclaim can hurt.

This knob can help make customized control policies according to VM
priority; it is also useful for testing, gray-release, etc.

Signed-off-by: Xunlei Pang
---
 mm/page_reporting.c | 60 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index ba195ea..86c6479 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,6 +11,8 @@
 #include "page_reporting.h"
 #include "internal.h"
 
+static int reporting_factor = 100;
+
 #define PAGE_REPORTING_DELAY	(2 * HZ)
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 
@@ -134,6 +136,7 @@ void __page_reporting_notify(void)
 	struct list_head *list = &area->free_list[mt];
 	unsigned int page_len = PAGE_SIZE << order;
 	struct page *page, *next;
+	unsigned long threshold;
 	long budget;
 	int err = 0;
 
@@ -144,6 +147,7 @@ void __page_reporting_notify(void)
 	if (list_empty(list))
 		return err;
 
+	threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
 	spin_lock_irq(&zone->lock);
 
 	/*
@@ -181,6 +185,8 @@ void __page_reporting_notify(void)
 
 		/* Attempt to pull page from list and place in scatterlist */
 		if (*offset) {
+			unsigned long nr_pages;
+
 			if (!__isolate_free_page(page, order)) {
 				next = page;
 				break;
@@ -190,6 +196,12 @@ void __page_reporting_notify(void)
 			--(*offset);
 			sg_set_page(&sgl[*offset], page, page_len, 0);
 
+			nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
+			if (zone->reported_pages + nr_pages >= threshold) {
+				err = 1;
+				break;
+			}
+
 			continue;
 		}
 
@@ -244,9 +256,13 @@ void __page_reporting_notify(void)
			    struct scatterlist *sgl, struct zone *zone)
 {
 	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
-	unsigned long watermark;
+	unsigned long watermark, threshold;
 	int err = 0;
 
+	threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
+	if (zone->reported_pages >= threshold)
+		return err;
+
 	/* Generate minimum watermark to be able to guarantee progress */
 	watermark = low_wmark_pages(zone) +
		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
@@ -267,11 +283,18 @@ void __page_reporting_notify(void)
 			err = page_reporting_cycle(prdev, zone, order, mt,
						   sgl, &offset);
 
+			/* Exceeded threshold, go report the leftover */
+			if (err > 0) {
+				err = 0;
+				goto leftover;
+			}
+
 			if (err)
 				return err;
 		}
 	}
 
+leftover:
 	/* report the leftover pages before going idle */
 	leftover = PAGE_REPORTING_CAPACITY - offset;
 	if (leftover) {
@@ -435,9 +458,44 @@ static ssize_t refault_kbytes_store(struct kobject *kobj,
 }
 REPORTING_ATTR(refault_kbytes);
 
+static ssize_t reporting_factor_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", reporting_factor);
+}
+
+static ssize_t reporting_factor_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int new, old, err;
+	struct page *page;
+
+	err = kstrtoint(buf, 10, &new);
+	if (err || (new < 0 || new > 100))
+		return -EINVAL;
+
+	old = reporting_factor;
+	reporting_factor = new;
+
+	if (new <= old)
+		goto out;
+
+	/* Trigger reporting with the new, larger reporting_factor */
+	page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN,
+			   PAGE_REPORTING_MIN_ORDER);
+	if (page)
+		__free_pages(page, PAGE_REPORTING_MIN_ORDER);
+
+out:
+	return count;
+}
+REPORTING_ATTR(
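The store path's input validation maps cleanly to a userspace model; this sketch (a hypothetical `store_reporting_factor`, standing in for the kernel's `kstrtoint`-based check) accepts only decimal values in [0, 100] and rejects everything else with -EINVAL, mirroring the patch's bounds check:

```c
#include <errno.h>
#include <stdlib.h>

/* Userspace model of reporting_factor_store() validation: parse a
 * decimal string and accept only values in [0, 100]. */
static int store_reporting_factor(const char *buf, int *factor)
{
	char *end;
	long v = strtol(buf, &end, 10);

	if (end == buf || v < 0 || v > 100)
		return -EINVAL;

	*factor = (int)v;
	return 0;
}
```

Note the kernel version additionally triggers an immediate reporting pass when the factor grows, by allocating and freeing a `PAGE_REPORTING_MIN_ORDER` page to poke the notifier.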
Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
On 3/18/21 8:18 PM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends a long time iterating
>> in case of a large amount of partial page lists, which can cause a
>> thundering herd effect on the list_lock contention.
>>
>> We have HSF RT (High-speed Service Framework Response-Time) monitors;
>> the RT figures fluctuated randomly, so we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock costing nearly 100ms with irqs off, issued by "ss"; this
>> also caused network timeouts.
>
> I forgot to ask, how does "ss" come into this? It displays network
> connections AFAIK. Does it read any SLUB counters or slabinfo?

ss may access /proc/slabinfo to acquire network-related slab statistics.
Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
On 3/18/21 2:45 AM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends a long time iterating
>> in case of a large amount of partial page lists, which can cause a
>> thundering herd effect on the list_lock contention.
>>
>> We have HSF RT (High-speed Service Framework Response-Time) monitors;
>> the RT figures fluctuated randomly, so we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock costing nearly 100ms with irqs off, issued by "ss"; this
>> also caused network timeouts.
>>
>> This patch introduces two counters to maintain the actual number
>> of partial objects dynamically instead of iterating the partial
>> page lists with list_lock held.
>>
>> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
>> The main operations are under list_lock in the slow path, so the
>> performance impact is expected to be minimal except for the
>> __slab_free() path.
>>
>> The only concern of introducing the partial counter is that
>> partial_free_objs may cause cacheline contention and false sharing
>> issues in case of concurrent __slab_free() on the same slab, so
>> define it as a percpu counter and place it carefully.
>
> Hm I wonder, is it possible that this will eventually overflow/underflow
> the counter on some CPU? (I guess practically only on 32bit.) Maybe the
> operations that are already done under n->list_lock should flush the
> percpu counter to a shared counter?

You are right, thanks a lot for noticing this.

> ...
>
>> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  		       head, new.counters,
>>  		       "__slab_free"));
>>
>> +	if (!was_frozen && prior) {
>> +		if (n)
>> +			__update_partial_free(n, cnt);
>> +		else
>> +			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
>> +	}
>
> I would guess this is the part that makes your measurements notice that
> (although tiny) difference. We didn't need to obtain the node pointer
> before, and now we do. And that is really done just for the per-node
> breakdown in "objects" and "objects_partial" files under
> /sys/kernel/slab - distinguishing nodes is not needed for
> /proc/slabinfo. So that kinda justifies putting this under a new CONFIG
> as you did. Although perhaps somebody interested in these kinds of
> stats would enable CONFIG_SLUB_STATS anyway, so that's still an option
> to use instead of introducing a new oddly specific CONFIG? At least
> until somebody comes up and presents a use case where they want the
> per-node breakdowns in /sys but cannot afford CONFIG_SLUB_STATS.
>
> But I'm also still thinking about simply counting all free objects
> (for the purposes of accuracy in /proc/slabinfo) as a percpu variable
> in struct kmem_cache itself. That would basically put this_cpu_add()
> in all the fast paths, but AFAICS thanks to the segment register it
> doesn't mean disabling interrupts nor a LOCK operation, so maybe it
> wouldn't be that bad? And it shouldn't need to deal with these node
> pointers. So maybe that would be acceptable for CONFIG_SLUB_DEBUG?
> Guess I'll have to try...

The percpu operation itself should be fine; it looks to be a cacheline
pingpong issue due to the extra percpu counter access, so making it
cacheline aligned improves things a little according to my tests.
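The "cacheline aligned" placement mentioned in the reply can be illustrated with a small C11 sketch; the struct name is hypothetical and a 64-byte line size is an assumption for the example (real kernels use `____cacheline_aligned` with the arch's `L1_CACHE_BYTES`):

```c
#include <stdalign.h>

/* A counter padded out to its own 64-byte cache line, so that two
 * counters updated by different CPUs never share a line and thus
 * avoid the false-sharing pingpong described above. */
struct partial_free_counter {
	alignas(64) long count;
};
```

Because the struct's alignment is 64, its size rounds up to 64 as well, so adjacent array elements land on distinct cache lines.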
[PATCH v4 3/3] mm/slub: Get rid of count_partial()
Now that the partial counters are ready, let's use them to get rid of
count_partial(). The partial counters are involved in calculating the
accurate partial usage when CONFIG_SLUB_DEBUG_PARTIAL is on; otherwise
simply assume zero usage statistics.

Tested-by: James Wang
Signed-off-by: Xunlei Pang
---
 mm/slub.c | 64 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 856aea4..9bff669 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-	return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
 	return atomic_long_read(&n->total_objects);
@@ -2545,18 +2540,33 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-				   int (*get_count)(struct page *))
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL, PARTIAL_SLAB };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+				     enum partial_item item)
 {
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
+	unsigned long ret = 0;
 
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, slab_list)
-		x += get_count(page);
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+	if (item == PARTIAL_FREE) {
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+	} else if (item == PARTIAL_TOTAL) {
+		ret = n->partial_total_objs;
+	} else if (item == PARTIAL_INUSE) {
+		ret = per_cpu_sum(*n->partial_free_objs);
+		if ((long)ret < 0)
+			ret = 0;
+		ret = n->partial_total_objs - ret;
+		if ((long)ret < 0)
+			ret = 0;
+	} else { /* item == PARTIAL_SLAB */
+		ret = n->nr_partial;
+	}
+#endif
+
+	return ret;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
@@ -2587,7 +2597,7 @@
 	unsigned long nr_objs;
 	unsigned long nr_free;
 
-	nr_free  = count_partial(n, count_free);
+	nr_free  = partial_counter(n, PARTIAL_FREE);
 	nr_slabs = node_nr_slabs(n);
 	nr_objs  = node_nr_objs(n);
 
@@ -4654,18 +4664,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-	return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-	return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -5102,7 +5100,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 			x = atomic_long_read(&n->total_objects);
 		else if (flags & SO_OBJECTS)
 			x = atomic_long_read(&n->total_objects) -
-			    count_partial(n, count_free);
+			    partial_counter(n, PARTIAL_FREE);
 		else
 			x = atomic_long_read(&n->nr_slabs);
 		total += x;
@@ -5116,11 +5114,11 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 		for_each_kmem_cache_node(s, node, n) {
 			if (flags & SO_TOTAL)
-				x = count_partial(n, count_total);
+				x = partial_counter(n, PARTIAL_TOTAL);
 			else if (flags & SO_OBJECTS)
-				x = count_partial(n, count_inuse);
+				x = partial_counter(n, PARTIAL_INUSE);
 			else
-				x = n->nr_partial;
+				x = partial_counter(n, PARTIAL_SLAB);
 
 			total += x;
 			nodes[node] += x;
 		}
@@ -5884,7 +5882,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
+		nr_free += partial_counter(n, PARTIAL_FREE);
 	}
 
 	sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1
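The clamping in PARTIAL_INUSE exists because per-CPU deltas can transiently sum negative. A userspace model (the function name and flat array of per-CPU values are illustrative, standing in for `per_cpu_sum()` over `partial_free_objs`) shows why both `(long)` casts in the patch are needed:

```c
/* Sum per-CPU free-object deltas, clamp the possibly-negative
 * intermediate results at zero, then derive the in-use count --
 * mirroring the PARTIAL_INUSE branch of partial_counter() above. */
static unsigned long partial_inuse(const long *cpu_free, int ncpus,
				   unsigned long total)
{
	long free_sum = 0;
	long inuse;

	for (int i = 0; i < ncpus; i++)
		free_sum += cpu_free[i];
	if (free_sum < 0)		/* racy readers may see a negative sum */
		free_sum = 0;

	inuse = (long)total - free_sum;
	return inuse < 0 ? 0 : (unsigned long)inuse;
}
```

Without the clamps, a reader racing with concurrent `__slab_free()` updates could report a huge bogus count after unsigned wraparound.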
[PATCH v4 0/3] mm/slub: Fix count_partial() problem
count_partial() can hold the n->list_lock spinlock for quite a long time, which causes serious trouble for the system. This series eliminates this problem.

v1->v2: - Improved changelog and variable naming for PATCH 1~2. - PATCH3 adds per-cpu counter to avoid performance regression in concurrent __slab_free().
v2->v3: - Changed "page->inuse" to the safe "new.inuse", etc. - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters. - atomic_long_t -> unsigned long
v3->v4: - Introduced a new CONFIG_SLUB_DEBUG_PARTIAL so the feature has a chance of being enabled for production use. - Merged PATCH 4 into PATCH 1.

[Testing] There seems to be a slight performance impact under extreme __slab_free() concurrent calls according to my tests. On my 32-cpu 2-socket physical machine: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 2
== original, not patched
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.536050899 seconds time elapsed ( +- 0.24% )
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.588049142 seconds time elapsed ( +- 0.35% )
== patched with patch1~4
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.670892273 seconds time elapsed ( +- 0.29% )
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.746755689 seconds time elapsed ( +- 0.21% )

2) perf stat --null --repeat 10 -- hackbench 32 thread 2
== original, not patched
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.784911855 seconds time elapsed ( +- 0.14% )
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.868687608 seconds time elapsed ( +- 0.19% )
== patched with patch1~4
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.681273015 seconds time elapsed ( +- 0.21% )
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.681238459 seconds time elapsed ( +- 0.09% )

Xunlei Pang (3): mm/slub: Introduce two counters for partial objects percpu: Export per_cpu_sum() mm/slub: Get rid of count_partial()

include/linux/percpu-defs.h | 10 init/Kconfig | 13 + kernel/locking/percpu-rwsem.c | 10 mm/slab.h | 6 ++ mm/slub.c | 129 +- 5 files changed, 120 insertions(+), 48 deletions(-) -- 1.8.3.1
[PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
The node list_lock in count_partial() is held for a long time while iterating in case of a large number of partial page lists, which can cause a thundering herd effect on the list_lock contention. We have HSF RT (High-speed Service Framework Response-Time) monitors; the RT figures fluctuated randomly, so we deployed a tool detecting "irq off" and "preempt off" to dump the culprit's calltrace. It captured the list_lock costing nearly 100ms with irqs off, issued by "ss", which also caused network timeouts.

This patch introduces two counters to maintain the actual number of partial objects dynamically, instead of iterating the partial page lists with list_lock held. New counters of kmem_cache_node: partial_free_objs, partial_total_objs. The main operations are under list_lock in the slow path, so the performance impact is expected to be minimal except for the __slab_free() path.

The only concern with introducing the partial counters is that partial_free_objs may cause cacheline contention and false sharing issues under concurrent __slab_free() on the same SLUB, so define it as a percpu counter and place it carefully.

As both Vlastimil Babka and Christoph Lameter suggested, we usually don't care about accurate partial object counts; but in case somebody actually wants accurate object statistics at the expense of peak performance, it would be nice to give them such an option in SLUB. Thus a new CONFIG_SLUB_DEBUG_PARTIAL is introduced to maintain accurate partial objects; it defaults to off to avoid unexpected performance degradation.

When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs or proc files related to the partial list (/proc/slabinfo, sysfs objects_partial, etc.) report zero usage data, e.g., the "active_objs" and "num_objs" fields of "/proc/slabinfo" are equal, and "cat /sys/kernel/slab//partial" displays "0".
Tested-by: James Wang
Signed-off-by: Xunlei Pang
---
init/Kconfig | 13 + mm/slab.h | 6 ++ mm/slub.c | 63 3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig index 22946fe..686851b 100644 --- a/init/Kconfig +++ b/init/Kconfig
@@ -1867,6 +1867,19 @@ config SLUB_DEBUG SLUB sysfs support. /sys/slab will not exist and there will be no support for cache validation etc. +config SLUB_DEBUG_PARTIAL + default n + bool "Enable SLUB debugging for the node partial list" if EXPERT + depends on SLUB && SYSFS + help + Maintain runtime counters for the node partial list debugging. + When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs or proc + files related to the partial list report zero usage data, e.g., + the values of "active_objs" and "num_objs" in "/proc/slabinfo" are + equal, and "cat /sys/kernel/slab//partial" displays "0". + + It might have a slight performance impact for production use. + config COMPAT_BRK bool "Disable heap randomization" default y

diff --git a/mm/slab.h b/mm/slab.h index 076582f..f47c811 100644 --- a/mm/slab.h +++ b/mm/slab.h
@@ -546,12 +546,18 @@ struct kmem_cache_node { #ifdef CONFIG_SLUB unsigned long nr_partial; +#ifdef CONFIG_SLUB_DEBUG_PARTIAL + unsigned long partial_total_objs; +#endif struct list_head partial; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; struct list_head full; #endif +#ifdef CONFIG_SLUB_DEBUG_PARTIAL + unsigned long __percpu *partial_free_objs ____cacheline_aligned_in_smp; +#endif #endif };

diff --git a/mm/slub.c b/mm/slub.c index e26c274..856aea4 100644 --- a/mm/slub.c +++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page) /* * Management of partially allocated slabs. */ +#ifdef CONFIG_SLUB_DEBUG_PARTIAL +static inline void +__update_partial_free(struct kmem_cache_node *n, long delta) +{ + this_cpu_add(*n->partial_free_objs, delta); +} + +static inline void +__update_partial_total(struct kmem_cache_node *n, long delta) +{ + n->partial_total_objs += delta; +} +#else +static inline void +__update_partial_free(struct kmem_cache_node *n, long delta) { } + +static inline void +__update_partial_total(struct kmem_cache_node *n, long delta) { } +#endif + static inline void __add_partial(struct kmem_cache_node *n, struct page *page, int tail) { n->nr_partial++; + __update_partial_total(n, page->objects); if (tail == DEACTIVATE_TO_TAIL) list_add_tail(&page->slab_list, &n->partial); else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n, lockdep_assert_held(&n->list_lock); list_del(&page->slab_list); n->n
[PATCH v4 2/3] percpu: Export per_cpu_sum()
per_cpu_sum() is useful, and deserves to be exported. Tested-by: James Wang Signed-off-by: Xunlei Pang --- include/linux/percpu-defs.h | 10 ++ kernel/locking/percpu-rwsem.c | 10 -- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index dff7040..0e71b68 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -220,6 +220,16 @@ (void)__vpp_verify; \ } while (0) +#define per_cpu_sum(var) \ +({ \ + typeof(var) __sum = 0; \ + int cpu;\ + compiletime_assert_atomic_type(__sum); \ + for_each_possible_cpu(cpu) \ + __sum += per_cpu(var, cpu); \ + __sum; \ +}) + #ifdef CONFIG_SMP /* diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 70a32a5..0980e51 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) } EXPORT_SYMBOL_GPL(__percpu_down_read); -#define per_cpu_sum(var) \ -({ \ - typeof(var) __sum = 0; \ - int cpu;\ - compiletime_assert_atomic_type(__sum); \ - for_each_possible_cpu(cpu) \ - __sum += per_cpu(var, cpu); \ - __sum; \ -}) - /* * Return true if the modular sum of the sem->read_count per-CPU variable is * zero. If this sum is zero, then it is stable due to the fact that if any -- 1.8.3.1
Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
On 3/16/21 7:02 PM, Vlastimil Babka wrote: > On 3/16/21 11:42 AM, Xunlei Pang wrote: >> On 3/16/21 2:49 AM, Vlastimil Babka wrote: >>> On 3/9/21 4:25 PM, Xunlei Pang wrote: >>>> count_partial() can hold n->list_lock spinlock for quite long, which >>>> makes much trouble to the system. This series eliminate this problem. >>> >>> Before I check the details, I have two high-level comments: >>> >>> - patch 1 introduces some counting scheme that patch 4 then changes, could >>> we do >>> this in one step to avoid the churn? >>> >>> - the series addresses the concern that spinlock is being held, but doesn't >>> address the fact that counting partial per-node slabs is not nearly enough >>> if we >>> want accurate in /proc/slabinfo because there are also percpu >>> slabs and per-cpu partial slabs, where we don't track the free objects at >>> all. >>> So after this series while the readers of /proc/slabinfo won't block the >>> spinlock, they will get the same garbage data as before. So Christoph is not >>> wrong to say that we can just report active_objs == num_objs and it won't >>> actually break any ABI. >> >> If maintainers don't mind this inaccuracy which I also doubt its >> importance, then it becomes easy. For fear that some people who really >> cares, introducing an extra config(default-off) for it would be a good >> option. > > Great. > >>> At the same time somebody might actually want accurate object statistics at >>> the >>> expense of peak performance, and it would be nice to give them such option >>> in >>> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS, >>> although that option provides many additional tuning stats, with additional >>> overhead. >>> So my proposal would be a new config for "accurate active objects" (or just >>> tie >>> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters >>> in >>> patch 4 to all alloc/free, so that it includes percpu slabs. 
Without this >>> config >>> enabled, let's just report active_objs == num_objs. >> For percpu slabs, the numbers can be retrieved from the existing >> slub_percpu_partial()->pobjects, looks no need extra work. > > Hm, unfortunately it's not that simple, the number there is a snapshot that > can > become wildly inacurate afterwards. > It's hard to make it absolutely accurate using percpu; the data can change while iterating over all the cpus and total_objects, and I can't imagine a real-world usage for such accuracy, not to mention the percpu freelist cache. I think the sysfs slabs_cpu_partial file should be enough for common debugging purposes.
Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem
On 3/16/21 2:49 AM, Vlastimil Babka wrote: > On 3/9/21 4:25 PM, Xunlei Pang wrote: >> count_partial() can hold n->list_lock spinlock for quite long, which >> makes much trouble to the system. This series eliminate this problem. > > Before I check the details, I have two high-level comments: > > - patch 1 introduces some counting scheme that patch 4 then changes, could we > do > this in one step to avoid the churn? > > - the series addresses the concern that spinlock is being held, but doesn't > address the fact that counting partial per-node slabs is not nearly enough if > we > want accurate in /proc/slabinfo because there are also percpu > slabs and per-cpu partial slabs, where we don't track the free objects at all. > So after this series while the readers of /proc/slabinfo won't block the > spinlock, they will get the same garbage data as before. So Christoph is not > wrong to say that we can just report active_objs == num_objs and it won't > actually break any ABI. If maintainers don't mind this inaccuracy, whose importance I also doubt, then it becomes easy. In case some people really care, introducing an extra config (default off) for it would be a good option. > At the same time somebody might actually want accurate object statistics at > the > expense of peak performance, and it would be nice to give them such option in > SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS, > although that option provides many additional tuning stats, with additional > overhead. > So my proposal would be a new config for "accurate active objects" (or just > tie > it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in > patch 4 to all alloc/free, so that it includes percpu slabs. Without this > config > enabled, let's just report active_objs == num_objs. For percpu slabs, the numbers can be retrieved from the existing slub_percpu_partial()->pobjects; it looks like no extra work is needed.
> > Vlastimil > >> v1->v2: >> - Improved changelog and variable naming for PATCH 1~2. >> - PATCH3 adds per-cpu counter to avoid performance regression >> in concurrent __slab_free(). >> >> v2->v3: >> - Changed "page->inuse" to the safe "new.inuse", etc. >> - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters. >> - atomic_long_t -> unsigned long >> >> [Testing] >> There seems might be a little performance impact under extreme >> __slab_free() concurrent calls according to my tests. >> >> On my 32-cpu 2-socket physical machine: >> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz >> >> 1) perf stat --null --repeat 10 -- hackbench 20 thread 2 >> >> == original, no patched >> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >> >> 24.536050899 seconds time elapsed >> ( +- 0.24% ) >> >> >> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >> >> 24.588049142 seconds time elapsed >> ( +- 0.35% ) >> >> >> == patched with patch1~4 >> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >> >> 24.670892273 seconds time elapsed >> ( +- 0.29% ) >> >> >> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >> >> 24.746755689 seconds time elapsed >> ( +- 0.21% ) >> >> >> 2) perf stat --null --repeat 10 -- hackbench 32 thread 2 >> >> == original, no patched >> Performance counter stats for 'hackbench 32 thread 2' (10 runs): >> >> 39.784911855 seconds time elapsed >> ( +- 0.14% ) >> >> Performance counter stats for 'hackbench 32 thread 2' (10 runs): >> >> 39.868687608 seconds time elapsed >> ( +- 0.19% ) >> >> == patched with patch1~4 >> Performance counter stats for 'hackbench 32 thread 2' (10 runs): >> >> 39.681273015 seconds time elapsed >> ( +- 0.21% ) >> >> Performance counter stats for 'hackbench 32 thread 2' (10 runs): >> >> 39.681238459 seconds time elapsed >> ( +- 0.09% ) >> >> >> Xunlei Pang (4): >> mm/slub: Introduce two counters for partial objects >> mm/slub: Get rid of count_partial() >> percpu: Export 
per_cpu_sum() >> mm/slub: Use percpu partial free counter >> >> include/linux/percpu-defs.h | 10 >> kernel/locking/percpu-rwsem.c | 10 >> mm/slab.h | 4 ++ >> mm/slub.c | 120 >> +- >> 4 files changed, 97 insertions(+), 47 deletions(-) >>
[PATCH v3 2/4] mm/slub: Get rid of count_partial()
Now that the partial counters are ready, let's use them directly and get rid of count_partial().

Tested-by: James Wang
Reviewed-by: Pekka Enberg
Signed-off-by: Xunlei Pang
---
mm/slub.c | 54 ++
1 file changed, 22 insertions(+), 32 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c index 4d02831..3f76b57 100644 --- a/mm/slub.c +++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node) } #ifdef CONFIG_SLUB_DEBUG -static int count_free(struct page *page) -{ - return page->objects - page->inuse; -} - static inline unsigned long node_nr_objs(struct kmem_cache_node *n) { return atomic_long_read(&n->total_objects);
@@ -2545,19 +2540,26 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n) #endif /* CONFIG_SLUB_DEBUG */ #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) -static unsigned long count_partial(struct kmem_cache_node *n, - int (*get_count)(struct page *)) +enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL }; + +static unsigned long partial_counter(struct kmem_cache_node *n, + enum partial_item item) { - unsigned long flags; - unsigned long x = 0; - struct page *page; + unsigned long ret = 0; - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, slab_list) - x += get_count(page); - spin_unlock_irqrestore(&n->list_lock, flags); - return x; + if (item == PARTIAL_FREE) { + ret = atomic_long_read(&n->partial_free_objs); + } else if (item == PARTIAL_TOTAL) { + ret = n->partial_total_objs; + } else if (item == PARTIAL_INUSE) { + ret = n->partial_total_objs - atomic_long_read(&n->partial_free_objs); + if ((long)ret < 0) + ret = 0; + } + + return ret; } + #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */ static noinline void
@@ -2587,7 +2589,7 @@ static unsigned long count_partial(struct kmem_cache_node *n, unsigned long nr_objs; unsigned long nr_free; - nr_free = count_partial(n, count_free); + nr_free = partial_counter(n, PARTIAL_FREE); nr_slabs = node_nr_slabs(n); nr_objs = node_nr_objs(n);
@@ -4643,18 +4645,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, EXPORT_SYMBOL(__kmalloc_node_track_caller); #endif -#ifdef CONFIG_SYSFS -static int count_inuse(struct page *page) -{ - return page->inuse; -} - -static int count_total(struct page *page) -{ - return page->objects; -} -#endif - #ifdef CONFIG_SLUB_DEBUG static void validate_slab(struct kmem_cache *s, struct page *page) {
@@ -5091,7 +5081,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, x = atomic_long_read(&n->total_objects); else if (flags & SO_OBJECTS) x = atomic_long_read(&n->total_objects) - - count_partial(n, count_free); + partial_counter(n, PARTIAL_FREE); else x = atomic_long_read(&n->nr_slabs); total += x;
@@ -5105,9 +5095,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s, for_each_kmem_cache_node(s, node, n) { if (flags & SO_TOTAL) - x = count_partial(n, count_total); + x = partial_counter(n, PARTIAL_TOTAL); else if (flags & SO_OBJECTS) - x = count_partial(n, count_inuse); + x = partial_counter(n, PARTIAL_INUSE); else x = n->nr_partial; total += x;
@@ -5873,7 +5863,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo) for_each_kmem_cache_node(s, node, n) { nr_slabs += node_nr_slabs(n); nr_objs += node_nr_objs(n); - nr_free += count_partial(n, count_free); + nr_free += partial_counter(n, PARTIAL_FREE); } sinfo->active_objs = nr_objs - nr_free; -- 1.8.3.1
[PATCH v3 1/4] mm/slub: Introduce two counters for partial objects
The node list_lock in count_partial() is held for a long time while iterating in case of a large number of partial page lists, which can cause a thundering herd effect on the list_lock contention. We have HSF RT (High-speed Service Framework Response-Time) monitors; the RT figures fluctuated randomly, so we deployed a tool detecting "irq off" and "preempt off" to dump the culprit's calltrace. It captured the list_lock costing nearly 100ms with irqs off, issued by "ss", which also caused network timeouts.

This patch introduces two counters to maintain the actual number of partial objects dynamically, instead of iterating the partial page lists with list_lock held. New counters of kmem_cache_node: partial_free_objs, partial_total_objs. The main operations are under list_lock in the slow path, so the performance impact should be minimal except for the __slab_free() path, which will be addressed later.

Tested-by: James Wang
Reviewed-by: Pekka Enberg
Signed-off-by: Xunlei Pang
---
mm/slab.h | 4 mm/slub.c | 46 +- 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h index 076582f..817bfa0 100644 --- a/mm/slab.h +++ b/mm/slab.h
@@ -547,6 +547,10 @@ struct kmem_cache_node { #ifdef CONFIG_SLUB unsigned long nr_partial; struct list_head partial; +#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) + atomic_long_t partial_free_objs; + unsigned long partial_total_objs; +#endif #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects;

diff --git a/mm/slub.c b/mm/slub.c index e26c274..4d02831 100644 --- a/mm/slub.c +++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page) /* * Management of partially allocated slabs.
*/ +#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) +static inline void +__update_partial_free(struct kmem_cache_node *n, long delta) +{ + atomic_long_add(delta, &n->partial_free_objs); +} + +static inline void +__update_partial_total(struct kmem_cache_node *n, long delta) +{ + n->partial_total_objs += delta; +} +#else +static inline void +__update_partial_free(struct kmem_cache_node *n, long delta) { } + +static inline void +__update_partial_total(struct kmem_cache_node *n, long delta) { } +#endif + static inline void __add_partial(struct kmem_cache_node *n, struct page *page, int tail) { n->nr_partial++; + __update_partial_total(n, page->objects); if (tail == DEACTIVATE_TO_TAIL) list_add_tail(&page->slab_list, &n->partial); else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n, lockdep_assert_held(&n->list_lock); list_del(&page->slab_list); n->nr_partial--; + __update_partial_total(n, -page->objects); } /*
@@ -1957,6 +1979,7 @@ static inline void *acquire_slab(struct kmem_cache *s, return NULL; remove_partial(n, page); + __update_partial_free(n, -*objects); WARN_ON(!freelist); return freelist; }
@@ -2286,8 +2309,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page, "unfreezing slab")) goto redo; - if (lock) + if (lock) { + if (m == M_PARTIAL) + __update_partial_free(n, new.objects - new.inuse); spin_unlock(&n->list_lock); + } if (m == M_PARTIAL) stat(s, tail);
@@ -2353,6 +2379,7 @@ static void unfreeze_partials(struct kmem_cache *s, discard_page = page; } else { add_partial(n, page, DEACTIVATE_TO_TAIL); + __update_partial_free(n, new.objects - new.inuse); stat(s, FREE_ADD_PARTIAL); } }
@@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page, head, new.counters, "__slab_free")); + if (!was_frozen && prior) { + if (n) + __update_partial_free(n, cnt); + else + __update_partial_free(get_node(s, page_to_nid(page)), cnt); + } + if (likely(!n)) { if (likely(was_frozen)) {
@@ -3069,6 +3103,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) { remove_full(s, n, page); add_partial(n, page, DEACTIVATE_TO_TAIL); + __update_partial_free(n, cnt); stat(s, FREE_ADD_PARTIAL); } spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3080,6 +3115,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * Slab on the partial list.
[PATCH v3 3/4] percpu: Export per_cpu_sum()
per_cpu_sum() is useful, and deserves to be exported. Tested-by: James Wang Signed-off-by: Xunlei Pang --- include/linux/percpu-defs.h | 10 ++ kernel/locking/percpu-rwsem.c | 10 -- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index dff7040..0e71b68 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -220,6 +220,16 @@ (void)__vpp_verify; \ } while (0) +#define per_cpu_sum(var) \ +({ \ + typeof(var) __sum = 0; \ + int cpu;\ + compiletime_assert_atomic_type(__sum); \ + for_each_possible_cpu(cpu) \ + __sum += per_cpu(var, cpu); \ + __sum; \ +}) + #ifdef CONFIG_SMP /* diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 70a32a5..0980e51 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) } EXPORT_SYMBOL_GPL(__percpu_down_read); -#define per_cpu_sum(var) \ -({ \ - typeof(var) __sum = 0; \ - int cpu;\ - compiletime_assert_atomic_type(__sum); \ - for_each_possible_cpu(cpu) \ - __sum += per_cpu(var, cpu); \ - __sum; \ -}) - /* * Return true if the modular sum of the sem->read_count per-CPU variable is * zero. If this sum is zero, then it is stable due to the fact that if any -- 1.8.3.1
[PATCH v3 0/4] mm/slub: Fix count_partial() problem
count_partial() can hold the n->list_lock spinlock for quite a long time, which causes serious trouble for the system. This series eliminates this problem.

v1->v2: - Improved changelog and variable naming for PATCH 1~2. - PATCH3 adds per-cpu counter to avoid performance regression in concurrent __slab_free().
v2->v3: - Changed "page->inuse" to the safe "new.inuse", etc. - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters. - atomic_long_t -> unsigned long

[Testing] There seems to be a slight performance impact under extreme __slab_free() concurrent calls according to my tests. On my 32-cpu 2-socket physical machine: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 2
== original, not patched
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.536050899 seconds time elapsed ( +- 0.24% )
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.588049142 seconds time elapsed ( +- 0.35% )
== patched with patch1~4
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.670892273 seconds time elapsed ( +- 0.29% )
Performance counter stats for 'hackbench 20 thread 2' (10 runs): 24.746755689 seconds time elapsed ( +- 0.21% )

2) perf stat --null --repeat 10 -- hackbench 32 thread 2
== original, not patched
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.784911855 seconds time elapsed ( +- 0.14% )
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.868687608 seconds time elapsed ( +- 0.19% )
== patched with patch1~4
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.681273015 seconds time elapsed ( +- 0.21% )
Performance counter stats for 'hackbench 32 thread 2' (10 runs): 39.681238459 seconds time elapsed ( +- 0.09% )

Xunlei Pang (4): mm/slub: Introduce two counters for partial objects mm/slub: Get rid of count_partial() percpu: Export per_cpu_sum() mm/slub: Use percpu partial free counter

include/linux/percpu-defs.h | 10 kernel/locking/percpu-rwsem.c | 10 mm/slab.h | 4 ++ mm/slub.c | 120 +- 4 files changed, 97 insertions(+), 47 deletions(-) -- 1.8.3.1
[PATCH v3 4/4] mm/slub: Use percpu partial free counter
The only concern with introducing the partial counters is that partial_free_objs may cause cacheline and atomic-operation contention under concurrent __slab_free() on the same SLUB. This patch changes it to a percpu counter, and also rearranges the counter variables to avoid cacheline issues.

Tested-by: James Wang
Reviewed-by: Pekka Enberg
Signed-off-by: Xunlei Pang
---
mm/slab.h | 6 -- mm/slub.c | 30 +++--- 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h index 817bfa0..c819597 100644 --- a/mm/slab.h +++ b/mm/slab.h
@@ -546,16 +546,18 @@ struct kmem_cache_node { #ifdef CONFIG_SLUB unsigned long nr_partial; - struct list_head partial; #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) - atomic_long_t partial_free_objs; unsigned long partial_total_objs; #endif + struct list_head partial; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; struct list_head full; #endif +#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) + unsigned long __percpu *partial_free_objs; +#endif #endif };

diff --git a/mm/slub.c b/mm/slub.c index 3f76b57..b6ec065 100644 --- a/mm/slub.c +++ b/mm/slub.c
@@ -1894,7 +1894,7 @@ static void discard_slab(struct kmem_cache *s, struct page *page) static inline void __update_partial_free(struct kmem_cache_node *n, long delta) { - atomic_long_add(delta, &n->partial_free_objs); + this_cpu_add(*n->partial_free_objs, delta); } static inline void
@@ -2548,11 +2548,16 @@ static unsigned long partial_counter(struct kmem_cache_node *n, unsigned long ret = 0; if (item == PARTIAL_FREE) { - ret = atomic_long_read(&n->partial_free_objs); + ret = per_cpu_sum(*n->partial_free_objs); + if ((long)ret < 0) + ret = 0; } else if (item == PARTIAL_TOTAL) { ret = n->partial_total_objs; } else if (item == PARTIAL_INUSE) { - ret = n->partial_total_objs - atomic_long_read(&n->partial_free_objs); + ret = per_cpu_sum(*n->partial_free_objs); + if ((long)ret < 0) + ret = 0; + ret = n->partial_total_objs - ret; if ((long)ret < 0) ret = 0; }
@@ -3552,14 +3557,16 @@ static inline int calculate_order(unsigned int size) return -ENOSYS; } -static void +static int init_kmem_cache_node(struct kmem_cache_node *n) { n->nr_partial = 0; spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) - atomic_long_set(&n->partial_free_objs, 0); + n->partial_free_objs = alloc_percpu(unsigned long); + if (!n->partial_free_objs) + return -ENOMEM; n->partial_total_objs = 0; #endif #ifdef CONFIG_SLUB_DEBUG
@@ -3567,6 +3574,8 @@ static inline int calculate_order(unsigned int size) atomic_long_set(&n->total_objects, 0); INIT_LIST_HEAD(&n->full); #endif + + return 0; } static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
@@ -3626,7 +3635,7 @@ static void early_kmem_cache_node_alloc(int node) page->inuse = 1; page->frozen = 0; kmem_cache_node->node[node] = n; - init_kmem_cache_node(n); + BUG_ON(init_kmem_cache_node(n) < 0); inc_slabs_node(kmem_cache_node, node, page->objects); /*
@@ -3644,6 +3653,9 @@ static void free_kmem_cache_nodes(struct kmem_cache *s) for_each_kmem_cache_node(s, node, n) { s->node[node] = NULL; +#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) + free_percpu(n->partial_free_objs); +#endif kmem_cache_free(kmem_cache_node, n); } }
@@ -3674,7 +3686,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s) return 0; } - init_kmem_cache_node(n); + if (init_kmem_cache_node(n) < 0) { + free_kmem_cache_nodes(s); + return 0; + } + s->node[node] = n; } return 1; -- 1.8.3.1
Re: [PATCH v2 3/3] mm/slub: Use percpu partial free counter
On 3/2/21 5:14 PM, Christoph Lameter wrote: > On Mon, 10 Aug 2020, Xunlei Pang wrote: > >> >> diff --git a/mm/slab.h b/mm/slab.h >> index c85e2fa..a709a70 100644 >> --- a/mm/slab.h >> +++ b/mm/slab.h >> @@ -616,7 +616,7 @@ struct kmem_cache_node { >> #ifdef CONFIG_SLUB >> unsigned long nr_partial; >> struct list_head partial; >> -atomic_long_t partial_free_objs; >> +atomic_long_t __percpu *partial_free_objs; > > A percpu counter is never atomic. Just use unsigned long and use this_cpu > operations for this thing. That should cut down further on the overhead. > >> --- a/mm/slub.c >> +++ b/mm/slub.c >> @@ -1775,11 +1775,21 @@ static void discard_slab(struct kmem_cache *s, >> struct page *page) >> /* >> * Management of partially allocated slabs. >> */ >> +static inline long get_partial_free(struct kmem_cache_node *n) >> +{ >> +long nr = 0; >> +int cpu; >> + >> +for_each_possible_cpu(cpu) >> +nr += atomic_long_read(per_cpu_ptr(n->partial_free_objs, cpu)); > > this_cpu_read(*n->partial_free_objs) > >> static inline void >> __update_partial_free(struct kmem_cache_node *n, long delta) >> { >> -atomic_long_add(delta, &n->partial_free_objs); >> +atomic_long_add(delta, this_cpu_ptr(n->partial_free_objs)); > > this_cpu_add() > > and so on. > Thanks, I changed them both to use "unsigned long", and will send v3 out after our internal performance regression test passes.
Re: [PATCH v2 0/3] mm/slub: Fix count_partial() problem
On 3/1/21 6:31 PM, Shu Ming wrote: > Any progress on this? The problem addressed by this patch has also > made jitters to our online apps which are quite annoying. > Thanks for the attention. There's some further improvements on v2, I'm gonna send v3 out later. > On Mon, Aug 24, 2020 at 6:05 PM xunlei wrote: >> >> On 2020/8/20 下午10:02, Pekka Enberg wrote: >>> On Mon, Aug 10, 2020 at 3:18 PM Xunlei Pang >>> wrote: >>>> >>>> v1->v2: >>>> - Improved changelog and variable naming for PATCH 1~2. >>>> - PATCH3 adds per-cpu counter to avoid performance regression >>>> in concurrent __slab_free(). >>>> >>>> [Testing] >>>> On my 32-cpu 2-socket physical machine: >>>> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz >>>> perf stat --null --repeat 10 -- hackbench 20 thread 2 >>>> >>>> == original, no patched >>>> 19.211637055 seconds time elapsed >>>> ( +- 0.57% ) >>>> >>>> == patched with patch1~2 >>>> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >>>> >>>> 21.731833146 seconds time elapsed >>>> ( +- 0.17% ) >>>> >>>> == patched with patch1~3 >>>> Performance counter stats for 'hackbench 20 thread 2' (10 runs): >>>> >>>> 19.112106847 seconds time elapsed >>>> ( +- 0.64% ) >>>> >>>> >>>> Xunlei Pang (3): >>>> mm/slub: Introduce two counters for partial objects >>>> mm/slub: Get rid of count_partial() >>>> mm/slub: Use percpu partial free counter >>>> >>>> mm/slab.h | 2 + >>>> mm/slub.c | 124 >>>> +++--- >>>> 2 files changed, 89 insertions(+), 37 deletions(-) >>> >>> We probably need to wrap the counters under CONFIG_SLUB_DEBUG because >>> AFAICT all the code that uses them is also wrapped under it. >> >> /sys/kernel/slab/***/partial sysfs also uses it, I can wrap it with >> CONFIG_SLUB_DEBUG or CONFIG_SYSFS for backward compatibility. >> >>> >>> An alternative approach for this patch would be to somehow make the >>> lock in count_partial() more granular, but I don't know how feasible >>> that actually is. 
>>> >>> Anyway, I am OK with this approach: >>> >>> Reviewed-by: Pekka Enberg >> >> Thanks! >> >>> >>> You still need to convince Christoph, though, because he had >>> objections over this approach. >> >> Christoph, what do you think, or any better suggestion to address this >> *in production* issue? >> >>> >>> - Pekka >>>
[tip: sched/core] sched/fair: Fix wrong cpu selecting from isolated domain
The following commit has been merged into the sched/core branch of tip: Commit-ID: df3cb4ea1fb63ff326488efd671ba3c39034255e Gitweb: https://git.kernel.org/tip/df3cb4ea1fb63ff326488efd671ba3c39034255e Author:Xunlei Pang AuthorDate:Thu, 24 Sep 2020 14:48:47 +08:00 Committer: Peter Zijlstra CommitterDate: Fri, 25 Sep 2020 14:23:25 +02:00 sched/fair: Fix wrong cpu selecting from isolated domain We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Steps to reproduce on my 31-CPU hyperthreads machine: 1. with boot parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17) 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" 3. some threads will be migrated to the isolated cpu16~17. Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang Signed-off-by: Xunlei Pang Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Jiang Biao Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlp...@linux.alibaba.com --- kernel/sched/fair.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a15deb2..9613e5d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6080,7 +6080,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int /* * Scan the local SMT mask for idle CPUs. 
*/ -static int select_idle_smt(struct task_struct *p, int target) +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { int cpu; @@ -6088,7 +6088,8 @@ static int select_idle_smt(struct task_struct *p, int target) return -1; for_each_cpu(cpu, cpu_smt_mask(target)) { - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || + !cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) return cpu; @@ -6104,7 +6105,7 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s return -1; } -static inline int select_idle_smt(struct task_struct *p, int target) +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { return -1; } @@ -6279,7 +6280,7 @@ symmetric: if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_smt(p, target); + i = select_idle_smt(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i;
Re: [PATCH RESEND] sched/fair: Fix wrong cpu selecting from isolated domain
On 9/24/20 3:18 PM, Vincent Guittot wrote: > On Thu, 24 Sep 2020 at 08:48, Xunlei Pang wrote: >> >> We've met problems that occasionally tasks with full cpumask >> (e.g. by putting it into a cpuset or setting to full affinity) >> were migrated to our isolated cpus in production environment. >> >> After some analysis, we found that it is due to the current >> select_idle_smt() not considering the sched_domain mask. >> >> Steps to reproduce on my 31-CPU hyperthreads machine: >> 1. with boot parameter: "isolcpus=domain,2-31" >>(thread lists: 0,16 and 1,17) >> 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" >> 3. some threads will be migrated to the isolated cpu16~17. >> >> Fix it by checking the valid domain mask in select_idle_smt(). >> >> Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) >> Reported-by: Wetp Zhang >> Reviewed-by: Jiang Biao >> Signed-off-by: Xunlei Pang > > Reviewed-by: Vincent Guittot > Thanks, Vincent :-)
[PATCH RESEND] sched/fair: Fix wrong cpu selecting from isolated domain
We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Steps to reproduce on my 31-CPU hyperthreads machine: 1. with boot parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17) 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" 3. some threads will be migrated to the isolated cpu16~17. Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang Reviewed-by: Jiang Biao Signed-off-by: Xunlei Pang --- kernel/sched/fair.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1a68a05..fa942c4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int /* * Scan the local SMT mask for idle CPUs. 
*/ -static int select_idle_smt(struct task_struct *p, int target) +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { int cpu; @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int target) return -1; for_each_cpu(cpu, cpu_smt_mask(target)) { - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || + !cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) return cpu; @@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s return -1; } -static inline int select_idle_smt(struct task_struct *p, int target) +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { return -1; } @@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_smt(p, target); + i = select_idle_smt(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i; -- 1.8.3.1
Re: [PATCH] mm: memcg: yield cpu when we fail to charge pages
On 2020/9/9 2:50 AM, Julius Hemanth Pitti wrote: > For non root CG, in try_charge(), we keep trying > to charge until we succeed. On non-preemptive > kernel, when we are OOM, this results in holding > the CPU forever. > > On SMP systems, this doesn't create a big problem > because oom_reaper gets a chance to kill the victim > and make some free pages. However on a single-core > CPU (or cases where oom_reaper is pinned to the same CPU > where try_charge is executing), oom_reaper shall > never get scheduled and we stay in try_charge forever. > > Steps to repro this on non-SMP: > 1. mount -t tmpfs none /sys/fs/cgroup > 2. mkdir /sys/fs/cgroup/memory > 3. mount -t cgroup none /sys/fs/cgroup/memory -o memory > 4. mkdir /sys/fs/cgroup/memory/0 > 5. echo 40M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes > 6. echo $$ > /sys/fs/cgroup/memory/0/tasks > 7. stress -m 5 --vm-bytes 10M --vm-hang 0 > > Signed-off-by: Julius Hemanth Pitti > --- > mm/memcontrol.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0d6f3ea86738..4620d70267cb 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2652,6 +2652,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t > gfp_mask, > if (fatal_signal_pending(current)) > goto force; > > + cond_resched(); > + > /* > * keep retrying as long as the memcg oom killer is able to make > * a forward progress or bypass the charge if the oom killer > This should be fixed by: https://lkml.org/lkml/2020/8/26/1440 Thanks, Xunlei
Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
On 2020/8/24 PM8:30, Xunlei Pang wrote: > We've met problems that occasionally tasks with full cpumask > (e.g. by putting it into a cpuset or setting to full affinity) > were migrated to our isolated cpus in production environment. > > After some analysis, we found that it is due to the current > select_idle_smt() not considering the sched_domain mask. > > Fix it by checking the valid domain mask in select_idle_smt(). > > Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) > Reported-by: Wetp Zhang > Signed-off-by: Xunlei Pang > --- > kernel/sched/fair.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1a68a05..fa942c4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, > struct sched_domain *sd, int > /* > * Scan the local SMT mask for idle CPUs. > */ > -static int select_idle_smt(struct task_struct *p, int target) > +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, > int target) > { > int cpu; > > @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int > target) > return -1; > > for_each_cpu(cpu, cpu_smt_mask(target)) { > - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) > + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || > + !cpumask_test_cpu(cpu, sched_domain_span(sd))) > continue; > if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) > return cpu; > @@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct > *p, struct sched_domain *s > return -1; > } > > -static inline int select_idle_smt(struct task_struct *p, int target) > +static inline int select_idle_smt(struct task_struct *p, struct sched_domain > *sd, int target) > { > return -1; > } > @@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, > int prev, int target) > if ((unsigned)i < nr_cpumask_bits) > return i; > > - i = select_idle_smt(p, target); 
> + i = select_idle_smt(p, sd, target); > if ((unsigned)i < nr_cpumask_bits) > return i; > > Hi Peter, any other comments?
[PATCH v3] mm: memcg: Fix memcg reclaim soft lockup
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory. It can be easily reproduced as below: watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204] CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12 Call Trace: shrink_lruvec+0x49f/0x640 shrink_node+0x2a6/0x6f0 do_try_to_free_pages+0xe9/0x3e0 try_to_free_mem_cgroup_pages+0xef/0x1f0 try_charge+0x2c1/0x750 mem_cgroup_charge+0xd7/0x240 __add_to_page_cache_locked+0x2fd/0x370 add_to_page_cache_lru+0x4a/0xc0 pagecache_get_page+0x10b/0x2f0 filemap_fault+0x661/0xad0 ext4_filemap_fault+0x2c/0x40 __do_fault+0x4d/0xf9 handle_mm_fault+0x1080/0x1790 It only happens on our 1-vcpu instances, because there's no chance for the oom reaper to run to reclaim the memory of the to-be-killed process. Add a cond_resched() in the upper shrink_node_memcgs() to solve this issue; this means we get a scheduling point for each memcg in the reclaimed hierarchy, without any dependency on the reclaimable memory in that memcg, thus making it more predictable. Acked-by: Chris Down Acked-by: Michal Hocko Suggested-by: Michal Hocko Signed-off-by: Xunlei Pang --- mm/vmscan.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 99e1796..9727dd8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2615,6 +2615,14 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) unsigned long reclaimed; unsigned long scanned; + /* + * This loop can become CPU-bound when target memcgs + * aren't eligible for reclaim - either because they + * don't have any reclaimable pages, or because their + * memory is explicitly protected. Avoid soft lockups. + */ + cond_resched(); + mem_cgroup_calculate_protection(target_memcg, memcg); if (mem_cgroup_below_min(memcg)) { -- 1.8.3.1
[PATCH v2] mm: memcg: Fix memcg reclaim soft lockup
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory. It can be easily reproduced as below: watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204] CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12 Call Trace: shrink_lruvec+0x49f/0x640 shrink_node+0x2a6/0x6f0 do_try_to_free_pages+0xe9/0x3e0 try_to_free_mem_cgroup_pages+0xef/0x1f0 try_charge+0x2c1/0x750 mem_cgroup_charge+0xd7/0x240 __add_to_page_cache_locked+0x2fd/0x370 add_to_page_cache_lru+0x4a/0xc0 pagecache_get_page+0x10b/0x2f0 filemap_fault+0x661/0xad0 ext4_filemap_fault+0x2c/0x40 __do_fault+0x4d/0xf9 handle_mm_fault+0x1080/0x1790 It only happens on our 1-vcpu instances, because there's no chance for the oom reaper to run to reclaim the memory of the to-be-killed process. Add a cond_resched() in the upper shrink_node_memcgs() to solve this issue, as well as any other possible issues like memory.min protection. Suggested-by: Michal Hocko Signed-off-by: Xunlei Pang --- mm/vmscan.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 99e1796..bbdc38b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) mem_cgroup_calculate_protection(target_memcg, memcg); + cond_resched(); + if (mem_cgroup_below_min(memcg)) { /* * Hard protection. -- 1.8.3.1
Re: [PATCH] mm: memcg: Fix memcg reclaim soft lockup
On 2020/8/26 7:45 PM, xunlei wrote: > On 2020/8/26 7:00 PM, Michal Hocko wrote: >> On Wed 26-08-20 18:41:18, xunlei wrote: >>> On 2020/8/26 4:11 PM, Michal Hocko wrote: >>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote: >>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when >>>>> the target memcg doesn't have any reclaimable memory. >>>> >>>> Do you have any scenario when this happens or is this some sort of a >>>> test case? >>> >>> It can happen on tiny guest scenarios. >> >> OK, you made me more curious. If this is a tiny guest and this is a hard >> limit reclaim path then we should trigger an oom killer which should >> kill the offender and that in turn bail out from the try_charge loop >> (see should_force_charge). So how come this repeats enough in your setup >> that it causes soft lockups? >> > > oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, > get_order(nr_pages * PAGE_SIZE)); > switch (oom_status) { > case OOM_SUCCESS: > nr_retries = MAX_RECLAIM_RETRIES; Actually we can add "cond_resched()" here, but I think it's better to have one in the memcg reclaim path to avoid other unexpected issues. > goto retry; > > It retries here endlessly, because the oom reaper has no CPU to run on. >
[PATCH] mm: memcg: Fix memcg reclaim soft lockup
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory. It can be easily reproduced as below: watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204] CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12 Call Trace: shrink_lruvec+0x49f/0x640 shrink_node+0x2a6/0x6f0 do_try_to_free_pages+0xe9/0x3e0 try_to_free_mem_cgroup_pages+0xef/0x1f0 try_charge+0x2c1/0x750 mem_cgroup_charge+0xd7/0x240 __add_to_page_cache_locked+0x2fd/0x370 add_to_page_cache_lru+0x4a/0xc0 pagecache_get_page+0x10b/0x2f0 filemap_fault+0x661/0xad0 ext4_filemap_fault+0x2c/0x40 __do_fault+0x4d/0xf9 handle_mm_fault+0x1080/0x1790 It only happens on our 1-vcpu instances, because there's no chance for the oom reaper to run to reclaim the memory of the to-be-killed process. Add cond_resched() in such cases at the beginning of shrink_lruvec() to give up the CPU to others. Signed-off-by: Xunlei Pang --- mm/vmscan.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 99e1796..349a88e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2449,6 +2449,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && sc->priority == DEF_PRIORITY); + /* memcg reclaim may run into no reclaimable lru pages */ + if (nr[LRU_ACTIVE_FILE] == 0 && + nr[LRU_INACTIVE_FILE] == 0 && + nr[LRU_INACTIVE_ANON] == 0) + cond_resched(); + blk_start_plug(&plug); while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { -- 1.8.3.1
[PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang Signed-off-by: Xunlei Pang --- kernel/sched/fair.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1a68a05..fa942c4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int /* * Scan the local SMT mask for idle CPUs. */ -static int select_idle_smt(struct task_struct *p, int target) +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { int cpu; @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int target) return -1; for_each_cpu(cpu, cpu_smt_mask(target)) { - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || + !cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) return cpu; @@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct *p, struct sched_domain *s return -1; } -static inline int select_idle_smt(struct task_struct *p, int target) +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target) { return -1; } @@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if ((unsigned)i < nr_cpumask_bits) return i; - i = select_idle_smt(p, target); + i = select_idle_smt(p, sd, target); if ((unsigned)i < nr_cpumask_bits) return i; -- 1.8.3.1
[PATCH v2 2/3] mm/slub: Get rid of count_partial()
Now the partial counters are ready, let's use them directly and get rid of count_partial(). Co-developed-by: Wen Yang Signed-off-by: Xunlei Pang --- mm/slub.c | 57 - 1 file changed, 24 insertions(+), 33 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 6708d96..25a4421 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2414,11 +2414,6 @@ static inline int node_match(struct page *page, int node) } #ifdef CONFIG_SLUB_DEBUG -static int count_free(struct page *page) -{ - return page->objects - page->inuse; -} - static inline unsigned long node_nr_objs(struct kmem_cache_node *n) { return atomic_long_read(&n->total_objects); @@ -2426,19 +2421,27 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n) #endif /* CONFIG_SLUB_DEBUG */ #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) -static unsigned long count_partial(struct kmem_cache_node *n, - int (*get_count)(struct page *)) -{ - unsigned long flags; - unsigned long x = 0; - struct page *page; +enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL }; + +static unsigned long partial_counter(struct kmem_cache_node *n, + enum partial_item item) +{ + unsigned long ret = 0; + + if (item == PARTIAL_FREE) { + ret = atomic_long_read(&n->partial_free_objs); + } else if (item == PARTIAL_TOTAL) { + ret = atomic_long_read(&n->partial_total_objs); + } else if (item == PARTIAL_INUSE) { + ret = atomic_long_read(&n->partial_total_objs) - + atomic_long_read(&n->partial_free_objs); + if ((long)ret < 0) + ret = 0; + } - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, slab_list) - x += get_count(page); - spin_unlock_irqrestore(&n->list_lock, flags); - return x; + return ret; } + #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */ static noinline void @@ -2468,7 +2471,7 @@ static unsigned long count_partial(struct kmem_cache_node *n, unsigned long nr_objs; unsigned long nr_free; - nr_free = count_partial(n, count_free); + nr_free = partial_counter(n, PARTIAL_FREE); nr_slabs = node_nr_slabs(n); nr_objs = 
node_nr_objs(n); @@ -,18 +4447,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, } #endif -#ifdef CONFIG_SYSFS -static int count_inuse(struct page *page) -{ - return page->inuse; -} - -static int count_total(struct page *page) -{ - return page->objects; -} -#endif - #ifdef CONFIG_SLUB_DEBUG static void validate_slab(struct kmem_cache *s, struct page *page) { @@ -4912,7 +4903,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, x = atomic_long_read(&n->total_objects); else if (flags & SO_OBJECTS) x = atomic_long_read(&n->total_objects) - - count_partial(n, count_free); + partial_counter(n, PARTIAL_FREE); else x = atomic_long_read(&n->nr_slabs); total += x; @@ -4926,9 +4917,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s, for_each_kmem_cache_node(s, node, n) { if (flags & SO_TOTAL) - x = count_partial(n, count_total); + x = partial_counter(n, PARTIAL_TOTAL); else if (flags & SO_OBJECTS) - x = count_partial(n, count_inuse); + x = partial_counter(n, PARTIAL_INUSE); else x = n->nr_partial; total += x; @@ -5961,7 +5952,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo) for_each_kmem_cache_node(s, node, n) { nr_slabs += node_nr_slabs(n); nr_objs += node_nr_objs(n); - nr_free += count_partial(n, count_free); + nr_free += partial_counter(n, PARTIAL_FREE); } sinfo->active_objs = nr_objs - nr_free; -- 1.8.3.1
[PATCH v2 0/3] mm/slub: Fix count_partial() problem
v1->v2: - Improved changelog and variable naming for PATCH 1~2. - PATCH3 adds per-cpu counter to avoid performance regression in concurrent __slab_free(). [Testing] On my 32-cpu 2-socket physical machine: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz perf stat --null --repeat 10 -- hackbench 20 thread 2 == original, no patched 19.211637055 seconds time elapsed ( +- 0.57% ) == patched with patch1~2 Performance counter stats for 'hackbench 20 thread 2' (10 runs): 21.731833146 seconds time elapsed ( +- 0.17% ) == patched with patch1~3 Performance counter stats for 'hackbench 20 thread 2' (10 runs): 19.112106847 seconds time elapsed ( +- 0.64% ) Xunlei Pang (3): mm/slub: Introduce two counters for partial objects mm/slub: Get rid of count_partial() mm/slub: Use percpu partial free counter mm/slab.h | 2 + mm/slub.c | 124 +++--- 2 files changed, 89 insertions(+), 37 deletions(-) -- 1.8.3.1
[PATCH v2 3/3] mm/slub: Use percpu partial free counter
The only concern with introducing the partial counters is that partial_free_objs may cause atomic-operation contention when the same SLUB cache sees concurrent __slab_free(). This patch changes it to a percpu counter to avoid that. Co-developed-by: Wen Yang Signed-off-by: Xunlei Pang --- mm/slab.h | 2 +- mm/slub.c | 38 +++--- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/mm/slab.h b/mm/slab.h index c85e2fa..a709a70 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -616,7 +616,7 @@ struct kmem_cache_node { #ifdef CONFIG_SLUB unsigned long nr_partial; struct list_head partial; - atomic_long_t partial_free_objs; + atomic_long_t __percpu *partial_free_objs; atomic_long_t partial_total_objs; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; diff --git a/mm/slub.c b/mm/slub.c index 25a4421..f6fc60b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1775,11 +1775,21 @@ static void discard_slab(struct kmem_cache *s, struct page *page) /* * Management of partially allocated slabs. */ +static inline long get_partial_free(struct kmem_cache_node *n) +{ + long nr = 0; + int cpu; + + for_each_possible_cpu(cpu) + nr += atomic_long_read(per_cpu_ptr(n->partial_free_objs, cpu)); + + return nr; +} static inline void __update_partial_free(struct kmem_cache_node *n, long delta) { - atomic_long_add(delta, &n->partial_free_objs); + atomic_long_add(delta, this_cpu_ptr(n->partial_free_objs)); } static inline void @@ -2429,12 +2439,12 @@ static unsigned long partial_counter(struct kmem_cache_node *n, unsigned long ret = 0; if (item == PARTIAL_FREE) { - ret = atomic_long_read(&n->partial_free_objs); + ret = get_partial_free(n); } else if (item == PARTIAL_TOTAL) { ret = atomic_long_read(&n->partial_total_objs); } else if (item == PARTIAL_INUSE) { ret = atomic_long_read(&n->partial_total_objs) - - atomic_long_read(&n->partial_free_objs); + get_partial_free(n); if ((long)ret < 0) ret = 0; } @@ -3390,19 +3400,28 @@ static inline int calculate_order(unsigned int size) return -ENOSYS; } -static void +static int 
init_kmem_cache_node(struct kmem_cache_node *n) { + int cpu; + n->nr_partial = 0; spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); - atomic_long_set(&n->partial_free_objs, 0); + + n->partial_free_objs = alloc_percpu(atomic_long_t); + if (!n->partial_free_objs) + return -ENOMEM; + for_each_possible_cpu(cpu) + atomic_long_set(per_cpu_ptr(n->partial_free_objs, cpu), 0); atomic_long_set(&n->partial_total_objs, 0); #ifdef CONFIG_SLUB_DEBUG atomic_long_set(&n->nr_slabs, 0); atomic_long_set(&n->total_objects, 0); INIT_LIST_HEAD(&n->full); #endif + + return 0; } static inline int alloc_kmem_cache_cpus(struct kmem_cache *s) @@ -3463,7 +3482,7 @@ static void early_kmem_cache_node_alloc(int node) page->inuse = 1; page->frozen = 0; kmem_cache_node->node[node] = n; - init_kmem_cache_node(n); + BUG_ON(init_kmem_cache_node(n) < 0); inc_slabs_node(kmem_cache_node, node, page->objects); /* @@ -3481,6 +3500,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s) for_each_kmem_cache_node(s, node, n) { s->node[node] = NULL; + free_percpu(n->partial_free_objs); kmem_cache_free(kmem_cache_node, n); } } @@ -3511,7 +3531,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s) return 0; } - init_kmem_cache_node(n); + if (init_kmem_cache_node(n) < 0) { + free_kmem_cache_nodes(s); + return 0; + } + s->node[node] = n; } return 1; -- 1.8.3.1
[PATCH v2 1/3] mm/slub: Introduce two counters for partial objects
The node list_lock in count_partial() spends a long time iterating when there are large numbers of partial page lists, which can cause a thundering-herd effect on list_lock contention. We have HSF RT (High-speed Service Framework Response-Time) monitors; the RT figures fluctuated randomly, so we deployed a tool detecting "irq off" and "preempt off" periods to dump the culprit's calltrace. It captured the list_lock being held for nearly 100ms with irqs off, triggered by "ss", which also caused network timeouts. This patch introduces two counters to maintain the actual number of partial objects dynamically instead of iterating the partial page lists with list_lock held. New counters of kmem_cache_node: partial_free_objs, partial_total_objs. The main operations are under list_lock in slow path, so the performance impact should be minimal except for the __slab_free() path, which will be addressed later. Acked-by: Pekka Enberg Co-developed-by: Wen Yang Signed-off-by: Xunlei Pang --- mm/slab.h | 2 ++ mm/slub.c | 37 - 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/mm/slab.h b/mm/slab.h index 7e94700..c85e2fa 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -616,6 +616,8 @@ struct kmem_cache_node { #ifdef CONFIG_SLUB unsigned long nr_partial; struct list_head partial; + atomic_long_t partial_free_objs; + atomic_long_t partial_total_objs; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; diff --git a/mm/slub.c b/mm/slub.c index 6589b41..6708d96 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1775,10 +1775,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page) /* * Management of partially allocated slabs. 
*/ + +static inline void +__update_partial_free(struct kmem_cache_node *n, long delta) +{ + atomic_long_add(delta, &n->partial_free_objs); +} + +static inline void +__update_partial_total(struct kmem_cache_node *n, long delta) +{ + atomic_long_add(delta, &n->partial_total_objs); +} + static inline void __add_partial(struct kmem_cache_node *n, struct page *page, int tail) { n->nr_partial++; + __update_partial_total(n, page->objects); if (tail == DEACTIVATE_TO_TAIL) list_add_tail(&page->slab_list, &n->partial); else @@ -1798,6 +1812,7 @@ static inline void remove_partial(struct kmem_cache_node *n, lockdep_assert_held(&n->list_lock); list_del(&page->slab_list); n->nr_partial--; + __update_partial_total(n, -page->objects); } /* @@ -1842,6 +1857,7 @@ static inline void *acquire_slab(struct kmem_cache *s, return NULL; remove_partial(n, page); + __update_partial_free(n, -*objects); WARN_ON(!freelist); return freelist; } @@ -2174,8 +2190,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page, "unfreezing slab")) goto redo; - if (lock) + if (lock) { + if (m == M_PARTIAL) + __update_partial_free(n, page->objects - page->inuse); spin_unlock(&n->list_lock); + } if (m == M_PARTIAL) stat(s, tail); @@ -2241,6 +2260,7 @@ static void unfreeze_partials(struct kmem_cache *s, discard_page = page; } else { add_partial(n, page, DEACTIVATE_TO_TAIL); + __update_partial_free(n, page->objects - page->inuse); stat(s, FREE_ADD_PARTIAL); } } @@ -2915,6 +2935,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page, head, new.counters, "__slab_free")); + if (!was_frozen && prior) { + if (n) + __update_partial_free(n, cnt); + else + __update_partial_free(get_node(s, page_to_nid(page)), cnt); + } + if (likely(!n)) { /* @@ -2944,6 +2971,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) { remove_full(s, n, page); add_partial(n, page, DEACTIVATE_TO_TAIL); + __update_partial_free(n, page->objects - page->inuse); stat(s, 
FREE_ADD_PARTIAL); } spin_unlock_irqrestore(&n->list_lock, flags); @@ -2955,6 +2983,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, * Slab on the partial list. */ remove_partial(n, page); + __update_partial_free(n, page->inuse - page->objects); stat(s, FREE_REMOVE_PARTIAL); } else { /* Slab must be on the full list */ @@ -3364,6 +3393,8 @@ static inlin
[PATCH 2/2] mm/slub: Get rid of count_partial()
Now the partial counters are ready, let's use them directly and get rid of count_partial(). Co-developed-by: Wen Yang Signed-off-by: Xunlei Pang --- mm/slub.c | 57 - 1 file changed, 24 insertions(+), 33 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 53890f3..f1946ed 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2414,11 +2414,6 @@ static inline int node_match(struct page *page, int node) } #ifdef CONFIG_SLUB_DEBUG -static int count_free(struct page *page) -{ - return page->objects - page->inuse; -} - static inline unsigned long node_nr_objs(struct kmem_cache_node *n) { return atomic_long_read(&n->total_objects); @@ -2426,19 +2421,27 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n) #endif /* CONFIG_SLUB_DEBUG */ #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS) -static unsigned long count_partial(struct kmem_cache_node *n, - int (*get_count)(struct page *)) -{ - unsigned long flags; - unsigned long x = 0; - struct page *page; +enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL }; + +static unsigned long partial_counter(struct kmem_cache_node *n, + enum partial_item item) +{ + unsigned long ret = 0; + + if (item == PARTIAL_FREE) { + ret = atomic_long_read(&n->pfree_objects); + } else if (item == PARTIAL_TOTAL) { + ret = atomic_long_read(&n->total_objects); + } else if (item == PARTIAL_INUSE) { + ret = atomic_long_read(&n->total_objects) - + atomic_long_read(&n->pfree_objects); + if ((long)ret < 0) + ret = 0; + } - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, slab_list) - x += get_count(page); - spin_unlock_irqrestore(&n->list_lock, flags); - return x; + return ret; } + #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */ static noinline void @@ -2468,7 +2471,7 @@ static unsigned long count_partial(struct kmem_cache_node *n, unsigned long nr_objs; unsigned long nr_free; - nr_free = count_partial(n, count_free); + nr_free = partial_counter(n, PARTIAL_FREE); nr_slabs = node_nr_slabs(n); nr_objs = node_nr_objs(n); @@ 
-4445,18 +4448,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, } #endif -#ifdef CONFIG_SYSFS -static int count_inuse(struct page *page) -{ - return page->inuse; -} - -static int count_total(struct page *page) -{ - return page->objects; -} -#endif - #ifdef CONFIG_SLUB_DEBUG static void validate_slab(struct kmem_cache *s, struct page *page) { @@ -4913,7 +4904,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, x = atomic_long_read(&n->total_objects); else if (flags & SO_OBJECTS) x = atomic_long_read(&n->total_objects) - - count_partial(n, count_free); + partial_counter(n, PARTIAL_FREE); else x = atomic_long_read(&n->nr_slabs); total += x; @@ -4927,9 +4918,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s, for_each_kmem_cache_node(s, node, n) { if (flags & SO_TOTAL) - x = count_partial(n, count_total); + x = partial_counter(n, PARTIAL_TOTAL); else if (flags & SO_OBJECTS) - x = count_partial(n, count_inuse); + x = partial_counter(n, PARTIAL_INUSE); else x = n->nr_partial; total += x; @@ -5962,7 +5953,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo) for_each_kmem_cache_node(s, node, n) { nr_slabs += node_nr_slabs(n); nr_objs += node_nr_objs(n); - nr_free += count_partial(n, count_free); + nr_free += partial_counter(n, PARTIAL_FREE); } sinfo->active_objs = nr_objs - nr_free; -- 1.8.3.1
[PATCH 1/2] mm/slub: Introduce two counters for the partial objects
count_partial() iterates the node's partial page lists for a long time with list_lock held when there is a large number of partial pages, which can cause a thundering-herd effect on list_lock contention, e.g. it causes business response-time jitters when "/proc/slabinfo" is accessed in our production environments.

This patch introduces two counters to maintain the actual number of partial objects dynamically, instead of iterating the partial page lists with list_lock held.

New counters of kmem_cache_node: pfree_objects, ptotal_objects.

The main operations are under list_lock in the slow path, so the performance impact is minimal.

Co-developed-by: Wen Yang
Signed-off-by: Xunlei Pang
---
 mm/slab.h |  2 ++
 mm/slub.c | 38 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 7e94700..5935749 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -616,6 +616,8 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
 	unsigned long nr_partial;
 	struct list_head partial;
+	atomic_long_t pfree_objects;	/* partial free objects */
+	atomic_long_t ptotal_objects;	/* partial total objects */
 #ifdef CONFIG_SLUB_DEBUG
 	atomic_long_t nr_slabs;
 	atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 6589b41..53890f3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1775,10 +1775,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 
 /*
  * Management of partially allocated slabs.
  */
+
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+	atomic_long_add(delta, &n->pfree_objects);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+	atomic_long_add(delta, &n->ptotal_objects);
+}
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	n->nr_partial++;
+	__update_partial_total(n, page->objects);
 	if (tail == DEACTIVATE_TO_TAIL)
 		list_add_tail(&page->slab_list, &n->partial);
 	else
@@ -1798,6 +1812,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 	lockdep_assert_held(&n->list_lock);
 	list_del(&page->slab_list);
 	n->nr_partial--;
+	__update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1842,6 +1857,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
 		return NULL;
 
 	remove_partial(n, page);
+	__update_partial_free(n, -*objects);
 	WARN_ON(!freelist);
 	return freelist;
 }
@@ -2174,8 +2190,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
 				"unfreezing slab"))
 		goto redo;
 
-	if (lock)
+	if (lock) {
+		if (m == M_PARTIAL)
+			__update_partial_free(n, page->objects - page->inuse);
 		spin_unlock(&n->list_lock);
+	}
 
 	if (m == M_PARTIAL)
 		stat(s, tail);
@@ -2241,6 +2260,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 			discard_page = page;
 		} else {
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			__update_partial_free(n, page->objects - page->inuse);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 	}
@@ -2915,6 +2935,14 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		head, new.counters,
 		"__slab_free"));
 
+	if (!was_frozen && prior) {
+		if (n)
+			__update_partial_free(n, cnt);
+		else
+			__update_partial_free(get_node(s, page_to_nid(page)),
+					cnt);
+	}
+
 	if (likely(!n)) {
 
 		/*
@@ -2944,6 +2972,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
 			remove_full(s, n, page);
 			add_partial(n, page, DEACTIVATE_TO_TAIL);
+			__update_partial_free(n, page->objects - page->inuse);
 			stat(s, FREE_ADD_PARTIAL);
 		}
 		spin_unlock_irqrestore(&n->list_lock, flags);
@@ -2955,6 +2984,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * Slab on the partial list.
 		 */
 		remove_partial(n, page);
+		__update_partial_free(n, page->inuse - page->objects);
 		stat(s, FREE_REMOVE_PARTIAL);
 	} else {
 		/* Slab must be on the full list */
@@ -3364,6 +3394,8 @@ static inline int calculate_order(unsigned int size)
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
 	INIT_LIST_HEAD(&n->partial);
+	atomic_long_set(&n->pfree_objects, 0);
+	atomic_long_set(&n->ptotal_objects, 0);
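The counter scheme in the two patches above can be exercised outside the kernel. Below is a hypothetical userspace model (a sketch for illustration, not kernel code; the names `node_model`/`slab_model` are invented, and the kernel's list and locking details are elided): a node keeps two atomics that are updated with signed deltas as slabs enter and leave the partial list, so readers can derive free/total/inuse counts without walking the list under list_lock.

```c
#include <stdatomic.h>

/* Hypothetical model of a node's partial-list counters (not kernel code). */
struct node_model {
	atomic_long pfree;	/* free objects on partial slabs */
	atomic_long ptotal;	/* total objects on partial slabs */
};

struct slab_model {
	long objects;	/* object capacity of the slab */
	long inuse;	/* allocated objects */
};

/* Mirrors __add_partial(): the whole slab's capacity joins the totals. */
static void model_add_partial(struct node_model *n, const struct slab_model *s)
{
	atomic_fetch_add(&n->ptotal, s->objects);
	atomic_fetch_add(&n->pfree, s->objects - s->inuse);
}

/* Mirrors remove_partial() plus the free-count adjustment. */
static void model_remove_partial(struct node_model *n, const struct slab_model *s)
{
	atomic_fetch_add(&n->ptotal, -s->objects);
	atomic_fetch_add(&n->pfree, -(s->objects - s->inuse));
}

/* Inuse is derived, clamped at 0 like PARTIAL_INUSE in partial_counter(). */
static long model_partial_inuse(struct node_model *n)
{
	long inuse = atomic_load(&n->ptotal) - atomic_load(&n->pfree);

	return inuse < 0 ? 0 : inuse;
}
```

The clamp matters because the two atomics are not read under a common lock, so a reader can transiently observe `pfree > ptotal`.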
Re: [PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()
Hi Chris,

On 2019/6/16 PM 6:37, Chris Down wrote:
> Hi Xunlei,
>
> Xunlei Pang writes:
>> docker and various types(different memory capacity) of containers
>> are managed by k8s, it's a burden for k8s to maintain those dynamic
>> figures, simply set "max" to key containers is always welcome.
>
> Right, setting "max" is generally a fine way of going about it.
>
>> Set "max" to docker also protects docker cgroup memory(as docker
>> itself has tasks) unnecessarily.
>
> That's not correct -- leaf memcgs have to _explicitly_ request memory
> protection. From the documentation:
>
>   memory.low
>
>   [...]
>
>   Best-effort memory protection. If the memory usages of a
>   cgroup and all its ancestors are below their low boundaries,
>   the cgroup's memory won't be reclaimed unless memory can be
>   reclaimed from unprotected cgroups.
>
> Note the part that the cgroup itself also must be within its low
> boundary, which is not implied simply by having ancestors that would
> permit propagation of protections.
>
> In this case, Docker just shouldn't request it for those Docker-related
> tasks, and they won't get any. That seems a lot simpler and more
> intuitive than special casing "0" in ancestors.
>
>> This patch doesn't take effect on any intermediate layer with
>> positive memory.min set, it requires all the ancestors having
>> 0 memory.min to work.
>>
>> Nothing special change, but more flexible to business deployment...
>
> Not so, this change is extremely "special". It violates the basic
> expectation that 0 means no possibility of propagation of protection,
> and I still don't see a compelling argument why Docker can't just set
> "max" in the intermediate cgroup and not accept any protection in leaf
> memcgs that it doesn't want protection for.

I see the reason now: I'm using cgroup v1 (with a memory.min backport), which permits tasks to exist in the "docker" cgroup.procs. For cgroup v2, it's not a problem.

Thanks,
Xunlei
Re: [PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()
Hi Chris,

On 2019/6/16 AM 12:08, Chris Down wrote:
> Hi Xunlei,
>
> Xunlei Pang writes:
>> Currently memory.min|low implementation requires the whole
>> hierarchy has the settings, otherwise the protection will
>> be broken.
>>
>> Our hierarchy is kind of like (memory.min value in brackets):
>>
>>        root
>>         |
>>     docker(0)
>>      /      \
>>  c1(max)   c2(0)
>>
>> Note that "docker" doesn't set memory.min. When kswapd runs,
>> mem_cgroup_protected() returns "0" emin for "c1" due to "0"
>> @parent_emin of "docker", as a result "c1" gets reclaimed.
>>
>> But it's hard to maintain parent's "memory.min" when there're
>> uncertain protected children because only some important types
>> of containers need the protection. Further, control tasks
>> belonging to parent constantly reproduce trivial memory which
>> should not be protected at all. It makes sense to ignore
>> unprotected parent in this scenario to achieve the flexibility.
>
> I'm really confused by this, why don't you just set memory.{min,low} in
> the docker cgroup and only propagate it to the children that want it?
>
> If you only want some children to have the protection, only request it
> in those children, or create an additional intermediate layer of the
> cgroup hierarchy with protections further limited if you don't trust the
> task to request the right amount.
>
> Breaking the requirement for hierarchical propagation of protections
> seems like a really questionable API change, not least because it makes
> it harder to set systemwide policies about the constraints of
> protections within a subtree.

Docker and various types (with different memory capacities) of containers are managed by k8s; it's a burden for k8s to maintain those dynamic figures, so simply setting "max" on the key containers is always welcome. Setting "max" on "docker" also protects the docker cgroup's own memory (as docker itself has tasks) unnecessarily.
This patch doesn't take effect on any intermediate layer with a positive memory.min set; it requires all the ancestors to have a memory.min of 0 to work. There is no special change otherwise, but it is more flexible for business deployment...
Re: [PATCH] psi: Don't account force reclaim as memory pressure
Hi Chris,

On 2019/6/15 PM 11:58, Chris Down wrote:
> Hi Xunlei,
>
> Xunlei Pang writes:
>> There're several cases like resize and force_empty that don't
>> need to account to psi, otherwise is misleading.
>
> I'm afraid I'm quite confused by this patch. Why do you think accounting
> for force reclaim in PSI is misleading? I completely expect that force
> reclaim should still be accounted for as memory pressure, can you
> present some reason why it shouldn't be?

We expect psi to stand for the negative factors that affect applications' response time, but force reclaims are behaviours triggered on purpose, like "/proc/sys/vm/drop_caches", not real negative pressure.

E.g. my module force-reclaims dead memcgs: there's no application attached to them, and their memory (page caches) is usually useless, so force reclaiming them doesn't mean the system or the parent memcg is under memory pressure, while actually the whole system or the parent memcg may have plenty of free memory.

If the force reclaim causes further memory pressure, like hot page-cache misses, then the workingset-refault psi will catch that.
[PATCH] psi: Don't account force reclaim as memory pressure
There're several cases, like resize and force_empty, that don't need to be accounted to psi; otherwise it is misleading.

We also have a module reclaiming dying memcgs in the background to avoid too many dead memcgs, which can cause lots of trouble; that makes the psi inaccuracy even worse without this patch.

Signed-off-by: Xunlei Pang
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c      | 13 +++++++++----
 mm/vmscan.c          |  9 ++++++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..74b5443877d4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -354,7 +354,8 @@ extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
						  unsigned long nr_pages,
						  gfp_t gfp_mask,
-						  bool may_swap);
+						  bool may_swap,
+						  bool force_reclaim);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
						gfp_t gfp_mask, bool noswap,
						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1dfa651f55d..f4ec57876ada 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2237,7 +2237,8 @@ static void reclaim_high(struct mem_cgroup *memcg,
		if (page_counter_read(&memcg->memory) <= memcg->high)
			continue;
		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages,
+					     gfp_mask, true, false);
	} while ((memcg = parent_mem_cgroup(memcg)));
 }
@@ -2330,7 +2331,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
	memcg_memory_event(mem_over_limit, MEMCG_MAX);
	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, may_swap);
+						    gfp_mask, may_swap, false);
	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		goto retry;
@@ -2860,7 +2861,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
		}
		if (!try_to_free_mem_cgroup_pages(memcg, 1,
-					GFP_KERNEL, !memsw)) {
+					GFP_KERNEL, !memsw, true)) {
			ret = -EBUSY;
			break;
		}
@@ -2993,7 +2994,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
			return -EINTR;
		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
+							GFP_KERNEL, true, true);
		if (!progress) {
			nr_retries--;
			/* maybe some writeback is necessary */
@@ -5549,7 +5550,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
	nr_pages = page_counter_read(&memcg->memory);
	if (nr_pages > high)
		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					     GFP_KERNEL, true);
+					     GFP_KERNEL, true, true);
	memcg_wb_domain_size_changed(memcg);
	return nbytes;
@@ -5596,7 +5597,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
		if (nr_reclaims) {
			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-							  GFP_KERNEL, true))
+							  GFP_KERNEL, true, true))
				nr_reclaims--;
			continue;
		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7acd0afdfc2a..3831848fca5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3212,7 +3212,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
					   unsigned long nr_pages,
					   gfp_t gfp_mask,
-					   bool may_swap)
+					   bool may_swap,
+					   bool force_reclaim)
 {
	struct zonelist *zonelist;
	unsigned long nr_reclaimed;
@@ -3243,13 +3244,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
	trace_mm_vmscan_memcg_reclaim_be
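The mechanical pattern of the patch, threading a `force_reclaim` flag down to the point where psi accounting would start, can be sketched as a standalone model. This is a hypothetical illustration (the `_model` names are invented; the real kernel wraps the reclaim path with `psi_memstall_enter()`/`psi_memstall_leave()` elsewhere):

```c
#include <stdbool.h>

/*
 * Hypothetical model (not kernel code): count how often "memory pressure"
 * gets accounted, and skip the accounting for forced reclaim such as
 * resize/force_empty, mirroring the intent of the patch above.
 */
static long psi_memstall_events;

static void psi_memstall_enter_model(void)
{
	psi_memstall_events++;
}

static void psi_memstall_leave_model(void)
{
}

/* Stand-in for the actual reclaim work; pretend everything is reclaimed. */
static long do_reclaim_model(long nr_pages)
{
	return nr_pages;
}

static long try_to_free_pages_model(long nr_pages, bool force_reclaim)
{
	long reclaimed;

	if (!force_reclaim)
		psi_memstall_enter_model();
	reclaimed = do_reclaim_model(nr_pages);
	if (!force_reclaim)
		psi_memstall_leave_model();
	return reclaimed;
}
```

The reclaim itself is identical in both cases; only the pressure accounting is conditional, which is why the patch can simply append a boolean parameter at every call site.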
[PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()
Currently the memory.min|low implementation requires the whole hierarchy to have the settings, otherwise the protection will be broken.

Our hierarchy is kind of like (memory.min value in brackets):

       root
        |
    docker(0)
     /      \
 c1(max)   c2(0)

Note that "docker" doesn't set memory.min. When kswapd runs, mem_cgroup_protected() returns "0" emin for "c1" due to the "0" @parent_emin of "docker", as a result "c1" gets reclaimed.

But it's hard to maintain the parent's "memory.min" when there're uncertain protected children, because only some important types of containers need the protection. Further, control tasks belonging to the parent constantly reproduce trivial memory which should not be protected at all. It makes sense to ignore an unprotected parent in this scenario to achieve the flexibility.

In order not to break the previous hierarchical behaviour, only ignore the parent when there's no protected ancestor upwards the hierarchy.

Signed-off-by: Xunlei Pang
---
 include/linux/page_counter.h |  2 ++
 mm/memcontrol.c              |  5 +++++
 mm/page_counter.c            | 24 ++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57f659b..aed7ed28b458 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -55,6 +55,8 @@ bool page_counter_try_charge(struct page_counter *counter,
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+bool page_counter_has_min(struct page_counter *counter);
+bool page_counter_has_low(struct page_counter *counter);
 int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_memparse(const char *buf, const char *max,
			  unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ca0bc6e6be13..f1dfa651f55d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5917,6 +5917,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
	if (parent == root)
		goto exit;
 
+	if (!page_counter_has_min(&parent->memory))
+		goto elow;
	parent_emin = READ_ONCE(parent->memory.emin);
	emin = min(emin, parent_emin);
	if (emin && parent_emin) {
@@ -5931,6 +5933,9 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
				    siblings_min_usage);
	}
 
+elow:
+	if (!page_counter_has_low(&parent->memory))
+		goto exit;
	parent_elow = READ_ONCE(parent->memory.elow);
	elow = min(elow, parent_elow);
	if (elow && parent_elow) {
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..8c668eae2af5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -202,6 +202,30 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
	}
 }
 
+bool page_counter_has_min(struct page_counter *counter)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		if (c->min)
+			return true;
+	}
+
+	return false;
+}
+
+bool page_counter_has_low(struct page_counter *counter)
+{
+	struct page_counter *c;
+
+	for (c = counter; c; c = c->parent) {
+		if (c->low)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * page_counter_set_min - set the amount of protected memory
  * @counter: counter
-- 
2.14.4.44.g2045bb6
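The ancestor walk that decides whether any protection exists up the hierarchy can be modeled in isolation. The sketch below is a hypothetical userspace version (the `counter_model` type is invented for illustration; the real `page_counter` carries usage tracking and atomics as well), reproducing the "root/docker/c1" example from the commit message:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (not kernel code) of a page_counter hierarchy. */
struct counter_model {
	unsigned long min;
	unsigned long low;
	struct counter_model *parent;
};

/*
 * Mirrors page_counter_has_min(): walk from the counter up through its
 * ancestors and report whether any level has a min protection set.
 */
static bool counter_has_min(struct counter_model *counter)
{
	struct counter_model *c;

	for (c = counter; c; c = c->parent)
		if (c->min)
			return true;
	return false;
}

/* Same walk for the low protection. */
static bool counter_has_low(struct counter_model *counter)
{
	struct counter_model *c;

	for (c = counter; c; c = c->parent)
		if (c->low)
			return true;
	return false;
}
```

In the patch, the check is made on the parent's counter: when neither the parent nor any of its ancestors has memory.min set, the parent's emin clamping is skipped, so "c1(max)" stays protected even though "docker" is 0.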
Re: [PATCH] sched/fair: don't push cfs_bandwith slack timers forward
On 2019/6/6 AM 4:06, bseg...@google.com wrote:
> When a cfs_rq sleeps and returns its quota, we delay for 5ms before
> waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
> sleep, as this has has to be done outside of the rq lock we hold.

two "has".

> The current code waits for 5ms without any sleeps, instead of waiting
> for 5ms from the first sleep, which can delay the unthrottle more than
> we want. Switch this around so that we can't push this forward forever.
>
> This requires an extra flag rather than using hrtimer_active, since we
> need to start a new timer if the current one is in the process of
> finishing.
>
> Signed-off-by: Ben Segall
> ---

We've also suffered from this performance issue recently:

Reviewed-by: Xunlei Pang

> kernel/sched/fair.c  | 7 +++++++
> kernel/sched/sched.h | 1 +
> 2 files changed, 8 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8213ff6e365d..2ead252cfa32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
> 	if (runtime_refresh_within(cfs_b, min_left))
> 		return;
>
> +	/* don't push forwards an existing deferred unthrottle */
> +	if (cfs_b->slack_started)
> +		return;
> +	cfs_b->slack_started = true;
> +
> 	hrtimer_start(&cfs_b->slack_timer,
> 			ns_to_ktime(cfs_bandwidth_slack_period),
> 			HRTIMER_MODE_REL);
> @@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>
> 	/* confirm we're still not at a refresh boundary */
> 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> +	cfs_b->slack_started = false;
> 	if (cfs_b->distribute_running) {
> 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> 		return;
> @@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> 	cfs_b->distribute_running = 0;
> +	cfs_b->slack_started = false;
> }
>
> static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efa686eeff26..60219acda94b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -356,6 +356,7 @@ struct cfs_bandwidth {
> 	u64			throttled_time;
>
> 	bool			distribute_running;
> +	bool			slack_started;
> #endif
> };
>
>
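The flag-based "don't rearm a pending deferred timer" idiom in the patch above is a general pattern. Here is a hypothetical standalone model of it (the `_model` names are invented; the real code arms an hrtimer and clears the flag under `cfs_b->lock`): arming is a no-op while a timer is already pending, so repeated sleeps cannot push the deferred unthrottle forward forever.

```c
#include <stdbool.h>

/* Hypothetical model (not kernel code) of the slack_started flag. */
struct bandwidth_model {
	bool slack_started;	/* a deferred unthrottle timer is pending */
	int timer_starts;	/* how many times the timer was actually armed */
};

static void start_slack_timer_model(struct bandwidth_model *b)
{
	/* don't push forwards an existing deferred unthrottle */
	if (b->slack_started)
		return;
	b->slack_started = true;
	b->timer_starts++;	/* stands in for hrtimer_start() */
}

static void slack_timer_fired_model(struct bandwidth_model *b)
{
	/* from here on, a new deferred timer may be armed again */
	b->slack_started = false;
}
```

This is why a separate flag is needed instead of `hrtimer_active()`: the flag is cleared at the start of the handler, so a new timer can legitimately be started while the current one is still finishing.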
Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()
Hi Roman,

On 2018/12/4 AM 2:00, Roman Gushchin wrote:
> On Mon, Dec 03, 2018 at 04:01:17PM +0800, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min other than 0.
>> Apply the same for low.
>>
>> Signed-off-by: Xunlei Pang
>> ---
>> mm/page_counter.c | 12 ++--------
>> 1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
>> 		return;
>>
>> 	if (c->min || atomic_long_read(&c->min_usage)) {
>> -		if (usage <= c->min)
>> -			protected = usage;
>> -		else
>> -			protected = 0;
>> -
>> +		protected = min(usage, c->min);
>
> This change makes sense in the combination with the patch 3, but not as a
> standalone "fix". It's not a bug, it's a required thing unless you start
> scanning proportionally to memory.low/min excess.
>
> Please, reflect this in the commit message. Or, even better, merge it into
> the patch 3.

The more I looked, the more I think it's a bug, but anyway I'm fine with merging it into patch 3 :-)

> Also, please, make sure that cgroup kselftests are passing after your changes.

Sure, will do and send v2. Thanks for your inputs.
Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory
On 2018/12/4 PM 3:25, Michal Hocko wrote:
> On Tue 04-12-18 10:40:29, Xunlei Pang wrote:
>> On 2018/12/4 AM 1:22, Michal Hocko wrote:
>>> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>>>> On 2018/12/3 PM 7:56, Michal Hocko wrote:
>>>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>>>> There may be cgroup memory overcommitment, it will become
>>>>>> even common in the future.
>>>>>>
>>>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>>>> of memory pressure, to mitigate the global direct reclaim
>>>>>> pressures which could cause jitters to the response time of
>>>>>> latency-sensitive groups.
>>>>>
>>>>> Please be more descriptive about the problem you are trying to handle
>>>>> here. I haven't actually read the patch but let me emphasise that the
>>>>> low limit protection is important isolation tool. And allowing kswapd to
>>>>> reclaim protected memcgs is going to break the semantic as it has been
>>>>> introduced and designed.
>>>>
>>>> We have two types of memcgs: online groups (important business)
>>>> and offline groups (unimportant business). Online groups are
>>>> all configured with MAX low protection, while offline groups
>>>> are not at all protected (with the default 0 low).
>>>>
>>>> When offline groups are overcommitted, the global memory pressure
>>>> suffers. This will cause the memory allocations from online groups
>>>> constantly go to the slow global direct reclaim in order to reclaim
>>>> online's page caches, as kswapd is not able to reclaim low-protected
>>>> memory. low is not a hard limit, it's reasonable to be reclaimed by
>>>> kswapd if there's no other reclaimable memory.
>>>
>>> I am sorry I still do not follow. What role do offline cgroups play.
>>> Those are certainly not low mem protected because mem_cgroup_css_offline
>>> will reset them to 0.
>>>
>>
>> Oh, I meant "offline groups" to be "offline-business groups", the memcgs
>> referred to here are all in "online state" from the kernel's perspective.
>
> What is an offline-business group? Please try to explain the actual problem
> in much more detail and do not let us guess.
>

Maybe I chose the wrong word, let me rephrase it, and here is an example:

                   root (200GB)
                  /            \
     important (100GB)    unimportant (100GB+DYNAMIC)
        /    |    \            /           \
  docker0 docker1 ...   normal (100GB)  oversold (DYNAMIC)
                           / | \            / | \
                         j0 j1 ...        w0 w1 ...

"DYNAMIC" is controlled by the cluster job scheduler dynamically: it periodically samples the available system memory (/proc/meminfo "MemAvailable") and uses part of that to launch oversold jobs under some special conditions. When "oversold" is active, the whole system is put under heavy global memory pressure although the memcgs are not. IOW "DYNAMIC" is primarily borrowed from the "dockers" temporarily; oversold workers will be killed in a timely fashion if the "dockers" need their memory back suddenly, which is rare.

If kswapd doesn't reclaim the low-protected memory configured among the "important" dockers, memory allocations from the dockers will trap into global direct reclaim constantly, which harms their performance and response time. The inactive caches from the dockers are allowed to be reclaimed although they are low-protected (we used a simple MAX setting); we allow the inactive low-protected memory to be reclaimed immediately and asynchronously as long as there's no unprotected reclaimable memory. It's also friendly to disk IO.

For a really latency-sensitive docker, memory.min is supposed to be used to guarantee its memory QoS.

Thanks
Re: [PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection
On 2018/12/3 PM 7:57, Michal Hocko wrote:
> On Mon 03-12-18 16:01:19, Xunlei Pang wrote:
>> When memcgs get reclaimed after its usage exceeds min, some
>> usages below the min may also be reclaimed in the current
>> implementation, the amount is considerably large during kswapd
>> reclaim according to my ftrace results.
>
> And here again. Describe the setup and the behavior please?
>

Step 1:
  mkdir -p /sys/fs/cgroup/memory/online
  cd /sys/fs/cgroup/memory/online
  echo 512M > memory.max
  echo 40960 > memory.min
  echo $$ > tasks
  dd if=/dev/sda of=/dev/null
  while true; do sleep 1; cat memory.current; cat memory.min; done

Step 2:
Create global memory pressure by allocating anonymous and cached pages to constantly trigger kswapd:
  dd if=/dev/sdb of=/dev/null

Step 3:
Then observe the "online" group: hundreds of kilobytes a little over memory.min can cause tens of MiB to be reclaimed by kswapd.

Here is one of the test results I got:
  cat memory.current; cat memory.min; echo;
  409485312   // current
  40960       // min

  385052672   // see, current got over-reclaimed by 23MB
  40960       // min

Its corresponding ftrace output I monitored:
  kswapd_0-281 [000] 304.706632: shrink_node_memcg: min_excess=24, nr_reclaimed=6013, sc->nr_to_reclaim=147, exceeds 5989 pages
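The figures in the report above are internally consistent; a quick arithmetic check (a sketch assuming 4 KiB pages, helper names invented for illustration):

```c
/* Sanity-check the reported over-reclaim figures, assuming 4KiB pages. */
enum { MODEL_PAGE_SIZE = 4096 };

/* Convert a page count (e.g. ftrace nr_reclaimed) to whole MiB. */
static long pages_to_mib(long pages)
{
	return pages * MODEL_PAGE_SIZE / (1024 * 1024);
}

/* MiB difference between two memory.current byte readings. */
static long bytes_delta_mib(long before, long after)
{
	return (before - after) / (1024 * 1024);
}
```

Both the memory.current drop and the ftrace nr_reclaimed figure land around 23 MiB, while the excess above memory.min was only 24 pages, which is the point of the report.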
Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory
On 2018/12/4 AM 1:22, Michal Hocko wrote:
> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>> On 2018/12/3 PM 7:56, Michal Hocko wrote:
>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>> There may be cgroup memory overcommitment; it will become
>>>> even more common in the future.
>>>>
>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>> of memory pressure, to mitigate the global direct reclaim
>>>> pressure which could cause jitters in the response time of
>>>> latency-sensitive groups.
>>>
>>> Please be more descriptive about the problem you are trying to handle
>>> here. I haven't actually read the patch but let me emphasise that the
>>> low limit protection is an important isolation tool. And allowing kswapd
>>> to reclaim protected memcgs is going to break the semantics as they have
>>> been introduced and designed.
>>
>> We have two types of memcgs: online groups (important business)
>> and offline groups (unimportant business). Online groups are
>> all configured with MAX low protection, while offline groups
>> are not protected at all (with the default low of 0).
>>
>> When offline groups are overcommitted, the global memory pressure
>> suffers. This causes memory allocations from online groups to
>> constantly go into the slow global direct reclaim path in order to
>> reclaim online groups' page caches, as kswapd is not able to reclaim
>> low-protected memory. low is not a hard limit; it is reasonable for it
>> to be reclaimed by kswapd if there is no other reclaimable memory.
>
> I am sorry, I still do not follow. What role do offline cgroups play?
> Those are certainly not low-mem protected because mem_cgroup_css_offline
> will reset them to 0.

Oh, by "offline groups" I meant "offline-business groups"; the memcgs
referred to here are all in the "online" state from the kernel's
perspective.
Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory
On 2018/12/3 PM 7:56, Michal Hocko wrote:
> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>> There may be cgroup memory overcommitment; it will become
>> even more common in the future.
>>
>> Let's enable kswapd to reclaim low-protected memory in case
>> of memory pressure, to mitigate the global direct reclaim
>> pressure which could cause jitters in the response time of
>> latency-sensitive groups.
>
> Please be more descriptive about the problem you are trying to handle
> here. I haven't actually read the patch but let me emphasise that the
> low limit protection is an important isolation tool. And allowing kswapd
> to reclaim protected memcgs is going to break the semantics as they have
> been introduced and designed.

We have two types of memcgs: online groups (important business)
and offline groups (unimportant business). Online groups are
all configured with MAX low protection, while offline groups
are not protected at all (with the default low of 0).

When offline groups are overcommitted, the global memory pressure
suffers. This causes memory allocations from online groups to
constantly go into the slow global direct reclaim path in order to
reclaim online groups' page caches, as kswapd is not able to reclaim
low-protected memory. low is not a hard limit; it is reasonable for it
to be reclaimed by kswapd if there is no other reclaimable memory.

>
>>
>> Signed-off-by: Xunlei Pang
>> ---
>>  mm/vmscan.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 62ac0c488624..3d412eb91f73 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>
>>  	count_vm_event(PAGEOUTRUN);
>>
>> +retry:
>>  	do {
>>  		unsigned long nr_reclaimed = sc.nr_reclaimed;
>>  		bool raise_priority = true;
>> @@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>  		sc.priority--;
>>  	} while (sc.priority >= 1);
>>
>> +	if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
>> +		sc.priority = DEF_PRIORITY;
>> +		sc.memcg_low_reclaim = 1;
>> +		sc.memcg_low_skipped = 0;
>> +		goto retry;
>> +	}
>> +
>>  	if (!sc.nr_reclaimed)
>>  		pgdat->kswapd_failures++;
>>
>> --
>> 2.13.5 (Apple Git-94)
>>
>
Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()
On 2018/12/3 PM 7:54, Michal Hocko wrote:
> On Mon 03-12-18 16:01:17, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min rather than 0.
>> Apply the same for low.
>
> Why? What is the actual problem.

children_min_usage tracks the total children usage under min, so it is
natural that min should be added into children_min_usage when usage is
above min. I can't imagine why 0 is added -- is there special history I
missed?

See mem_cgroup_protected(): when usage exceeds min, emin is calculated
as "parent_emin * min / children_min_usage".

>
>> Signed-off-by: Xunlei Pang
>> ---
>>  mm/page_counter.c | 12 ++----------
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
>>  		return;
>>
>>  	if (c->min || atomic_long_read(&c->min_usage)) {
>> -		if (usage <= c->min)
>> -			protected = usage;
>> -		else
>> -			protected = 0;
>> -
>> +		protected = min(usage, c->min);
>>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>>  		delta = protected - old_protected;
>>  		if (delta)
>> @@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
>>  	}
>>
>>  	if (c->low || atomic_long_read(&c->low_usage)) {
>> -		if (usage <= c->low)
>> -			protected = usage;
>> -		else
>> -			protected = 0;
>> -
>> +		protected = min(usage, c->low);
>>  		old_protected = atomic_long_xchg(&c->low_usage, protected);
>>  		delta = protected - old_protected;
>>  		if (delta)
>> --
>> 2.13.5 (Apple Git-94)
>>
>
[PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory
There may be cgroup memory overcommitment; it will become
even more common in the future.

Let's enable kswapd to reclaim low-protected memory in case
of memory pressure, to mitigate the global direct reclaim
pressure which could cause jitters in the response time of
latency-sensitive groups.

Signed-off-by: Xunlei Pang
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..3d412eb91f73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	count_vm_event(PAGEOUTRUN);
 
+retry:
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
@@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
+		sc.priority = DEF_PRIORITY;
+		sc.memcg_low_reclaim = 1;
+		sc.memcg_low_skipped = 0;
+		goto retry;
+	}
+
 	if (!sc.nr_reclaimed)
 		pgdat->kswapd_failures++;

-- 
2.13.5 (Apple Git-94)
[PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection
When a memcg gets reclaimed after its usage exceeds min, some
usage below the min may also be reclaimed in the current
implementation; the amount is considerably large during kswapd
reclaim according to my ftrace results.

This patch calculates the part over the hard protection limit,
and allows only this part of the usage to be reclaimed.

Signed-off-by: Xunlei Pang
---
 include/linux/memcontrol.h |  7 +++++--
 mm/memcontrol.c            |  9 +++++++--
 mm/vmscan.c                | 17 +++++++++++++--
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab2120155a4..637ef975792f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -334,7 +334,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-						struct mem_cgroup *memcg);
+						struct mem_cgroup *memcg,
+						unsigned long *min_excess);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -818,7 +819,9 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 static inline enum mem_cgroup_protection mem_cgroup_protected(
-	struct mem_cgroup *root, struct mem_cgroup *memcg)
+	struct mem_cgroup *root,
+	struct mem_cgroup *memcg,
+	unsigned long *min_excess)
 {
 	return MEMCG_PROT_NONE;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..ca96f68e07a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5694,6 +5694,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * mem_cgroup_protected - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @min_excess: store the number of pages exceeding hard protection
  *
  * WARNING: This function is not stateless! It can only be used as part
  * of a top-down tree iteration, not for isolated queries.
@@ -5761,7 +5762,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-						struct mem_cgroup *memcg)
+						struct mem_cgroup *memcg,
+						unsigned long *min_excess)
 {
 	struct mem_cgroup *parent;
 	unsigned long emin, parent_emin;
@@ -5827,8 +5829,11 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		return MEMCG_PROT_MIN;
 	else if (usage <= elow)
 		return MEMCG_PROT_LOW;
-	else
+	else {
+		if (emin)
+			*min_excess = usage - emin;
 		return MEMCG_PROT_NONE;
+	}
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3d412eb91f73..e4fa7a2a63d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -66,6 +66,9 @@ struct scan_control {
 	/* How many pages shrink_list() should reclaim */
 	unsigned long nr_to_reclaim;
 
+	/* How many pages hard protection allows */
+	unsigned long min_excess;
+
 	/*
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 	 * are scanned.
@@ -2503,10 +2506,14 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	unsigned long nr_to_scan;
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_to_reclaim;
 	struct blk_plug plug;
 	bool scan_adjusted;
 
+	nr_to_reclaim = sc->nr_to_reclaim;
+	if (sc->min_excess)
+		nr_to_reclaim = min(nr_to_reclaim, sc->min_excess);
+
 	get_scan_count(lruvec, memcg, sc, nr, lru_pages);
 
 	/* Record the original scan target for proportional adjustments later */
@@ -2544,6 +2551,10 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 
 		cond_resched();
 
+		/* Abort proportional reclaim when hard protection applies */
+		if (sc->min_excess && nr_reclaimed >= sc->min_excess)
+			break;
+
 		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
 			continue;
 
@@ -2725,8 +2736,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long lru_pages;
 		unsigned long reclaimed;
 		unsigned long scanned;
+		unsigned long excess = 0;
 
-
[PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()
When usage exceeds min, min usage should be min rather than 0.
Apply the same for low.

Signed-off-by: Xunlei Pang
---
 mm/page_counter.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.13.5 (Apple Git-94)
Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()
On 8/1/18 4:55 AM, Cong Wang wrote:
> On Tue, Jul 31, 2018 at 10:13 AM wrote:
>>
>> Xunlei Pang writes:
>>
>>> On 7/31/18 1:55 AM, Cong Wang wrote:
>>>> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang wrote:
>>>>>
>>>>> Hi Cong,
>>>>>
>>>>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>>>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>>>>> we should sync its ->expires_seq too. However it is missing
>>>>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>>>>
>>>>> I don't think it's a problem, as expires_seq will get synced in
>>>>> assign_cfs_rq_runtime().
>>>>
>>>> Sure, but there is a small window during which they are not synced.
>>>> Why do you want to wait until the next assign_cfs_rq_runtime() when
>>>> you already know runtime_expires is synced?
>>>>
>>>> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
>>>> inside __account_cfs_rq_runtime(), which means the check of
>>>> cfs_rq->expires_seq is not accurate for the unthrottling case if the
>>>> clock drift happens soon enough?
>>>>
>>>
>>> expire_cfs_rq_runtime():
>>>     if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>>         /* extend local deadline, drift is bounded above by 2 ticks */
>>>         cfs_rq->runtime_expires += TICK_NSEC;
>>>     } else {
>>>         /* global deadline is ahead, expiration has passed */
>>>         cfs_rq->runtime_remaining = 0;
>>>     }
>>>
>>> So if clock drift happens soon, then expires_seq decides the correct
>>> thing we should do: if cfs_b->expires_seq advanced, then clear the stale
>>> cfs_rq->runtime_remaining from the slack timer of the past period, then
>>> assign_cfs_rq_runtime() will refresh them afterwards; otherwise it is a
>>> real clock drift. I am still not getting where the race is?
>
> But expires_seq is supposed to be the same here, after
> distribute_cfs_runtime(), therefore runtime_remaining is not supposed
> to be cleared.
>
> Which part do I misunderstand? expires_seq should not be the same here?
> Or are you saying a wrong clearing of runtime_remaining is fine?
>

Let's see the unthrottle cases.

1. for the periodic timer

distribute_cfs_runtime updates the throttled cfs_rq->runtime_expires to
a new value, so expire_cfs_rq_runtime does nothing because of:
    rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires < 0

Afterwards assign_cfs_rq_runtime() will sync its expires_seq.

2. for the slack timer

The two expires_seq should be the same, so if clock drift happens soon,
expire_cfs_rq_runtime regards it as true clock drift:
    cfs_rq->runtime_expires += TICK_NSEC

If it happens that the global expires_seq advances, it also doesn't
matter: expire_cfs_rq_runtime will clear the stale runtime_remaining as
expected.

>
>>
>> Nothing /important/ goes wrong because distribute_cfs_runtime only fills
>> runtime_remaining up to 1, not a real amount.
>
> No, runtime_remaining is updated right before expire_cfs_rq_runtime():
>
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> 	/* dock delta_exec before expiring quota (as it could span periods) */
> 	cfs_rq->runtime_remaining -= delta_exec;
> 	expire_cfs_rq_runtime(cfs_rq);
>
> so almost certainly it can't be 1.

I think Ben means it first gets a distribution of 1 to run after
unthrottling; soon it will have a negative runtime_remaining, and go to
assign_cfs_rq_runtime().

Thanks,
Xunlei

>
> Which means the following check could be passed:
>
> 4655         if (cfs_rq->runtime_remaining < 0)
> 4656                 return;
>
> therefore we are reaching the clock drift logic code inside
> expire_cfs_rq_runtime()
> where expires_seq is supposed to be the same as they should be synced.
> Therefore, without the patch, we wrongly clear the runtime_remaining?
>
> Thanks.
>
Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()
On 7/31/18 1:55 AM, Cong Wang wrote:
> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang wrote:
>>
>> Hi Cong,
>>
>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>> we should sync its ->expires_seq too. However it is missing
>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>
>> I don't think it's a problem, as expires_seq will get synced in
>> assign_cfs_rq_runtime().
>
> Sure, but there is a small window during which they are not synced.
> Why do you want to wait until the next assign_cfs_rq_runtime() when
> you already know runtime_expires is synced?
>
> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
> inside __account_cfs_rq_runtime(), which means the check of
> cfs_rq->expires_seq is not accurate for the unthrottling case if the
> clock drift happens soon enough?
>

expire_cfs_rq_runtime():
    if (cfs_rq->expires_seq == cfs_b->expires_seq) {
        /* extend local deadline, drift is bounded above by 2 ticks */
        cfs_rq->runtime_expires += TICK_NSEC;
    } else {
        /* global deadline is ahead, expiration has passed */
        cfs_rq->runtime_remaining = 0;
    }

So if clock drift happens soon, then expires_seq decides the correct
thing we should do: if cfs_b->expires_seq advanced, then clear the stale
cfs_rq->runtime_remaining from the slack timer of the past period, then
assign_cfs_rq_runtime() will refresh them afterwards; otherwise it is a
real clock drift. I am still not getting where the race is?
Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()
Hi Cong,

On 7/28/18 8:24 AM, Cong Wang wrote:
> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
> we should sync its ->expires_seq too. However it is missing
> for distribute_cfs_runtime(), especially the slack timer call path.

I don't think it's a problem, as expires_seq will get synced in
assign_cfs_rq_runtime().

Thanks,
Xunlei

>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Cc: Xunlei Pang
> Cc: Ben Segall
> Cc: Linus Torvalds
> Cc: Peter Zijlstra
> Cc: Thomas Gleixner
> Signed-off-by: Cong Wang
> ---
>  kernel/sched/fair.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f0a0be4d344..910c50db3d74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4857,7 +4857,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  }
>
>  static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +		u64 remaining, u64 expires, int expires_seq)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4880,6 +4880,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>
>  		cfs_rq->runtime_remaining += runtime;
>  		cfs_rq->runtime_expires = expires;
> +		cfs_rq->expires_seq = expires_seq;
>
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4905,7 +4906,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
>  	u64 runtime, runtime_expires;
> -	int throttled;
> +	int throttled, expires_seq;
>
>  	/* no need to continue the timer with no bandwidth constraint */
>  	if (cfs_b->quota == RUNTIME_INF)
> @@ -4933,6 +4934,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  		cfs_b->nr_throttled += overrun;
>
>  	runtime_expires = cfs_b->runtime_expires;
> +	expires_seq = cfs_b->expires_seq;
>
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
> @@ -4946,7 +4948,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  		raw_spin_unlock(&cfs_b->lock);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
>  		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +						 runtime_expires, expires_seq);
>  		raw_spin_lock(&cfs_b->lock);
>
>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> @@ -5055,6 +5057,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> +	int expires_seq;
>  	u64 expires;
>
>  	/* confirm we're still not at a refresh boundary */
> @@ -5068,12 +5071,13 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  		runtime = cfs_b->runtime;
>
>  	expires = cfs_b->runtime_expires;
> +	expires_seq = cfs_b->expires_seq;
>  	raw_spin_unlock(&cfs_b->lock);
>
>  	if (!runtime)
>  		return;
>
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime, expires, expires_seq);
>
>  	raw_spin_lock(&cfs_b->lock);
>  	if (expires == cfs_b->runtime_expires)
>
Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>> cat-20755 [022] d... 1370.106496: cputime_adjust: task
>> tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
>> cat-20755 [022] d... 1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>> cat-20773 [005] d... 1371.109825: cputime_adjust: task
>> tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
>> cat-20773 [005] d... 1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries (all the data below is in ns):
>> task tick-based utime: 36256000->36256700 increased 700
>> task tick-based stime: 255100  ->354700   increased 99600
>> scheduler rtime: 333060702626->334063718912 increased 1003016286
>>
>> The application actually runs almost 100% sys at the moment; we can
>> use the increase of the task tick-based utime and stime to double-check:
>> 99600/(700+99600) > 99% sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> But for the current cputime_adjust(), we get the following adjusted
>> utime and stime increase in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254  ->3236489210   increased 909421956
>>
>> so 909421956/(93594330+909421956) = 91% sys, as the shell script shows
>> above.
>>
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust() always
>> passes the whole times to scale_stime() to split the whole utime and
>> stime. In this patch, we pass all the increased deltas in 1) within
>> the user's sample period to scale_stime() instead and accumulate the
>> corresponding results to the previously saved adjusted utime and stime,
>> so as to guarantee an accurate usr and sys increase within the user
>> sample period.
>
> But why is this a problem?
>
> Since it's sample based there's really nothing much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
>
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
>
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
>

Basically I am OK with the current implementation, except for one
scenario we've met: when the kernel went wrong for some reason and ran
100% sys suddenly for seconds (even triggering soft lockups), the
statistics monitor didn't reflect the fact, which confused people.

One example with our per-cgroup top: we once noticed "20% usr, 80% sys"
displayed while the kernel was in fact in some busy loop (100% sys) at
that moment, and the tick-based times are of course all sys samples in
such a case.
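The two percentages in the analysis above can be double-checked mechanically. A tiny helper (the name is mine) applied to the deltas quoted from the trace reproduces both numbers:

```c
#include <stdint.h>

/* sys share of one sample period, in percent, from the period's deltas */
static double sys_share(uint64_t d_usr_ns, uint64_t d_sys_ns)
{
    return 100.0 * (double)d_sys_ns / (double)(d_usr_ns + d_sys_ns);
}
```

sys_share(700, 99600) gives ~99.3 (the tick-based view of the period), while sys_share(93594330, 909421956) gives ~90.7, i.e. the ~91% sys that the monitoring script reported.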
Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
On 7/17/18 1:41 AM, Ingo Molnar wrote: > > * Peter Zijlstra wrote: > >> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote: >>> Commit-ID: 8d4c00dc38a8aa30dae8402955e55e7b34e74bc8 >>> Gitweb: >>> https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8 >>> Author: Xunlei Pang >>> AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800 >>> Committer: Ingo Molnar >>> CommitDate: Mon, 16 Jul 2018 00:28:31 +0200 >>> >>> sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust() >>> >>> If users access "/proc/pid/stat", the utime and stime ratio in the >>> current SAMPLE period are excepted, but currently cputime_adjust() >>> always calculates with the ratio of the WHOLE lifetime of the process. >>> >>> This results in inaccurate utime and stime in "/proc/pid/stat". For >>> example, a process runs for a while with "50% usr, 0% sys", then >>> followed by "100% sys". For later while, the following is excepted: >>> >>> 0.0 usr, 100.0 sys >>> >>> but we get: >>> >>> 10.0 usr, 90.0 sys >>> >>> This patch uses the accurate ratio in cputime_adjust() to address the >>> issue. A new 'task_cputime' type field is added in prev_cputime to record >>> previous 'task_cputime' so that we can get the elapsed times as the accurate >>> ratio. >> >> Ingo, please make this one go away. > > Sure, I've removed it from sched/core. > Hi Ingo, Peter, Frederic, I captured some runtime data using trace to explain it, hope this can illustrate the motive behind my patch. Anyone could help improve my changelog is appreciated if you think so after reading. 
Here are the simple trace_printk calls I added to capture the real data:

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..b65c1f250941 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -611,6 +611,9 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 	stime = curr->stime;
 	utime = curr->utime;
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("task tick-based utime %lld stime %lld, scheduler rtime %lld\n", utime, stime, rtime);
+
 	/*
 	 * If either stime or utime are 0, assume all runtime is userspace.
 	 * Once a task gets some ticks, the monotonicy code at 'update:'
@@ -651,9 +654,14 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		stime = rtime - utime;
 	}
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("result: old utime %lld stime %lld, new utime %lld stime %lld\n",
+			prev->utime, prev->stime, utime, stime);
+
 	prev->stime = stime;
 	prev->utime = utime;
 out:
 	*ut = prev->utime;
 	*st = prev->stime;
 	raw_spin_unlock_irqrestore(&prev->lock, flags);

Using the reproducer I described in the changelog, it runs for a while
with "50% usr, 0% sys", then followed by "100% sys". A shell script
accesses its /proc/pid/stat at the interval of one second, and got:
50.0 usr, 0.0 sys
51.0 usr, 0.0 sys
50.0 usr, 0.0 sys
...
 9.0 usr, 91.0 sys
 9.0 usr, 91.0 sys

The trace data corresponds to the last sample period:

trace entry 1:
cat-20755 [022] d... 1370.106496: cputime_adjust: task
tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
cat-20755 [022] d... 1370.106497: cputime_adjust: result:
old utime 330729718142 stime 2306983867, new utime 330733635372 stime 2327067254

trace entry 2:
cat-20773 [005] d... 1371.109825: cputime_adjust: task
tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
cat-20773 [005] d... 1371.109826: cputime_adjust: result:
old utime 330733635372 stime 2327067254, new utime 330827229702 stime 3236489210

1) expected behaviour
Let's compare the last two trace entries (all the data below is in ns):
task tick-based utime: 36256000->36256700 increased 700
task tick-based stime: 255100  ->354700   increased 99600
scheduler rtime: 333060702626->334063718912 increased 1003016286

The application actually runs almost 100% sys at the moment; we can
use the increase of the task tick-based utime and stime to double-check:
99600/(700+99600) > 99% sys

2) the current cputime_adjust() inaccurate result
But for the current cputime_adjust(), we get the following adjusted
utime and stime increase in this sample period:
adjusted utime: 330733635372->330827229702 increased 93594330
adjusted stime: 2327067254  ->3236489210   increased 909421956

so 909421956/(93594330+909421956) = 91% sys, as the shell script shows
above.
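The root cause can also be reproduced numerically. Below is a simplified model of scale_stime() (my own sketch; the real kernel code computes the same split without a 128-bit division), applied once to the lifetime totals and once to the per-period deltas from the trace:

```c
#include <stdint.h>

/*
 * Simplified model of scale_stime(): split rtime proportionally to the
 * stime share of the tick samples, i.e. rtime * stime / (utime + stime).
 * Requires a 64-bit gcc/clang target for unsigned __int128.
 */
static uint64_t scale_stime_sketch(uint64_t stime, uint64_t rtime,
                                   uint64_t total)
{
    return (uint64_t)((unsigned __int128)rtime * stime / total);
}
```

Scaling the lifetime totals (stime 354700 of 36611400 ticks over rtime 334063718912) gives an adjusted stime whose per-period increase is only ~90% of the period's rtime, while scaling just the period's deltas (99600 of 100300 ticks over 1003016286 ns) gives ~99% sys — the accurate figure that the patch accumulates instead.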
[tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Author:     Xunlei Pang
AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
Committer:  Ingo Molnar
CommitDate: Mon, 16 Jul 2018 00:28:31 +0200

sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

If users access "/proc/pid/stat", the utime and stime ratio in the
current SAMPLE period are expected, but currently cputime_adjust()
always calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example, a process runs for a while with "50% usr, 0% sys", then
followed by "100% sys". For the latter period, the following is
expected:

  0.0 usr, 100.0 sys

but we get:

  10.0 usr, 90.0 sys

This patch uses the accurate ratio in cputime_adjust() to address the
issue. A new 'task_cputime' type field is added in prev_cputime to
record the previous 'task_cputime' so that we can get the elapsed times
as the accurate ratio.
Signed-off-by: Xunlei Pang
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Luiz Capitulino
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: baoyou@gmail.com
Link: http://lkml.kernel.org/r/20180709145843.126583-1-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar
---
 include/linux/sched.h         | 34
 include/linux/sched/cputime.h | 12
 kernel/sched/cputime.c        | 61
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..fedc69d4a425 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:		time spent in user mode, in nanoseconds
+ * @stime:		time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups. Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+	u64				utime;
+	u64				stime;
+	unsigned long long		sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime:	time spent in user mode
  * @stime:	time spent in system mode
+ * @cputime:	previous task_cputime to calculate utime/stime
  * @lock:	protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	u64				utime;
 	u64				stime;
+	struct task_cputime		cputime;
 	raw_spinlock_t			lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:		time spent in user mode, in nanoseconds
- * @stime:		time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups. Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-	u64				utime;
-	u64				stime;
-	unsigned long long		sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp			utime
 #define prof_exp			stime

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	prev->utime = prev->stime = 0;
+	prev->cputime.utime = 0;
+	prev->cputime.stime = 0;
+	prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+	prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
Hi Peter, On 7/9/18 6:48 PM, Peter Zijlstra wrote: > On Mon, Jul 09, 2018 at 01:52:38PM +0800, Xunlei Pang wrote: >> Please see the enclosure for the reproducer cputime_adjust.tgz > > No, I'm not going to reverse engineer something if you cannot even > explain what the problem is. > I rewrote the whole changelog and sent v2, please have a look. Thanks, Xunlei
[PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
If users access "/proc/pid/stat", the utime and stime ratio in the current SAMPLE period is expected, but currently cputime_adjust() always calculates with the ratio of the WHOLE lifetime of the process. This results in inaccurate utime and stime in "/proc/pid/stat".

For example, a process runs for a while with "50% usr, 0% sys", then switches to "100% sys". For the latter period the expected output is:
  0.0 usr, 100.0 sys
but we got:
  10.0 usr, 90.0 sys

This patch uses the accurate ratio in cputime_adjust() to address the issue. A new task_cputime type field is added in prev_cputime to record the previous task_cputime, so that we can get the elapsed times and thus the accurate ratio.

Signed-off-by: Xunlei Pang
---
v1->v2:
- Rewrite the changelog.

 include/linux/sched.h         | 34
 include/linux/sched/cputime.h | 12
 kernel/sched/cputime.c        | 61
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..9cb76005b638 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:		time spent in user mode, in nanoseconds
+ * @stime:		time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+	u64				utime;
+	u64				stime;
+	unsigned long long		sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime:	time spent in user mode
  * @stime:	time spent in system mode
+ * @cputime:	previous task_cputime to calculate utime/stime
  * @lock:	protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	u64				utime;
 	u64				stime;
+	struct task_cputime		cputime;
 	raw_spinlock_t			lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:		time spent in user mode, in nanoseconds
- * @stime:		time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-	u64				utime;
-	u64				stime;
-	unsigned long long		sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp			utime
 #define prof_exp			stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	prev->utime = prev->stime = 0;
+	prev->cputime.utime = 0;
+	prev->cputime.stime = 0;
+	prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+	prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		    u64 *ut, u64 *st)
 {
-	u64 rtime, stime, utime;
+	u64 rtime_delta, stime_delta, utime_delta;
 	unsigned long flags;
 
	/* Serialize concurrent callers such that we can honour our guarantees */
 	raw_spin_lock_irqsave(&prev->lock, flags);
-	rtime = curr->sum_exec_run
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
Hi Peter,

On 7/5/18 9:21 PM, Xunlei Pang wrote:
> On 7/5/18 6:46 PM, Peter Zijlstra wrote:
>> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>>> tick-based whole utime is utime_0, tick-based whole stime
>>> is stime_0, scheduler time is rtime_0.
>>
>>> For a long time, the process runs mainly in userspace with
>>> run-sleep patterns, and because of the two different clocks, it
>>> is possible to have the following condition:
>>> rtime_0 < utime_0 (as with little stime_0)
>>
>> I don't follow... what?
>>
>> Why are you, and why do you think it makes sense to, compare rtime_0
>> against utime_0 ?
>>
>> The [us]time_0 are, per your earlier definition, ticks. They're not an
>> actual measure of time. Do not compare the two, that makes no bloody
>> sense.
>>
> [us]time_0 is task_struct::{utime,stime}, cited directly from
> cputime_adjust(), both in nanoseconds. I assumed "rtime_0 < utime_0"
> here to simplify the following proof and help explain the problem we met.

Please see the enclosure for the reproducer cputime_adjust.tgz
(process_top.sh, usr_sys.c):

gcc usr_sys.c -o usr_sys

Firstly, the function consume_sys() in usr_sys.c yields 100% sys, which
can be verified as follows:
$ taskset -c 0 ./usr_sys 1
$ ./process_top.sh $(pidof usr_sys)
0.0 usr, 100.0 sys
0.0 usr, 100.0 sys

Tested on my local box on 4.17.0 by executing "taskset -c 0 ./usr_sys",
then executing "./process_top.sh $(pidof usr_sys)" to watch.
1) Before this patch:
50.0 usr, 0.0 sys
50.0 usr, 1.0 sys
50.0 usr, 0.0 sys
50.0 usr, 0.0 sys
49.0 usr, 4.0 sys  // switch to consume 100% sys, ignore this line
12.0 usr, 88.0 sys
11.0 usr, 89.0 sys
10.0 usr, 90.0 sys
10.0 usr, 90.0 sys
9.0 usr, 91.0 sys
8.0 usr, 91.0 sys
Obviously around 10% sys wrongly went to usr.

2) After this patch:
50.0 usr, 0.0 sys
50.0 usr, 0.0 sys
50.0 usr, 0.0 sys
50.0 usr, 0.0 sys
11.0 usr, 76.0 sys  // switch to consume 100% sys, ignore this line
1.0 usr, 100.0 sys
0.0 usr, 100.0 sys
1.0 usr, 100.0 sys
0.0 usr, 100.0 sys
0.0 usr, 100.0 sys
So it displays the correct result as we expected after this patch.

Thanks,
Xunlei

Attachment: cputime_adjust.tgz (GNU Zip compressed data)
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
On 7/5/18 6:46 PM, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>> tick-based whole utime is utime_0, tick-based whole stime
>> is stime_0, scheduler time is rtime_0.
>
>> For a long time, the process runs mainly in userspace with
>> run-sleep patterns, and because of the two different clocks, it
>> is possible to have the following condition:
>> rtime_0 < utime_0 (as with little stime_0)
>
> I don't follow... what?
>
> Why are you, and why do you think it makes sense to, compare rtime_0
> against utime_0 ?
>
> The [us]time_0 are, per your earlier definition, ticks. They're not an
> actual measure of time. Do not compare the two, that makes no bloody
> sense.
>

[us]time_0 is task_struct::{utime,stime}, cited directly from
cputime_adjust(), both in nanoseconds. I assumed "rtime_0 < utime_0"
here to simplify the following proof and help explain the problem we met.
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
On 7/2/18 11:21 PM, Tejun Heo wrote:
> Hello, Peter.
>
> On Tue, Jun 26, 2018 at 05:49:08PM +0200, Peter Zijlstra wrote:
>> Well, no, because the Changelog is incomprehensible and the patch
>> doesn't really have useful comments, so I'll have to reverse engineer
>> the entire thing, and I've just not had time for that.
>
> Just as an additional data point, we also sometimes see artifacts from
> cpu_adjust_time() in the form of per-task user or sys time getting
> stuck for some period (in extreme cases for over a minute) while the
> application isn't doing anything differently. We're telling the users
> that it's an inherent sampling artifact but it'd be nice to improve it
> if possible without adding noticeable overhead. No idea whether this
> patch's approach is a good one tho.

The patch has no noticeable overhead except for the extra cputime fields
added into task_struct. We've been running this patch on our servers for
months; it looks good so far.

Thanks!
[tip:sched/core] sched/fair: Advance global expiration when period timer is restarted
Commit-ID:  f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Gitweb:     https://git.kernel.org/tip/f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Author:     Xunlei Pang
AuthorDate: Wed, 20 Jun 2018 18:18:34 +0800
Committer:  Ingo Molnar
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Advance global expiration when period timer is restarted

When the period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, so expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than the rq clock and go down the clock
drift logic, wasting needless CPU cycles on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.

Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 15
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 791707c56886..840b92ee6f89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
-	if (!cfs_b->period_active) {
-		cfs_b->period_active = 1;
-		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
-	}
+	if (cfs_b->period_active)
+		return;
+
+	cfs_b->period_active = 1;
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
[tip:sched/core] sched/fair: Fix bandwidth timer clock drift condition
Commit-ID:  512ac999d2755d2b7109e996a76b6fb8b888631d
Gitweb:     https://git.kernel.org/tip/512ac999d2755d2b7109e996a76b6fb8b888631d
Author:     Xunlei Pang
AuthorDate: Wed, 20 Jun 2018 18:18:33 +0800
Committer:  Ingo Molnar
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Fix bandwidth timer clock drift condition

I noticed that cgroup task groups constantly get throttled even if they
have low CPU usage; this causes some jitter in the response time of some
of our business containers when enabling CPU quotas.

It's very simple to reproduce:

  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 10 > cpu.cfs_quota_us
  echo $$ > tasks

then repeat:

  cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will be
cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after the period timer is re-armed.

The current condition used to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires values are actually the same when clock
drift happens, so this condition can never hit. The original design was
correctly done by this commit:

  a9cf55b28610 ("sched: Expire invalid runtime")

... but was changed to the current implementation due to its locking bug.

This patch introduces another way: it adds a new field to both struct
cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
and uses them to figure out whether clock drift has happened (true if they
are equal).
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c  | 14
 kernel/sched/sched.h |  6
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1866e64792a7..791707c56886 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 	u64 amount = 0, min_amount, expires;
+	int expires_seq;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
+	expires_seq = cfs_b->expires_seq;
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * spread between our sched_clock and the one on which runtime was
 	 * issued.
 	 */
-	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+	if (cfs_rq->expires_seq != expires_seq) {
+		cfs_rq->expires_seq = expires_seq;
 		cfs_rq->runtime_expires = expires;
+	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * has not truly expired.
 	 *
 	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline has advanced. It is valid to compare
-	 * cfs_b->runtime_expires without any locks since we only care about
-	 * exact equality, so a partial write will still work.
+	 * whether the global deadline(cfs_b->expires_seq) has advanced.
 	 */
-
-	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27ddec334601..c7742dcc136c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
 	u64			runtime;
 	s64			hierarchical_quota;
 	u64			runtime_expires;
+	int			expires_seq;
 
-	int			idle;
-	int
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
On 6/26/18 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 08:19:49PM +0800, Xunlei Pang wrote:
>> On 6/22/18 3:15 PM, Xunlei Pang wrote:
>>> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
>>> and encountered a problem that user and sys usages are wrongly
>>> split sometimes.
>>>
>>> Run tasks with some random run-sleep pattern for a long time, and
>>> when tick-based time and scheduler sum_exec_runtime hugely drift
>>> apart (scheduler sum_exec_runtime is less than tick-based time),
>>> the current implementation of cputime_adjust() will produce less
>>> sys usage than the actual use after changing to run a different
>>> workload pattern with high sys. This is because the total tick-based
>>> utime and stime are used to split the total sum_exec_runtime.
>>>
>>> The same problem exists for utime and stime from "/proc//stat".
>>>
>>> [Example]
>>> Run some random run-sleep patterns for minutes, then change to run
>>> a high sys pattern, and watch.
>>> 1) standard "top" (which is the correct one):
>>>    4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>> 2) our tool parsing utime and stime from "/proc//stat":
>>>    20.5 usr, 78.4 sys
>>> We can see the "20.5 usr" displayed in 2) was incorrect; it recovers
>>> gradually with time: 9.7 usr, 89.5 sys
>>
>> High sys probably means there's something abnormal on the kernel
>> path, and it may hide issues, so we should make it fairly reliable.
>> It can easily hit this problem with our per-cgroup statistics.
>>
>> Hi Peter, any comment on this patch?
>
> Well, no, because the Changelog is incomprehensible and the patch
> doesn't really have useful comments, so I'll have to reverse engineer
> the entire thing, and I've just not had time for that.
>

Let me try my best to describe it again.

There are two kinds of run time for a process:
1) task_struct::utime and task_struct::stime, accumulated in ticks.
2) scheduler task_struct::se.sum_exec_runtime (rtime), in ns.
In the absence of vtime accounting, the utime/stime fields of /proc/<pid>/stat
are calculated by cputime_adjust(), which splits the precise rtime in the
proportion of the tick-based utime and stime. However, cputime_adjust() always
does the split using the *total* utime/stime of the process, which can split
wrongly in some cases. For example, a typical statistics collector reads
"/proc/<pid>/stat" periodically:

1) moment t0
After reading /proc/<pid>/stat at t0: the tick-based whole utime is utime_0,
the tick-based whole stime is stime_0, and the scheduler time is rtime_0.
The adjusted utime calculated by cputime_adjust() is autime_0, and the
adjusted stime is astime_0, so:
    astime_0 = rtime_0 * stime_0 / (utime_0 + stime_0)
For a long time the process runs mainly in userspace with run-sleep patterns,
and because the two clocks differ it is possible to end up with:
    rtime_0 < utime_0    (since stime_0 is small)

2) moment t1 (t1 = t0 + dt)
The process then suddenly runs a 100% sys workload lasting "dt". When reading
/proc/<pid>/stat at t1 = t0 + dt, both rtime_0 and stime_0 have increased by
"dt", so cputime_adjust() calculates the new adjusted astime_1 as:
    (rtime_0+dt)*(stime_0+dt)/(utime_0+stime_0+dt)
  = (rtime_0*stime_0+rtime_0*dt+stime_0*dt+dt*dt)/(utime_0+stime_0+dt)
  = (rtime_0*stime_0+rtime_0*dt-utime_0*dt)/(utime_0+stime_0+dt) + dt
  < rtime_0*stime_0/(utime_0+stime_0+dt) + dt    (since rtime_0 < utime_0)
  < rtime_0*stime_0/(utime_0+stime_0) + dt
  = astime_0 + dt

The actual astime_1 should be "astime_0 + dt" (the process ran 100% sys
during dt), but the figure calculated by cputime_adjust() is much smaller.
As a result, the statistics collector shows less cpu sys usage than the
actual one. That's why we occasionally observed the incorrect cpu usage
described in the changelog:

[Example]
Run some random run-sleep patterns for minutes, then change to run a
high sys pattern, and watch.
1) standard "top" (which is the correct one):
   4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) parsing utime and stime from "/proc/<pid>/stat":
   20.5 usr, 78.4 sys

We can see the "20.5 usr" displayed in 2) was incorrect; it recovers
gradually with time: 9.7 usr, 89.5 sys

Thanks,
Xunlei
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
On 6/22/18 3:15 PM, Xunlei Pang wrote:
> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
> and encountered a problem that user and sys usages are wrongly
> split sometimes.
>
> Run tasks with some random run-sleep pattern for a long time, and
> when tick-based time and scheduler sum_exec_runtime hugely drift
> apart (scheduler sum_exec_runtime is less than tick-based time),
> the current implementation of cputime_adjust() will produce less
> sys usage than the actual use after changing to run a different
> workload pattern with high sys. This is because total tick-based
> utime and stime are used to split the total sum_exec_runtime.
>
> The same problem exists for utime and stime from "/proc/<pid>/stat".
>
> [Example]
> Run some random run-sleep patterns for minutes, then change to run a
> high sys pattern, and watch.
> 1) standard "top" (which is the correct one):
>    4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> 2) our tool parsing utime and stime from "/proc/<pid>/stat":
>    20.5 usr, 78.4 sys
> We can see the "20.5 usr" displayed in 2) was incorrect; it recovers
> gradually with time: 9.7 usr, 89.5 sys

High sys probably means there's something abnormal on the kernel path,
and it may hide issues, so we should make it fairly reliable. This
problem is easily hit with our per-cgroup statistics.

Hi Peter, any comment on this patch?
Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion
On 6/22/18 6:35 PM, kbuild test robot wrote: > Hi Xunlei, > > Thank you for the patch! Perhaps something to improve: > > [auto build test WARNING on tip/sched/core] > [also build test WARNING on v4.18-rc1 next-20180622] > [if your patch is applied to the wrong git tree, please drop us a note to > help improve the system] > > url: > https://github.com/0day-ci/linux/commits/Xunlei-Pang/sched-cputime-Ensure-correct-utime-and-stime-proportion/20180622-153720 > reproduce: make htmldocs > > All warnings (new ones prefixed by >>): > >WARNING: convert(1) not found, for SVG to PDF conversion install > ImageMagick (https://www.imagemagick.org) >mm/mempool.c:228: warning: Function parameter or member 'pool' not > described in 'mempool_init' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.ibss' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.connect' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.keys' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.ie' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.ie_len' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.bssid' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.ssid' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.default_key' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.default_mgmt_key' not described in 'wireless_dev' >include/net/cfg80211.h:4279: warning: Function parameter or member > 'wext.prev_bssid_valid' not described in 'wireless_dev' >include/net/mac80211.h:2282: warning: Function parameter or member > 
'radiotap_timestamp.units_pos' not described in 'ieee80211_hw' >include/net/mac80211.h:2282: warning: Function parameter or member > 'radiotap_timestamp.accuracy' not described in 'ieee80211_hw' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.rates' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.rts_cts_rate_idx' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.use_rts' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.use_cts_prot' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.short_preamble' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.skip_table' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.jiffies' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.vif' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.hw_key' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.flags' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'control.enqueue_time' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member 'ack' > not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'ack.cookie' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'status.rates' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 
'status.ack_signal' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'status.ampdu_ack_len' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'status.ampdu_len' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'status.antenna' not described in 'ieee80211_tx_info' >include/net/mac80211.h:955: warning: Function parameter or member > 'status.tx_time' not described in 'ieee80211_tx_info' >include/net/mac80211
[PATCH] sched/cputime: Ensure correct utime and stime proportion
We use per-cgroup cpu usage statistics similar to "cgroup rstat", and
encountered a problem that user and sys usages are wrongly split sometimes.

Run tasks with some random run-sleep pattern for a long time, and when
tick-based time and scheduler sum_exec_runtime hugely drift apart
(scheduler sum_exec_runtime is less than tick-based time), the current
implementation of cputime_adjust() will produce less sys usage than the
actual use after changing to run a different workload pattern with high
sys. This is because the total tick-based utime and stime are used to
split the total sum_exec_runtime.

The same problem exists for utime and stime from "/proc/<pid>/stat".

[Example]
Run some random run-sleep patterns for minutes, then change to run a
high sys pattern, and watch.
1) standard "top" (which is the correct one):
   4.6 us, 94.5 sy, 0.0 ni, 0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) our tool parsing utime and stime from "/proc/<pid>/stat":
   20.5 usr, 78.4 sys
We can see the "20.5 usr" displayed in 2) was incorrect; it recovers
gradually with time: 9.7 usr, 89.5 sys

This patch fixes the issue by doing the calculation in cputime_adjust()
on the time deltas elapsed since the last call, and accumulating the
adjusted results into prev_cputime. A new field of type task_cputime is
added to struct prev_cputime to record the previous task_cputime, so
that the elapsed time deltas can be derived.
Signed-off-by: Xunlei Pang
---
 include/linux/sched.h         | 33 +++
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c        | 61 ---
 3 files changed, 51 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..5108ac8414e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,22 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:		time spent in user mode, in nanoseconds
+ * @stime:		time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups. Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+	u64				utime;
+	u64				stime;
+	unsigned long long		sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -236,26 +252,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	u64				utime;
 	u64				stime;
+	struct task_cputime		cputime;
 	raw_spinlock_t			lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:		time spent in user mode, in nanoseconds
- * @stime:		time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups. Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-	u64				utime;
-	u64				stime;
-	unsigned long long		sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp			utime
 #define prof_exp			stime

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	prev->utime = prev->stime = 0;
+	prev->cputime.utime = 0;
+	prev->cputime.stime = 0;
+	prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+	prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 s
Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
On 6/21/18 4:08 PM, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 11:56:56AM +0800, Xunlei Pang wrote:
>>>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some
>>>> cfs_b->quota/period")
>>>> Cc: Ben Segall
>>>
>>> Reviewed-By: Ben Segall
>>
>> Thanks Ben :-)
>>
>> Hi Peter, could you please have a look at them?
>
> I grabbed them, thanks guys!
>
>> Cc sta...@vger.kernel.org
>
> For both or just the first?

Both, thanks!
Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
On 6/21/18 1:01 AM, bseg...@google.com wrote:
> Xunlei Pang writes:
>
>> I noticed the group constantly got throttled even though it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very simple to reproduce:
>>   mkdir /sys/fs/cgroup/cpu/test
>>   cd /sys/fs/cgroup/cpu/test
>>   echo 10 > cpu.cfs_quota_us
>>   echo $$ > tasks
>> then repeat:
>>   cat cpu.stat | grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after the period timer is re-armed.
>>
>> The current condition to judge clock drift in expire_cfs_rq_runtime()
>> is wrong: the two runtime_expires are actually the same when clock
>> drift happens, so this condition can never hit. The original design
>> was correctly done by commit a9cf55b28610 ("sched: Expire invalid
>> runtime"), but was changed to the current one due to its locking issue.
>>
>> This patch introduces another way: it adds a new field to both
>> structures cfs_rq and cfs_bandwidth to record the expiration update
>> sequence, and uses them to figure out whether clock drift has
>> happened (true if they are equal).
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some
>> cfs_b->quota/period")
>> Cc: Ben Segall
>
> Reviewed-By: Ben Segall

Thanks Ben :-)

Hi Peter, could you please have a look at them?
Cc sta...@vger.kernel.org

>
>> Signed-off-by: Xunlei Pang
>> ---
>>  kernel/sched/fair.c  | 14 --
>>  kernel/sched/sched.h |  6 --
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..e6bb68d52962 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>>  	now = sched_clock_cpu(smp_processor_id());
>>  	cfs_b->runtime = cfs_b->quota;
>>  	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> +	cfs_b->expires_seq++;
>>  }
>>
>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	struct task_group *tg = cfs_rq->tg;
>>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>  	u64 amount = 0, min_amount, expires;
>> +	int expires_seq;
>>
>>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  			cfs_b->idle = 0;
>>  		}
>>  	}
>> +	expires_seq = cfs_b->expires_seq;
>>  	expires = cfs_b->runtime_expires;
>>  	raw_spin_unlock(&cfs_b->lock);
>>
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	 * spread between our sched_clock and the one on which runtime was
>>  	 * issued.
>>  	 */
>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +	if (cfs_rq->expires_seq != expires_seq) {
>> +		cfs_rq->expires_seq = expires_seq;
>>  		cfs_rq->runtime_expires = expires;
>> +	}
>>
>>  	return cfs_rq->runtime_remaining > 0;
>>  }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	 * has not truly expired.
>>  	 *
>>  	 * Fortunately we can check determine whether this the case by checking
>> -	 * whether the global deadline has advanced. It is valid to compare
>> -	 * cfs_b->runtime_expires without any locks since we only care about
>> -	 * exact equality, so a partial write will still work.
>> +	 * whether the global deadline(cfs_b->expires_seq) has advanced.
>>  	 */
>> -
>> -	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> +	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>  		/* extend local deadline, drift is bounded above by 2 ticks */
>>  		cfs_rq->runtime_expires += TICK_NSEC;
>>  	} else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e977e04f8daf