Re: [PATCH 4/4] mm/page_reporting: Fix possible user allocation failure

2021-04-06 Thread Xunlei Pang
On 4/3/21 3:55 AM, Alexander Duyck wrote:
> On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang  wrote:
>>
>> We encountered user memory allocation failures (OOM) on our
>> 512 MiB tiny instances; they did not happen after turning off
>> page reporting.
>>
>> After some debugging, it turned out that 32*4MB=128MB of order-10
>> free pages were isolated during the reporting window, leaving no
>> free memory available.
>>
>> This can also happen on large instances when free memory is low.
>>
>> This patch introduces a rule that limits the reporting capacity
>> according to the current free memory, reducing the capacity for
>> higher orders that would otherwise break this rule.
>>
>> For example,
>>  100MiB free, sgl capacity for different orders are:
>>    order-9 : 32
>>order-10: 16
>>
>> Reported-by: Helin Guo 
>> Tested-by: Helin Guo 
>> Signed-off-by: Xunlei Pang 
> 
> I'm curious how much of this would be solved by just making it so that
> we reduce the capacity by half if we increase the order? So
> specifically if we did something such as:
>   capacity = (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER) >> order;
> 
> We just have to make sure the capacity is greater than zero before
> entering the processing loop.
> 
> An alternative that occurred to me while I reviewed this is to look at
> just adding a reserve. That would be something like:
>   reserve = PAGE_REPORTING_CAPACITY - capacity;
> 
> Basically the reserve would take up some space at the start of the
> list so that you wouldn't need to actually change the capacity
> directly. It would just be a matter of making certain we deducted it
> and updated the offsets of the scatterlist as necessary.
> 
> 
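
For illustration, a minimal sketch of the two alternatives described above, reusing the existing PAGE_REPORTING_CAPACITY and PAGE_REPORTING_MIN_ORDER definitions (this is a sketch of the suggestion, not the posted patch):

	/*
	 * Alternative 1: halve the effective capacity for every order above
	 * the minimum reporting order; callers must skip the cycle entirely
	 * when this returns 0.
	 */
	static unsigned int reporting_capacity(unsigned int order)
	{
		return (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER) >> order;
	}

	/*
	 * Alternative 2: keep the scatterlist at its full capacity and treat
	 * the first entries as a reserve, so the fill loop stops once *offset
	 * reaches the reserve instead of 0; only the offset bookkeeping changes.
	 */
	static unsigned int reporting_reserve(unsigned int order)
	{
		return PAGE_REPORTING_CAPACITY - reporting_capacity(order);
	}
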
>> ---
>>  mm/page_reporting.c | 89 
>> +++--
>>  1 file changed, 72 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index 6ffedb8..2ec0ec0 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
>> @@ -129,8 +129,8 @@ void __page_reporting_notify(void)
>>   */
>>  static int
>>  page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone 
>> *zone,
>> -unsigned int order, unsigned int mt,
>> -struct scatterlist *sgl, unsigned int *offset)
>> +unsigned int order, unsigned int mt, struct scatterlist 
>> *sgl,
>> +const unsigned int capacity, unsigned int *offset)
>>  {
>> struct free_area *area = &zone->free_area[order];
>> struct list_head *list = &area->free_list[mt];
>> @@ -161,10 +161,10 @@ void __page_reporting_notify(void)
>>  * list processed. This should result in us reporting all pages on
>>  * an idle system in about 30 seconds.
>>  *
>> -* The division here should be cheap since PAGE_REPORTING_CAPACITY
>> -* should always be a power of 2.
>> +* The division here should be cheap since capacity should
>> +* always be a power of 2.
>>  */
>> -   budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
>> +   budget = DIV_ROUND_UP(area->nr_free, capacity * 16);
> 
> So the comment here is no longer valid when capacity became a
> variable. An alternative to look at if we were to assume the shift
> approach I mentioned above would be to then shift the budget based on
> the reduced capacity.
> 
>> /* loop through free list adding unreported pages to sg list */
>> list_for_each_entry_safe(page, next, list, lru) {
>> @@ -196,7 +196,7 @@ void __page_reporting_notify(void)
>> --(*offset);
>> sg_set_page(&sgl[*offset], page, page_len, 0);
>>
>> -   nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << 
>> order;
>> +   nr_pages = (capacity - *offset) << order;
>> if (zone->reported_pages + nr_pages >= threshold) {
>> err = 1;
>> break;
> 
> Rather than adding a capacity value it might work better to add a
> "reserve" value so that we are just padding the start of the
> scatterlist rather than having to reset it every time we change the
> total capacity of the scatterlist. The advantage to that is that you
> could drop all the changes where you are having to reset the list and
> change the capacity.
> 
> Instead you would just need to up

Re: [PATCH 2/4] mm/page_reporting: Introduce free page reporting factor

2021-04-06 Thread Xunlei Pang
On 4/3/21 2:56 AM, Alexander Duyck wrote:
> On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang  wrote:
>>
>> Add new "/sys/kernel/mm/page_reporting/reporting_factor"
>> within [0, 100], and stop page reporting when it reaches
>> the configured threshold. Default is 100 which means no
>> limitation is imposed. Percentile is adopted to reflect
>> the fact that it reports on the per-zone basis.
>>
>> We can control the total number of reporting pages via
>> this knob to avoid EPT violations which may affect the
>> performance of the business, imagine the guest memory
>> allocation burst or host long-tail memory reclaiming
>> really hurt.
> 
> I'm not a fan of the concept as I don't think it really does what it
> was meant to do. The way page reporting was meant to work is that when
> we have enough free pages we will cycle through memory a few pages at
> a time reporting what is unused to the hypervisor. It was meant to be
> a scan more than something that just would stop once it touched a
> certain part of the memory.
> 
> If you are wanting to truly reserve some amount of memory so that it
> is always left held by the guest then it might make more sense to make
> the value a fixed amount of memory rather than trying to do it as a
> percentage.
> 
> Also we may need to look at adding some sort of
> linearization/defragmentation logic for the reported pages. One issue
> is that there are several things that will add pages to the end of the
> free page lists. One of the reasons why I was processing the entire
> list when I was processing reported pages was because the page freeing
> functions will normally cause pages to be interleaved with the
> reported pages on the end of the list. So if you are wanting to
> reserve some pages as being non-reported we may need to add something
> to sort the lists periodically.

Yes, agreed. I also noticed this problem while trying to make the
counter accurate, and I'm going to figure out a way to handle it.
For example, adding a new migratetype for the reported free_list may
be a good choice; it also helps reduce the zone lock latency during
the reporting procedure on large VMs, which can be in the milliseconds.

> 
>> This knob helps build customized control policies according to
>> VM priority; it is also useful for testing, gray (canary) releases, etc.
> 
> As far as the knob itself it would make sense to combine this with
> patch 3 since they are just different versions of the same control
> 
>> ---
>>  mm/page_reporting.c | 60 
>> -
>>  1 file changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> index ba195ea..86c6479 100644
>> --- a/mm/page_reporting.c
>> +++ b/mm/page_reporting.c
>> @@ -11,6 +11,8 @@
>>  #include "page_reporting.h"
>>  #include "internal.h"
>>
>> +static int reporting_factor = 100;
>> +
>>  #define PAGE_REPORTING_DELAY   (2 * HZ)
>>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>>
>> @@ -134,6 +136,7 @@ void __page_reporting_notify(void)
>> struct list_head *list = &area->free_list[mt];
>> unsigned int page_len = PAGE_SIZE << order;
>> struct page *page, *next;
>> +   unsigned long threshold;
>> long budget;
>> int err = 0;
>>
>> @@ -144,6 +147,7 @@ void __page_reporting_notify(void)
>> if (list_empty(list))
>> return err;
>>
>> +   threshold = atomic_long_read(>managed_pages) * 
>> reporting_factor / 100;
> 
> So at 0 you are setting this threshold to 0, however based on the code
> below you are still pulling at least one page.
> 
>> spin_lock_irq(&zone->lock);
>>
>> /*
>> @@ -181,6 +185,8 @@ void __page_reporting_notify(void)
>>
>> /* Attempt to pull page from list and place in scatterlist */
>> if (*offset) {
>> +   unsigned long nr_pages;
>> +
>> if (!__isolate_free_page(page, order)) {
>> next = page;
>> break;
>> @@ -190,6 +196,12 @@ void __page_reporting_notify(void)
>> --(*offset);
>> sg_set_page(&sgl[*offset], page, page_len, 0);
>>
>> +   nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << 
>> order;
>> +   if (zone->reported_pages + nr_pages >= threshold) {
>> +   err = 1;
>> + 

Re: [PATCH 0/4] mm/page_reporting: Some knobs and fixes

2021-04-01 Thread Xunlei Pang
On 3/26/21 5:44 PM, Xunlei Pang wrote:
> Add the following knobs in PATCH 1~3:
>  /sys/kernel/mm/page_reporting/reported_kbytes
>  /sys/kernel/mm/page_reporting/refault_kbytes
>  /sys/kernel/mm/page_reporting/reporting_factor
> 
> Fix unexpected user OOM in PATCH 4.
> 
> Xunlei Pang (4):
>   mm/page_reporting: Introduce free page reported counters
>   mm/page_reporting: Introduce free page reporting factor
>   mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
>   mm/page_reporting: Fix possible user allocation failure
> 
>  Documentation/admin-guide/kernel-parameters.txt |   3 +
>  include/linux/mmzone.h  |   3 +
>  mm/page_alloc.c |   6 +-
>  mm/page_reporting.c | 268 
> ++--
>  4 files changed, 260 insertions(+), 20 deletions(-)
> 

Hi guys,

Looks "Alexander Duyck " was not
available, so Cced more, any comment?

Thanks!


[PATCH 4/4] mm/page_reporting: Fix possible user allocation failure

2021-03-26 Thread Xunlei Pang
We encountered user memory allocation failures (OOM) on our
512 MiB tiny instances; they did not happen after turning off
page reporting.

After some debugging, it turned out that 32*4MB=128MB of order-10
free pages were isolated during the reporting window, leaving no
free memory available.

This can also happen on large instances when free memory is low.

This patch introduces a rule that limits the reporting capacity
according to the current free memory, reducing the capacity for
higher orders that would otherwise break this rule.

For example, with 100 MiB free, the sgl capacity for different orders is:
   order-9 : 32
   order-10: 16
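
As a rough check of those numbers (assuming 4 KiB base pages, so an order-9
block is 2 MiB and an order-10 block is 4 MiB):

   order-9 : 32 entries * 2 MiB = 64 MiB  <= ~100 MiB free, capacity stays 32
   order-10: 32 entries * 4 MiB = 128 MiB >  ~100 MiB free, capacity is halved to 16 (64 MiB)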

Reported-by: Helin Guo 
Tested-by: Helin Guo 
Signed-off-by: Xunlei Pang 
---
 mm/page_reporting.c | 89 +++--
 1 file changed, 72 insertions(+), 17 deletions(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 6ffedb8..2ec0ec0 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -129,8 +129,8 @@ void __page_reporting_notify(void)
  */
 static int
 page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
-unsigned int order, unsigned int mt,
-struct scatterlist *sgl, unsigned int *offset)
+unsigned int order, unsigned int mt, struct scatterlist 
*sgl,
+const unsigned int capacity, unsigned int *offset)
 {
struct free_area *area = >free_area[order];
struct list_head *list = >free_list[mt];
@@ -161,10 +161,10 @@ void __page_reporting_notify(void)
 * list processed. This should result in us reporting all pages on
 * an idle system in about 30 seconds.
 *
-* The division here should be cheap since PAGE_REPORTING_CAPACITY
-* should always be a power of 2.
+* The division here should be cheap since capacity should
+* always be a power of 2.
 */
-   budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+   budget = DIV_ROUND_UP(area->nr_free, capacity * 16);
 
/* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
@@ -196,7 +196,7 @@ void __page_reporting_notify(void)
--(*offset);
sg_set_page([*offset], page, page_len, 0);
 
-   nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
+   nr_pages = (capacity - *offset) << order;
if (zone->reported_pages + nr_pages >= threshold) {
err = 1;
break;
@@ -217,10 +217,10 @@ void __page_reporting_notify(void)
spin_unlock_irq(>lock);
 
/* begin processing pages in local list */
-   err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+   err = prdev->report(prdev, sgl, capacity);
 
/* reset offset since the full list was reported */
-   *offset = PAGE_REPORTING_CAPACITY;
+   *offset = capacity;
 
/* update budget to reflect call to report function */
budget--;
@@ -229,7 +229,7 @@ void __page_reporting_notify(void)
spin_lock_irq(>lock);
 
/* flush reported pages from the sg list */
-   page_reporting_drain(prdev, sgl, zone, PAGE_REPORTING_CAPACITY, 
!err);
+   page_reporting_drain(prdev, sgl, zone, capacity, !err);
 
/*
 * Reset next to first entry, the old next isn't valid
@@ -251,12 +251,39 @@ void __page_reporting_notify(void)
return err;
 }
 
+/*
+ * For guest with little free memory, we should tune reporting capacity
+ * correctly to avoid reporting too much once, otherwise user allocation
+ * may fail and OOM during reporting window between __isolate_free_page()
+ * and page_reporting_drain().
+ *
+ * Calculate from which order we begin to reduce the scatterlist capacity,
+ * in order not to isolate too many pages to fail the user allocation.
+ */
+static unsigned int calculate_zone_order_threshold(struct zone *z)
+{
+   unsigned int order;
+   long pages_threshold;
+
+   pages_threshold = zone_page_state(z, NR_FREE_PAGES) - 
low_wmark_pages(z);
+   for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+   if ((PAGE_REPORTING_CAPACITY << order) > pages_threshold)
+   break;
+   }
+
+   return order;
+}
+
 static int
 page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
 {
-   unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+   unsigned int order, mt, leftover, offset;
unsigned long watermark, threshold;
+   unsigned int capacity = PAGE_REPORTING_CAPACITY;
+   

[PATCH 1/4] mm/page_reporting: Introduce free page reported counters

2021-03-26 Thread Xunlei Pang
It's useful to know how much memory has actually been reported,
so add a new zone::reported_pages field to record that.

Add "/sys/kernel/mm/page_reporting/reported_kbytes" for the amount
of memory that has been reported.

Add "/sys/kernel/mm/page_reporting/refault_kbytes" for the accumulated
amount of memory that has refaulted in after being reported out.
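
For example (an illustration of how the two counters relate, based on the
accounting in this patch): if a zone has 1000 pages reported and the allocator
later pulls 250 of those pages back off the free list, reported_kbytes for
that zone drops by 250 pages' worth while refault_kbytes grows by the same
amount.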

Signed-off-by: Xunlei Pang 
---
 include/linux/mmzone.h |   3 ++
 mm/page_alloc.c|   4 +-
 mm/page_reporting.c| 112 +++--
 mm/page_reporting.h|   5 +++
 4 files changed, 119 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946ce..ebd169f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -530,6 +530,9 @@ struct zone {
atomic_long_t   managed_pages;
unsigned long   spanned_pages;
unsigned long   present_pages;
+#ifdef CONFIG_PAGE_REPORTING
+   unsigned long   reported_pages;
+#endif
 #ifdef CONFIG_CMA
unsigned long   cma_pages;
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee..c2c5688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -930,8 +930,10 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
   unsigned int order)
 {
/* clear reported state and update reported page count */
-   if (page_reported(page))
+   if (page_reported(page)) {
__ClearPageReported(page);
+   page_reporting_update_refault(zone, 1 << order);
+   }
 
list_del(&page->lru);
__ClearPageBuddy(page);
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index c50d93f..ba195ea 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include 
 #include 
 #include 
 #include 
@@ -19,6 +20,22 @@ enum {
PAGE_REPORTING_ACTIVE
 };
 
+#ifdef CONFIG_SYSFS
+static struct percpu_counter refault_pages;
+
+void page_reporting_update_refault(struct zone *zone, unsigned int pages)
+{
+   zone->reported_pages -= pages;
+   percpu_counter_add_batch(&refault_pages, pages, INT_MAX / 2);
+}
+#else
+void page_reporting_update_refault(struct zone *zone, unsigned int pages)
+{
+   zone->reported_pages -= pages;
+}
+#endif
+
+
 /* request page reporting */
 static void
 __page_reporting_request(struct page_reporting_dev_info *prdev)
@@ -66,7 +83,8 @@ void __page_reporting_notify(void)
 
 static void
 page_reporting_drain(struct page_reporting_dev_info *prdev,
-struct scatterlist *sgl, unsigned int nents, bool reported)
+struct scatterlist *sgl, struct zone *zone,
+unsigned int nents, bool reported)
 {
struct scatterlist *sg = sgl;
 
@@ -92,8 +110,10 @@ void __page_reporting_notify(void)
 * report on the new larger page when we make our way
 * up to that higher order.
 */
-   if (PageBuddy(page) && buddy_order(page) == order)
+   if (PageBuddy(page) && buddy_order(page) == order) {
__SetPageReported(page);
+   zone->reported_pages += (1 << order);
+   }
} while ((sg = sg_next(sg)));
 
/* reinitialize scatterlist now that it is empty */
@@ -197,7 +217,7 @@ void __page_reporting_notify(void)
spin_lock_irq(&zone->lock);
 
/* flush reported pages from the sg list */
-   page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+   page_reporting_drain(prdev, sgl, zone, PAGE_REPORTING_CAPACITY, !err);
 
/*
 * Reset next to first entry, the old next isn't valid
@@ -260,7 +280,7 @@ void __page_reporting_notify(void)
 
/* flush any remaining pages out from the last report */
spin_lock_irq(&zone->lock);
-   page_reporting_drain(prdev, sgl, leftover, !err);
+   page_reporting_drain(prdev, sgl, zone, leftover, !err);
spin_unlock_irq(&zone->lock);
}
 
@@ -362,3 +382,87 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
mutex_unlock(&page_reporting_mutex);
 }
 EXPORT_SYMBOL_GPL(page_reporting_unregister);
+
+#ifdef CONFIG_SYSFS
+#define REPORTING_ATTR(_name) \
+   static struct kobj_attribute _name##_attr = \
+   __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static unsigned long get_reported_kbytes(void)
+{
+   struct zone *z;
+   unsigned long nr_reported = 0;
+
+   for_each_populated_zone(z)
+   nr_reported += z->reported_pages;
+
+   return nr_reported << (PAGE_SHIFT - 10);
+}
+
+static ssize_t reported_kbytes_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+  

[PATCH 3/4] mm/page_reporting: Introduce "page_reporting_factor=" boot parameter

2021-03-26 Thread Xunlei Pang
Currently the default behaviour (value 100) is to do a full report.
Although we can control it after boot via the "reporting_factor"
sysfs knob, that can still be too late because some amount of memory
has already been reported before the knob is adjusted.

Sometimes we really want it safely off by default and to turn it on
at runtime as needed; the "page_reporting_factor=" boot parameter
provides that guarantee and accommodates different default settings.

There's also a real-world problem I noticed on tiny instances:
some memory is always reported during the boot stage, before the
application starts and uses up memory, which retriggers EPT faults
after boot.

The following data (right after boot) indicates that 172032 KiB of
pages were unnecessarily reported and refaulted in:
 $ cat /sys/kernel/mm/page_reporting/refault_kbytes
 172032
 $ cat /sys/kernel/mm/page_reporting/reported_kbytes
 0

Thus it's reasonable to turn the page reporting off by default and
enable it at runtime as needed.
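
For example (illustrative usage of the knobs added by this series):

 page_reporting_factor=0      (kernel command line: boot with reporting off)
 $ echo 50 > /sys/kernel/mm/page_reporting/reporting_factor
                              (later, allow reporting of up to 50% of each zone)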

Signed-off-by: Xunlei Pang 
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 mm/page_reporting.c | 13 +
 2 files changed, 16 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0454572..46e296c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3524,6 +3524,9 @@
off: turn off poisoning (default)
on: turn on poisoning
 
+   page_reporting_factor= [KNL] Guest Free Page Reporting percentile.
+   [0, 100]: 0 - off, not report; 100 - default, full report.
+
panic=  [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 86c6479..6ffedb8 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -524,3 +524,16 @@ static int __init page_reporting_init(void)
 }
 
 module_init(page_reporting_init);
+
+static int __init setup_reporting_factor(char *str)
+{
+   int v;
+
+   if (kstrtoint(str, 10, &v))
+   return -EINVAL;
+   if (v >= 0 && v <= 100)
+   reporting_factor = v;
+
+   return 0;
+}
+__setup("page_reporting_factor=", setup_reporting_factor);
-- 
1.8.3.1



[PATCH 0/4] mm/page_reporting: Some knobs and fixes

2021-03-26 Thread Xunlei Pang
Add the following knobs in PATCH 1~3:
 /sys/kernel/mm/page_reporting/reported_kbytes
 /sys/kernel/mm/page_reporting/refault_kbytes
 /sys/kernel/mm/page_reporting/reporting_factor

Fix unexpected user OOM in PATCH 4.

Xunlei Pang (4):
  mm/page_reporting: Introduce free page reported counters
  mm/page_reporting: Introduce free page reporting factor
  mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
  mm/page_reporting: Fix possible user allocation failure

 Documentation/admin-guide/kernel-parameters.txt |   3 +
 include/linux/mmzone.h  |   3 +
 mm/page_alloc.c |   6 +-
 mm/page_reporting.c | 268 ++--
 4 files changed, 260 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH 2/4] mm/page_reporting: Introduce free page reporting factor

2021-03-26 Thread Xunlei Pang
Add new "/sys/kernel/mm/page_reporting/reporting_factor"
within [0, 100], and stop page reporting when it reaches
the configured threshold. Default is 100 which means no
limitation is imposed. Percentile is adopted to reflect
the fact that it reports on the per-zone basis.

We can control the total number of reporting pages via
this knob to avoid EPT violations which may affect the
performance of the business, imagine the guest memory
allocation burst or host long-tail memory reclaiming
really hurt.
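
For example (a rough illustration, assuming 4 KiB pages): on a zone with
managed_pages around 1048576 pages (~4 GiB) and reporting_factor set to 50,
the per-zone threshold is 524288 pages, so reporting for that zone stops
once roughly 2 GiB of it has been reported.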

This knob helps build customized control policies according to
VM priority; it is also useful for testing, gray (canary) releases, etc.

Signed-off-by: Xunlei Pang 
---
 mm/page_reporting.c | 60 -
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index ba195ea..86c6479 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,6 +11,8 @@
 #include "page_reporting.h"
 #include "internal.h"
 
+static int reporting_factor = 100;
+
 #define PAGE_REPORTING_DELAY   (2 * HZ)
 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
 
@@ -134,6 +136,7 @@ void __page_reporting_notify(void)
struct list_head *list = &area->free_list[mt];
unsigned int page_len = PAGE_SIZE << order;
struct page *page, *next;
+   unsigned long threshold;
long budget;
int err = 0;
 
@@ -144,6 +147,7 @@ void __page_reporting_notify(void)
if (list_empty(list))
return err;
 
+   threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
spin_lock_irq(&zone->lock);
 
/*
@@ -181,6 +185,8 @@ void __page_reporting_notify(void)
 
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
+   unsigned long nr_pages;
+
if (!__isolate_free_page(page, order)) {
next = page;
break;
@@ -190,6 +196,12 @@ void __page_reporting_notify(void)
--(*offset);
sg_set_page(&sgl[*offset], page, page_len, 0);
 
+   nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order;
+   if (zone->reported_pages + nr_pages >= threshold) {
+   err = 1;
+   break;
+   }
+
continue;
}
 
@@ -244,9 +256,13 @@ void __page_reporting_notify(void)
struct scatterlist *sgl, struct zone *zone)
 {
unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
-   unsigned long watermark;
+   unsigned long watermark, threshold;
int err = 0;
 
+   threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100;
+   if (zone->reported_pages >= threshold)
+   return err;
+
/* Generate minimum watermark to be able to guarantee progress */
watermark = low_wmark_pages(zone) +
(PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
@@ -267,11 +283,18 @@ void __page_reporting_notify(void)
 
err = page_reporting_cycle(prdev, zone, order, mt,
   sgl, &offset);
+   /* Exceed threshold go to report leftover */
+   if (err > 0) {
+   err = 0;
+   goto leftover;
+   }
+
if (err)
return err;
}
}
 
+leftover:
/* report the leftover pages before going idle */
leftover = PAGE_REPORTING_CAPACITY - offset;
if (leftover) {
@@ -435,9 +458,44 @@ static ssize_t refault_kbytes_store(struct kobject *kobj,
 }
 REPORTING_ATTR(refault_kbytes);
 
+static ssize_t reporting_factor_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%u\n", reporting_factor);
+}
+
+static ssize_t reporting_factor_store(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   const char *buf, size_t count)
+{
+   int new, old, err;
+   struct page *page;
+
+   err = kstrtoint(buf, 10, &new);
+   if (err || (new < 0 || new > 100))
+   return -EINVAL;
+
+   old = reporting_factor;
+   reporting_factor = new;
+
+   if (new <= old)
+   goto out;
+
+   /* Trigger reporting with new larger reporting_factor */
+   page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN,
+   PAGE_REPORTING_MIN_ORDER);
+   if (page)
+   __free_pages(page, PAGE_REPORTING_MIN_ORDER);
+
+out:
+   return count;
+}
+REPORTING_ATTR(

Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects

2021-03-18 Thread Xunlei Pang



On 3/18/21 8:18 PM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends long time iterating
>> in case of large amount of partial page lists, which can cause
>> thunder herd effect to the list_lock contention.
>>
>> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> the RT figures fluctuated randomly, then we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> caused network timeouts.
> 
> I forgot to ask, how does "ss" come into this? It displays network connections
> AFAIK. Does it read any SLUB counters or slabinfo?
> 

ss may access /proc/slabinfo to acquire network related slab statistics.


Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects

2021-03-17 Thread Xunlei Pang
On 3/18/21 2:45 AM, Vlastimil Babka wrote:
> On 3/17/21 8:54 AM, Xunlei Pang wrote:
>> The node list_lock in count_partial() spends long time iterating
>> in case of large amount of partial page lists, which can cause
>> thunder herd effect to the list_lock contention.
>>
>> We have HSF RT(High-speed Service Framework Response-Time) monitors,
>> the RT figures fluctuated randomly, then we deployed a tool detecting
>> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
>> the list_lock cost nearly 100ms with irq off issued by "ss", this also
>> caused network timeouts.
>>
>> This patch introduces two counters to maintain the actual number
>> of partial objects dynamically instead of iterating the partial
>> page lists with list_lock held.
>>
>> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
>> The main operations are under list_lock in slow path, its performance
>> impact is expected to be minimal except the __slab_free() path.
>>
>> The only concern of introducing partial counter is that partial_free_objs
>> may cause cacheline contention and false sharing issues in case of same
>> SLUB concurrent __slab_free(), so define it to be a percpu counter and
>> places it carefully.
> 
> Hm I wonder, is it possible that this will eventually overflow/underflow the
> counter on some CPU? (I guess practially only on 32bit). Maybe the operations
> that are already done under n->list_lock should flush the percpu counter to a
> shared counter?

You are right, thanks a lot for noticing this.

> 
> ...
> 
>> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct 
>> page *page,
>>  head, new.counters,
>>  "__slab_free"));
>>  
>> +if (!was_frozen && prior) {
>> +if (n)
>> +__update_partial_free(n, cnt);
>> +else
>> +__update_partial_free(get_node(s, page_to_nid(page)), 
>> cnt);
>> +}
> 
> I would guess this is the part that makes your measurements notice that
> (although tiny) difference. We didn't need to obtain the node pointer before 
> and
> now we do. And that is really done just for the per-node breakdown in 
> "objects"
> and "objects_partial" files under /sys/kernel/slab - distinguishing nodes is 
> not
> needed for /proc/slabinfo. So that kinda justifies putting this under a new
> CONFIG as you did. Although perhaps somebody interested in these kind of stats
> would enable CONFIG_SLUB_STATS anyway, so that's still an option to use 
> instead
> of introducing a new oddly specific CONFIG? At least until somebody comes up 
> and
> presents an use case where they want the per-node breakdowns in /sys but 
> cannot
> afford CONFIG_SLUB_STATS.
> 
> But I'm also still thinking about simply counting all free objects (for the
> purposes of accurate  in /proc/slabinfo) as a percpu variable in
> struct kmem_cache itself. That would basically put this_cpu_add() in all the
> fast paths, but AFAICS thanks to the segment register it doesn't mean 
> disabling
> interrupts nor a LOCK operation, so maybe it wouldn't be that bad? And it
> shouldn't need to deal with these node pointers. So maybe that would be
> acceptable for CONFIG_SLUB_DEBUG? Guess I'll have to try...
> 

The percpu operation itself should be fine; it looks like a cacheline
ping-pong issue due to the extra percpu counter access, so making it
cacheline-aligned improves things a little according to my tests.
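
For reference, a rough sketch of the per-cache percpu counting Vlastimil
describes above (the field and helper names are hypothetical, and this is not
part of the posted series):

	/* hypothetical percpu free-object counter hung off struct kmem_cache */
	struct kmem_cache {
		/* ... existing fields ... */
		unsigned long __percpu *free_objs;
	};

	/* fast-path accounting: this_cpu_add() needs no lock or irq disabling */
	static inline void note_object_free(struct kmem_cache *s)
	{
		this_cpu_add(*s->free_objs, 1);
	}

	static inline void note_object_alloc(struct kmem_cache *s)
	{
		this_cpu_add(*s->free_objs, -1);
	}

	/* slow, approximate readout for /proc/slabinfo */
	static unsigned long cache_free_objects(struct kmem_cache *s)
	{
		long sum = per_cpu_sum(*s->free_objs);

		return sum > 0 ? sum : 0;
	}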


[PATCH v4 3/3] mm/slub: Get rid of count_partial()

2021-03-17 Thread Xunlei Pang
Now that the partial counters are ready, let's use them to get rid
of count_partial().

The partial counters are used to calculate accurate partial usage
when CONFIG_SLUB_DEBUG_PARTIAL is on; otherwise zero usage statistics
are simply assumed.

Tested-by: James Wang 
Signed-off-by: Xunlei Pang 
---
 mm/slub.c | 64 +++
 1 file changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 856aea4..9bff669 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-   return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
return atomic_long_read(&n->total_objects);
@@ -2545,18 +2540,33 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-   int (*get_count)(struct page *))
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL, PARTIAL_SLAB };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+   enum partial_item item)
 {
-   unsigned long flags;
-   unsigned long x = 0;
-   struct page *page;
+   unsigned long ret = 0;
 
-   spin_lock_irqsave(&n->list_lock, flags);
-   list_for_each_entry(page, &n->partial, slab_list)
-   x += get_count(page);
-   spin_unlock_irqrestore(&n->list_lock, flags);
-   return x;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+   if (item == PARTIAL_FREE) {
+   ret = per_cpu_sum(*n->partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
+   } else if (item == PARTIAL_TOTAL) {
+   ret = n->partial_total_objs;
+   } else if (item == PARTIAL_INUSE) {
+   ret = per_cpu_sum(*n->partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
+   ret = n->partial_total_objs - ret;
+   if ((long)ret < 0)
+   ret = 0;
+   } else { /* item == PARTIAL_SLAB */
+   ret = n->nr_partial;
+   }
+#endif
+
+   return ret;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
@@ -2587,7 +2597,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
unsigned long nr_objs;
unsigned long nr_free;
 
-   nr_free  = count_partial(n, count_free);
+   nr_free  = partial_counter(n, PARTIAL_FREE);
nr_slabs = node_nr_slabs(n);
nr_objs  = node_nr_objs(n);
 
@@ -4654,18 +4664,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-   return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-   return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -5102,7 +5100,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
x = atomic_long_read(&n->total_objects);
else if (flags & SO_OBJECTS)
x = atomic_long_read(&n->total_objects) -
-   count_partial(n, count_free);
+   partial_counter(n, PARTIAL_FREE);
else
x = atomic_long_read(&n->nr_slabs);
total += x;
@@ -5116,11 +5114,11 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
for_each_kmem_cache_node(s, node, n) {
if (flags & SO_TOTAL)
-   x = count_partial(n, count_total);
+   x = partial_counter(n, PARTIAL_TOTAL);
else if (flags & SO_OBJECTS)
-   x = count_partial(n, count_inuse);
+   x = partial_counter(n, PARTIAL_INUSE);
else
-   x = n->nr_partial;
+   x = partial_counter(n, PARTIAL_SLAB);
total += x;
nodes[node] += x;
}
@@ -5884,7 +5882,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
for_each_kmem_cache_node(s, node, n) {
nr_slabs += node_nr_slabs(n);
nr_objs += node_nr_objs(n);
-   nr_free += count_partial(n, count_free);
+   nr_free += partial_counter(n, PARTIAL_FREE);
}
 
sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1



[PATCH v4 0/3] mm/slub: Fix count_partial() problem

2021-03-17 Thread Xunlei Pang
count_partial() can hold the n->list_lock spinlock for quite a long
time, which causes much trouble for the system. This series eliminates
the problem.

v1->v2:
- Improved changelog and variable naming for PATCH 1~2.
- PATCH3 adds per-cpu counter to avoid performance regression
  in concurrent __slab_free().

v2->v3:
- Changed "page->inuse" to the safe "new.inuse", etc.
- Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
- atomic_long_t -> unsigned long

v3->v4:
- Introduced a new CONFIG_SLUB_DEBUG_PARTIAL option so that accurate counting
  can be enabled for production use.
- Merged PATCH 4 into PATCH 1.

[Testing]
There seems to be a slight performance impact under extremely
concurrent __slab_free() calls according to my tests.

On my 32-cpu 2-socket physical machine:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 2

== original, no patched
Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.536050899 seconds time elapsed 
 ( +-  0.24% )


Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.588049142 seconds time elapsed 
 ( +-  0.35% )


== patched with patch1~4
Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.670892273 seconds time elapsed 
 ( +-  0.29% )


Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.746755689 seconds time elapsed 
 ( +-  0.21% )


2) perf stat --null --repeat 10 -- hackbench 32 thread 2

== original, no patched
 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.784911855 seconds time elapsed 
 ( +-  0.14% )

 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.868687608 seconds time elapsed 
 ( +-  0.19% )

== patched with patch1~4
 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.681273015 seconds time elapsed 
 ( +-  0.21% )

 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.681238459 seconds time elapsed     
 ( +-  0.09% )

Xunlei Pang (3):
  mm/slub: Introduce two counters for partial objects
  percpu: Export per_cpu_sum()
  mm/slub: Get rid of count_partial()

 include/linux/percpu-defs.h   |  10 
 init/Kconfig  |  13 +
 kernel/locking/percpu-rwsem.c |  10 
 mm/slab.h |   6 ++
 mm/slub.c | 129 +-
 5 files changed, 120 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH v4 1/3] mm/slub: Introduce two counters for partial objects

2021-03-17 Thread Xunlei Pang
The node list_lock in count_partial() is held for a long time while
iterating over large partial page lists, which can cause a thundering
herd effect on list_lock contention.

We have HSF RT (High-speed Service Framework Response-Time) monitors
whose RT figures fluctuated randomly. We deployed a tool that detects
"irq off" and "preempt off" periods and dumps the culprit's calltrace;
it captured the list_lock being held for nearly 100ms with irqs off,
triggered by "ss", which also caused network timeouts.

This patch introduces two counters to maintain the actual number
of partial objects dynamically instead of iterating the partial
page lists with list_lock held.

New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
The main operations are under list_lock in the slow path, so the
performance impact is expected to be minimal except for the
__slab_free() path.

The only concern with introducing the partial counters is that
partial_free_objs may cause cacheline contention and false sharing
under concurrent __slab_free() on the same SLUB, so define it as a
percpu counter and place it carefully.

As "Vlastimil Babka" and "Christoph Lameter" both suggested that we
don't care about the accurate partial objects, and in case that somebody
might actually want accurate object statistics at the expense of peak
performance, and it would be nice to give them such option in SLUB.

Thus a new CONFIG_SLUB_DEBUG_PARTIAL is introduced to maintain accurate
partial objects, which is default off to avoid unexpected performance
degradation.

When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs or proc files
related to the partial list (/proc/slabinfo, sysfs objects_partial, etc.)
assume zero usage data, e.g.,
  the "active_objs" and "num_objs" fields of "/proc/slabinfo" are equal;
  "cat /sys/kernel/slab//partial" displays "0".

Tested-by: James Wang 
Signed-off-by: Xunlei Pang 
---
 init/Kconfig | 13 +
 mm/slab.h|  6 ++
 mm/slub.c| 63 
 3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 22946fe..686851b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1867,6 +1867,19 @@ config SLUB_DEBUG
  SLUB sysfs support. /sys/slab will not exist and there will be
  no support for cache validation etc.
 
+config SLUB_DEBUG_PARTIAL
+   default n
+   bool "Enable SLUB debugging for the node partial list" if EXPERT
+   depends on SLUB && SYSFS
+   help
+ Maintain runtime counters for the node partial list debugging.
+ When CONFIG_SLUB_DEBUG_PARTIAL is not selected, the sysfs or proc
+ files related to the partial list assume zero data usage, e.g.,
+ value of "active_objs" and "num_objs" of "/proc/slabinfo" equals.
+ "cat /sys/kernel/slab//partial" displays "0".
+
+ It might have slight performance impact for production use.
+
 config COMPAT_BRK
bool "Disable heap randomization"
default y
diff --git a/mm/slab.h b/mm/slab.h
index 076582f..f47c811 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -546,12 +546,18 @@ struct kmem_cache_node {
 
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+   unsigned long partial_total_objs;
+#endif
struct list_head partial;
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
struct list_head full;
 #endif
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+   unsigned long __percpu *partial_free_objs cacheline_aligned_in_smp;
+#endif
 #endif
 
 };
diff --git a/mm/slub.c b/mm/slub.c
index e26c274..856aea4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 /*
  * Management of partially allocated slabs.
  */
+#ifdef CONFIG_SLUB_DEBUG_PARTIAL
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+   this_cpu_add(*n->partial_free_objs, delta);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+   n->partial_total_objs += delta;
+}
+#else
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta) { }
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta) { }
+#endif
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
n->nr_partial++;
+   __update_partial_total(n, page->objects);
if (tail == DEACTIVATE_TO_TAIL)
list_add_tail(&page->slab_list, &n->partial);
else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
lockdep_assert_held(&n->list_lock);
list_del(&page->slab_list);
n->n

[PATCH v4 2/3] percpu: Export per_cpu_sum()

2021-03-17 Thread Xunlei Pang
per_cpu_sum() is useful, and deserves to be exported.
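
A quick usage sketch (the counter below is hypothetical, e.g. allocated with
alloc_percpu(unsigned long)):

	static unsigned long __percpu *events;

	static void note_event(void)
	{
		this_cpu_inc(*events);		/* each CPU updates only its own slot */
	}

	static unsigned long total_events(void)
	{
		return per_cpu_sum(*events);	/* a reader folds all CPU slots together */
	}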

Tested-by: James Wang 
Signed-off-by: Xunlei Pang 
---
 include/linux/percpu-defs.h   | 10 ++
 kernel/locking/percpu-rwsem.c | 10 --
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index dff7040..0e71b68 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -220,6 +220,16 @@
(void)__vpp_verify; \
 } while (0)
 
+#define per_cpu_sum(var)   \
+({ \
+   typeof(var) __sum = 0;  \
+   int cpu;\
+   compiletime_assert_atomic_type(__sum);  \
+   for_each_possible_cpu(cpu)  \
+   __sum += per_cpu(var, cpu); \
+   __sum;  \
+})
+
 #ifdef CONFIG_SMP
 
 /*
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a5..0980e51 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, 
bool try)
 }
 EXPORT_SYMBOL_GPL(__percpu_down_read);
 
-#define per_cpu_sum(var)   \
-({ \
-   typeof(var) __sum = 0;  \
-   int cpu;\
-   compiletime_assert_atomic_type(__sum);  \
-   for_each_possible_cpu(cpu)  \
-   __sum += per_cpu(var, cpu); \
-   __sum;  \
-})
-
 /*
  * Return true if the modular sum of the sem->read_count per-CPU variable is
  * zero.  If this sum is zero, then it is stable due to the fact that if any
-- 
1.8.3.1



Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-16 Thread Xunlei Pang
On 3/16/21 7:02 PM, Vlastimil Babka wrote:
> On 3/16/21 11:42 AM, Xunlei Pang wrote:
>> On 3/16/21 2:49 AM, Vlastimil Babka wrote:
>>> On 3/9/21 4:25 PM, Xunlei Pang wrote:
>>>> count_partial() can hold n->list_lock spinlock for quite long, which
>>>> makes much trouble to the system. This series eliminate this problem.
>>>
>>> Before I check the details, I have two high-level comments:
>>>
>>> - patch 1 introduces some counting scheme that patch 4 then changes, could 
>>> we do
>>> this in one step to avoid the churn?
>>>
>>> - the series addresses the concern that spinlock is being held, but doesn't
>>> address the fact that counting partial per-node slabs is not nearly enough 
>>> if we
>>> want accurate  in /proc/slabinfo because there are also percpu
>>> slabs and per-cpu partial slabs, where we don't track the free objects at 
>>> all.
>>> So after this series while the readers of /proc/slabinfo won't block the
>>> spinlock, they will get the same garbage data as before. So Christoph is not
>>> wrong to say that we can just report active_objs == num_objs and it won't
>>> actually break any ABI.
>>
>> If maintainers don't mind this inaccuracy which I also doubt its
>> importance, then it becomes easy. For fear that some people who really
>> cares, introducing an extra config(default-off) for it would be a good
>> option.
> 
> Great.
> 
>>> At the same time somebody might actually want accurate object statistics at 
>>> the
>>> expense of peak performance, and it would be nice to give them such option 
>>> in
>>> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
>>> although that option provides many additional tuning stats, with additional
>>> overhead.
>>> So my proposal would be a new config for "accurate active objects" (or just 
>>> tie
>>> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters 
>>> in
>>> patch 4 to all alloc/free, so that it includes percpu slabs. Without this 
>>> config
>>> enabled, let's just report active_objs == num_objs.
>> For percpu slabs, the numbers can be retrieved from the existing
>> slub_percpu_partial()->pobjects, looks no need extra work.
> 
> Hm, unfortunately it's not that simple, the number there is a snapshot that 
> can
> become wildly inacurate afterwards.
> 

It's hard to make it absolutely accurate using percpu counters: the data
can change while you iterate over all the CPUs and total_objects, and I
can't imagine a real-world usage that needs this, not to mention the
percpu freelist cache. I think the sysfs slabs_cpu_partial file should be
enough for common debugging purposes.


Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-16 Thread Xunlei Pang
On 3/16/21 2:49 AM, Vlastimil Babka wrote:
> On 3/9/21 4:25 PM, Xunlei Pang wrote:
>> count_partial() can hold n->list_lock spinlock for quite long, which
>> makes much trouble to the system. This series eliminate this problem.
> 
> Before I check the details, I have two high-level comments:
> 
> - patch 1 introduces some counting scheme that patch 4 then changes, could we 
> do
> this in one step to avoid the churn?
> 
> - the series addresses the concern that spinlock is being held, but doesn't
> address the fact that counting partial per-node slabs is not nearly enough if 
> we
> want accurate  in /proc/slabinfo because there are also percpu
> slabs and per-cpu partial slabs, where we don't track the free objects at all.
> So after this series while the readers of /proc/slabinfo won't block the
> spinlock, they will get the same garbage data as before. So Christoph is not
> wrong to say that we can just report active_objs == num_objs and it won't
> actually break any ABI.

If the maintainers don't mind this inaccuracy (and I also doubt its
importance), then it becomes easy. In case some people really do care,
introducing an extra config option (default off) for it would be a
good choice.

> At the same time somebody might actually want accurate object statistics at 
> the
> expense of peak performance, and it would be nice to give them such option in
> SLUB. Right now we don't provide this accuracy even with CONFIG_SLUB_STATS,
> although that option provides many additional tuning stats, with additional
> overhead.
> So my proposal would be a new config for "accurate active objects" (or just 
> tie
> it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu counters in
> patch 4 to all alloc/free, so that it includes percpu slabs. Without this 
> config
> enabled, let's just report active_objs == num_objs.
For percpu slabs, the numbers can be retrieved from the existing
slub_percpu_partial()->pobjects; it looks like no extra work is needed.

> 
> Vlastimil
> 
>> v1->v2:
>> - Improved changelog and variable naming for PATCH 1~2.
>> - PATCH3 adds per-cpu counter to avoid performance regression
>>   in concurrent __slab_free().
>>
>> v2->v3:
>> - Changed "page->inuse" to the safe "new.inuse", etc.
>> - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
>> - atomic_long_t -> unsigned long
>>
>> [Testing]
>> There seems might be a little performance impact under extreme
>> __slab_free() concurrent calls according to my tests.
>>
>> On my 32-cpu 2-socket physical machine:
>> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>
>> 1) perf stat --null --repeat 10 -- hackbench 20 thread 2
>>
>> == original, no patched
>> Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>
>>   24.536050899 seconds time elapsed  
>> ( +-  0.24% )
>>
>>
>> Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>
>>   24.588049142 seconds time elapsed  
>> ( +-  0.35% )
>>
>>
>> == patched with patch1~4
>> Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>
>>   24.670892273 seconds time elapsed  
>> ( +-  0.29% )
>>
>>
>> Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>
>>   24.746755689 seconds time elapsed  
>> ( +-  0.21% )
>>
>>
>> 2) perf stat --null --repeat 10 -- hackbench 32 thread 2
>>
>> == original, no patched
>>  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
>>
>>   39.784911855 seconds time elapsed  
>> ( +-  0.14% )
>>
>>  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
>>
>>   39.868687608 seconds time elapsed  
>> ( +-  0.19% )
>>
>> == patched with patch1~4
>>  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
>>
>>   39.681273015 seconds time elapsed  
>> ( +-  0.21% )
>>
>>  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
>>
>>   39.681238459 seconds time elapsed  
>> ( +-  0.09% )
>>
>>
>> Xunlei Pang (4):
>>   mm/slub: Introduce two counters for partial objects
>>   mm/slub: Get rid of count_partial()
>>   percpu: Export per_cpu_sum()
>>   mm/slub: Use percpu partial free counter
>>
>>  include/linux/percpu-defs.h   |  10 
>>  kernel/locking/percpu-rwsem.c |  10 
>>  mm/slab.h |   4 ++
>>  mm/slub.c | 120 
>> +-
>>  4 files changed, 97 insertions(+), 47 deletions(-)
>>


[PATCH v3 2/4] mm/slub: Get rid of count_partial()

2021-03-09 Thread Xunlei Pang
Now that the partial counters are ready, let's use them directly
and get rid of count_partial().

Tested-by: James Wang 
Reviewed-by: Pekka Enberg 
Signed-off-by: Xunlei Pang 
---
 mm/slub.c | 54 ++
 1 file changed, 22 insertions(+), 32 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4d02831..3f76b57 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2533,11 +2533,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-   return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
return atomic_long_read(&n->total_objects);
@@ -2545,19 +2540,26 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-   int (*get_count)(struct page *))
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+   enum partial_item item)
 {
-   unsigned long flags;
-   unsigned long x = 0;
-   struct page *page;
+   unsigned long ret = 0;
 
-   spin_lock_irqsave(&n->list_lock, flags);
-   list_for_each_entry(page, &n->partial, slab_list)
-   x += get_count(page);
-   spin_unlock_irqrestore(&n->list_lock, flags);
-   return x;
+   if (item == PARTIAL_FREE) {
+   ret = atomic_long_read(&n->partial_free_objs);
+   } else if (item == PARTIAL_TOTAL) {
+   ret = n->partial_total_objs;
+   } else if (item == PARTIAL_INUSE) {
+   ret = n->partial_total_objs - atomic_long_read(&n->partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
+   }
+
+   return ret;
 }
+
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
 static noinline void
@@ -2587,7 +2589,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
unsigned long nr_objs;
unsigned long nr_free;
 
-   nr_free  = count_partial(n, count_free);
+   nr_free  = partial_counter(n, PARTIAL_FREE);
nr_slabs = node_nr_slabs(n);
nr_objs  = node_nr_objs(n);
 
@@ -4643,18 +4645,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-   return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-   return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -5091,7 +5081,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
x = atomic_long_read(&n->total_objects);
else if (flags & SO_OBJECTS)
x = atomic_long_read(&n->total_objects) -
-   count_partial(n, count_free);
+   partial_counter(n, PARTIAL_FREE);
else
x = atomic_long_read(&n->nr_slabs);
total += x;
@@ -5105,9 +5095,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
for_each_kmem_cache_node(s, node, n) {
if (flags & SO_TOTAL)
-   x = count_partial(n, count_total);
+   x = partial_counter(n, PARTIAL_TOTAL);
else if (flags & SO_OBJECTS)
-   x = count_partial(n, count_inuse);
+   x = partial_counter(n, PARTIAL_INUSE);
else
x = n->nr_partial;
total += x;
@@ -5873,7 +5863,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
for_each_kmem_cache_node(s, node, n) {
nr_slabs += node_nr_slabs(n);
nr_objs += node_nr_objs(n);
-   nr_free += count_partial(n, count_free);
+   nr_free += partial_counter(n, PARTIAL_FREE);
}
 
sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1



[PATCH v3 1/4] mm/slub: Introduce two counters for partial objects

2021-03-09 Thread Xunlei Pang
The node list_lock in count_partial() is held for a long time while
iterating over large partial page lists, which can cause a thundering
herd effect on list_lock contention.

We have HSF RT (High-speed Service Framework Response-Time) monitors
whose RT figures fluctuated randomly. We deployed a tool that detects
"irq off" and "preempt off" periods and dumps the culprit's calltrace;
it captured the list_lock being held for nearly 100ms with irqs off,
triggered by "ss", which also caused network timeouts.

This patch introduces two counters to maintain the actual number
of partial objects dynamically instead of iterating the partial
page lists with list_lock held.

New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
The main operations are under list_lock in the slow path, so the
performance impact should be minimal except for the __slab_free() path,
which will be addressed later.

Tested-by: James Wang 
Reviewed-by: Pekka Enberg 
Signed-off-by: Xunlei Pang 
---
 mm/slab.h |  4 
 mm/slub.c | 46 +-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 076582f..817bfa0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -547,6 +547,10 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+   atomic_long_t partial_free_objs;
+   unsigned long partial_total_objs;
+#endif
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index e26c274..4d02831 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1890,10 +1890,31 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
 /*
  * Management of partially allocated slabs.
  */
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+   atomic_long_add(delta, &n->partial_free_objs);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+   n->partial_total_objs += delta;
+}
+#else
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta) { }
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta) { }
+#endif
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
n->nr_partial++;
+   __update_partial_total(n, page->objects);
if (tail == DEACTIVATE_TO_TAIL)
list_add_tail(&page->slab_list, &n->partial);
else
@@ -1913,6 +1934,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
lockdep_assert_held(&n->list_lock);
list_del(&page->slab_list);
n->nr_partial--;
+   __update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1957,6 +1979,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
return NULL;
 
remove_partial(n, page);
+   __update_partial_free(n, -*objects);
WARN_ON(!freelist);
return freelist;
 }
@@ -2286,8 +2309,11 @@ static void deactivate_slab(struct kmem_cache *s, struct page *page,
"unfreezing slab"))
goto redo;
 
-   if (lock)
+   if (lock) {
+   if (m == M_PARTIAL)
+   __update_partial_free(n, new.objects - new.inuse);
spin_unlock(&n->list_lock);
+   }
 
if (m == M_PARTIAL)
stat(s, tail);
@@ -2353,6 +2379,7 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_page = page;
} else {
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, new.objects - new.inuse);
stat(s, FREE_ADD_PARTIAL);
}
}
@@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
head, new.counters,
"__slab_free"));
 
+   if (!was_frozen && prior) {
+   if (n)
+   __update_partial_free(n, cnt);
+   else
+   __update_partial_free(get_node(s, page_to_nid(page)), cnt);
+   }
+
if (likely(!n)) {
 
if (likely(was_frozen)) {
@@ -3069,6 +3103,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
remove_full(s, n, page);
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, cnt);
stat(s, FREE_ADD_PARTIAL);
}
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3080,6 +3115,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 * Slab on the partial list.
 

[PATCH v3 3/4] percpu: Export per_cpu_sum()

2021-03-09 Thread Xunlei Pang
per_cpu_sum() is useful, and deserves to be exported.

Tested-by: James Wang 
Signed-off-by: Xunlei Pang 
---
 include/linux/percpu-defs.h   | 10 ++
 kernel/locking/percpu-rwsem.c | 10 --
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index dff7040..0e71b68 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -220,6 +220,16 @@
(void)__vpp_verify; \
 } while (0)
 
+#define per_cpu_sum(var)   \
+({ \
+   typeof(var) __sum = 0;  \
+   int cpu;\
+   compiletime_assert_atomic_type(__sum);  \
+   for_each_possible_cpu(cpu)  \
+   __sum += per_cpu(var, cpu); \
+   __sum;  \
+})
+
 #ifdef CONFIG_SMP
 
 /*
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a5..0980e51 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -178,16 +178,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, 
bool try)
 }
 EXPORT_SYMBOL_GPL(__percpu_down_read);
 
-#define per_cpu_sum(var)   \
-({ \
-   typeof(var) __sum = 0;  \
-   int cpu;\
-   compiletime_assert_atomic_type(__sum);  \
-   for_each_possible_cpu(cpu)  \
-   __sum += per_cpu(var, cpu); \
-   __sum;  \
-})
-
 /*
  * Return true if the modular sum of the sem->read_count per-CPU variable is
  * zero.  If this sum is zero, then it is stable due to the fact that if any
-- 
1.8.3.1



[PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-09 Thread Xunlei Pang
count_partial() can hold the n->list_lock spinlock for quite a long
time, which causes a lot of trouble for the system. This series
eliminates that problem.
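
For reference, the problematic pattern is roughly the following
(simplified from the count_partial() that this series removes): the node
lock is held for the full walk of the partial list, so the hold time
grows with the list length.

/* Simplified sketch of the behaviour this series removes. */
static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;

	spin_lock_irqsave(&n->list_lock, flags);	/* held for the whole walk */
	list_for_each_entry(page, &n->partial, slab_list)
		x += get_count(page);			/* O(length of partial list) */
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}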

v1->v2:
- Improved changelog and variable naming for PATCH 1~2.
- PATCH3 adds per-cpu counter to avoid performance regression
  in concurrent __slab_free().

v2->v3:
- Changed "page->inuse" to the safe "new.inuse", etc.
- Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
- atomic_long_t -> unsigned long

[Testing]
There might be a small performance impact under extreme concurrent
__slab_free() calls, according to my tests.

On my 32-cpu 2-socket physical machine:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz

1) perf stat --null --repeat 10 -- hackbench 20 thread 2

== original, no patched
Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.536050899 seconds time elapsed 
 ( +-  0.24% )


Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.588049142 seconds time elapsed 
 ( +-  0.35% )


== patched with patch1~4
Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.670892273 seconds time elapsed 
 ( +-  0.29% )


Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  24.746755689 seconds time elapsed 
 ( +-  0.21% )


2) perf stat --null --repeat 10 -- hackbench 32 thread 2

== original, no patched
 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.784911855 seconds time elapsed 
 ( +-  0.14% )

 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.868687608 seconds time elapsed 
 ( +-  0.19% )

== patched with patch1~4
 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.681273015 seconds time elapsed 
 ( +-  0.21% )

 Performance counter stats for 'hackbench 32 thread 2' (10 runs):

  39.681238459 seconds time elapsed     
 ( +-  0.09% )


Xunlei Pang (4):
  mm/slub: Introduce two counters for partial objects
  mm/slub: Get rid of count_partial()
  percpu: Export per_cpu_sum()
  mm/slub: Use percpu partial free counter

 include/linux/percpu-defs.h   |  10 
 kernel/locking/percpu-rwsem.c |  10 
 mm/slab.h |   4 ++
 mm/slub.c | 120 +-
 4 files changed, 97 insertions(+), 47 deletions(-)

-- 
1.8.3.1



[PATCH v3 4/4] mm/slub: Use percpu partial free counter

2021-03-09 Thread Xunlei Pang
The only concern with introducing the partial counter is that
partial_free_objs may cause cacheline and atomic-operation
contention when the same SLUB cache sees concurrent __slab_free().

This patch changes it to a percpu counter and also rearranges the
counter fields to avoid cacheline issues.
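
In other words, a sketch of the before/after update path, using the
field names from this series:

/*
 * v2: one shared atomic per node; concurrent __slab_free() on the same
 * cache bounces this cacheline and contends on the atomic RMW.
 */
atomic_long_add(delta, &n->partial_free_objs);

/*
 * v3: CPU-local update, no shared cacheline on the fast path; readers
 * sum all CPUs (per_cpu_sum() from patch 3/4) and may see a slightly
 * stale or transiently negative value, which is clamped to 0 on read.
 */
this_cpu_add(*n->partial_free_objs, delta);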

Tested-by: James Wang 
Reviewed-by: Pekka Enberg 
Signed-off-by: Xunlei Pang 
---
 mm/slab.h |  6 --
 mm/slub.c | 30 +++---
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 817bfa0..c819597 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -546,16 +546,18 @@ struct kmem_cache_node {
 
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
-   struct list_head partial;
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-   atomic_long_t partial_free_objs;
unsigned long partial_total_objs;
 #endif
+   struct list_head partial;
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
struct list_head full;
 #endif
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+   unsigned long __percpu *partial_free_objs;
+#endif
 #endif
 
 };
diff --git a/mm/slub.c b/mm/slub.c
index 3f76b57..b6ec065 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1894,7 +1894,7 @@ static void discard_slab(struct kmem_cache *s, struct 
page *page)
 static inline void
 __update_partial_free(struct kmem_cache_node *n, long delta)
 {
-   atomic_long_add(delta, &n->partial_free_objs);
+   this_cpu_add(*n->partial_free_objs, delta);
 }
 
 static inline void
@@ -2548,11 +2548,16 @@ static unsigned long partial_counter(struct 
kmem_cache_node *n,
unsigned long ret = 0;
 
if (item == PARTIAL_FREE) {
-   ret = atomic_long_read(&n->partial_free_objs);
+   ret = per_cpu_sum(*n->partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
} else if (item == PARTIAL_TOTAL) {
ret = n->partial_total_objs;
} else if (item == PARTIAL_INUSE) {
-   ret = n->partial_total_objs - 
atomic_long_read(&n->partial_free_objs);
+   ret = per_cpu_sum(*n->partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
+   ret = n->partial_total_objs - ret;
if ((long)ret < 0)
ret = 0;
}
@@ -3552,14 +3557,16 @@ static inline int calculate_order(unsigned int size)
return -ENOSYS;
 }
 
-static void
+static int
 init_kmem_cache_node(struct kmem_cache_node *n)
 {
n->nr_partial = 0;
spin_lock_init(&n->list_lock);
INIT_LIST_HEAD(&n->partial);
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-   atomic_long_set(&n->partial_free_objs, 0);
+   n->partial_free_objs = alloc_percpu(unsigned long);
+   if (!n->partial_free_objs)
+   return -ENOMEM;
n->partial_total_objs = 0;
 #endif
 #ifdef CONFIG_SLUB_DEBUG
@@ -3567,6 +3574,8 @@ static inline int calculate_order(unsigned int size)
atomic_long_set(&n->total_objects, 0);
INIT_LIST_HEAD(&n->full);
 #endif
+
+   return 0;
 }
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
@@ -3626,7 +3635,7 @@ static void early_kmem_cache_node_alloc(int node)
page->inuse = 1;
page->frozen = 0;
kmem_cache_node->node[node] = n;
-   init_kmem_cache_node(n);
+   BUG_ON(init_kmem_cache_node(n) < 0);
inc_slabs_node(kmem_cache_node, node, page->objects);
 
/*
@@ -3644,6 +3653,9 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 
for_each_kmem_cache_node(s, node, n) {
s->node[node] = NULL;
+#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
+   free_percpu(n->partial_free_objs);
+#endif
kmem_cache_free(kmem_cache_node, n);
}
 }
@@ -3674,7 +3686,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
return 0;
}
 
-   init_kmem_cache_node(n);
+   if (init_kmem_cache_node(n) < 0) {
+   free_kmem_cache_nodes(s);
+   return 0;
+   }
+
s->node[node] = n;
}
return 1;
-- 
1.8.3.1



Re: [PATCH v2 3/3] mm/slub: Use percpu partial free counter

2021-03-03 Thread Xunlei Pang
On 3/2/21 5:14 PM, Christoph Lameter wrote:
> On Mon, 10 Aug 2020, Xunlei Pang wrote:
> 
>>
>> diff --git a/mm/slab.h b/mm/slab.h
>> index c85e2fa..a709a70 100644
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -616,7 +616,7 @@ struct kmem_cache_node {
>>  #ifdef CONFIG_SLUB
>>  unsigned long nr_partial;
>>  struct list_head partial;
>> -atomic_long_t partial_free_objs;
>> +atomic_long_t __percpu *partial_free_objs;
> 
> A percpu counter is never atomic. Just use unsigned long and use this_cpu
> operations for this thing. That should cut down further on the overhead.
> 
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1775,11 +1775,21 @@ static void discard_slab(struct kmem_cache *s, 
>> struct page *page)
>>  /*
>>   * Management of partially allocated slabs.
>>   */
>> +static inline long get_partial_free(struct kmem_cache_node *n)
>> +{
>> +long nr = 0;
>> +int cpu;
>> +
>> +for_each_possible_cpu(cpu)
>> +nr += atomic_long_read(per_cpu_ptr(n->partial_free_objs, cpu));
> 
> this_cpu_read(*n->partial_free_objs)
> 
>>  static inline void
>>  __update_partial_free(struct kmem_cache_node *n, long delta)
>>  {
>> -atomic_long_add(delta, >partial_free_objs);
>> +atomic_long_add(delta, this_cpu_ptr(n->partial_free_objs));
> 
> this_cpu_add()
> 
> and so on.
> 

Thanks, I changed them both to use "unsigned long", and will send v3 out
after our internal performance regression test passes.
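
For reference, a condensed sketch of what that change amounts to
(allocation, update and teardown of a plain unsigned long percpu
counter; the complete version is in the v3 posting):

/* init_kmem_cache_node(): allocate the per-CPU counter. */
n->partial_free_objs = alloc_percpu(unsigned long);
if (!n->partial_free_objs)
	return -ENOMEM;

/* Writers: lockless, CPU-local update. */
this_cpu_add(*n->partial_free_objs, delta);

/* Readers: approximate sum across all possible CPUs. */
total = per_cpu_sum(*n->partial_free_objs);

/* free_kmem_cache_nodes(): release it again. */
free_percpu(n->partial_free_objs);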


Re: [PATCH v2 0/3] mm/slub: Fix count_partial() problem

2021-03-03 Thread Xunlei Pang
On 3/1/21 6:31 PM, Shu Ming wrote:
> Any progress on this?  The problem addressed by this patch has also
> made jitters to our online apps which are quite annoying.
> 

Thanks for the attention.

There are some further improvements over v2; I'm going to send v3 out later.

> On Mon, Aug 24, 2020 at 6:05 PM xunlei  wrote:
>>
>> On 2020/8/20 PM 10:02, Pekka Enberg wrote:
>>> On Mon, Aug 10, 2020 at 3:18 PM Xunlei Pang  
>>> wrote:
>>>>
>>>> v1->v2:
>>>> - Improved changelog and variable naming for PATCH 1~2.
>>>> - PATCH3 adds per-cpu counter to avoid performance regression
>>>>   in concurrent __slab_free().
>>>>
>>>> [Testing]
>>>> On my 32-cpu 2-socket physical machine:
>>>> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>>>> perf stat --null --repeat 10 -- hackbench 20 thread 2
>>>>
>>>> == original, no patched
>>>>   19.211637055 seconds time elapsed
>>>>   ( +-  0.57% )
>>>>
>>>> == patched with patch1~2
>>>>  Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>>>
>>>>   21.731833146 seconds time elapsed    
>>>>   ( +-  0.17% )
>>>>
>>>> == patched with patch1~3
>>>>  Performance counter stats for 'hackbench 20 thread 2' (10 runs):
>>>>
>>>>   19.112106847 seconds time elapsed
>>>>   ( +-  0.64% )
>>>>
>>>>
>>>> Xunlei Pang (3):
>>>>   mm/slub: Introduce two counters for partial objects
>>>>   mm/slub: Get rid of count_partial()
>>>>   mm/slub: Use percpu partial free counter
>>>>
>>>>  mm/slab.h |   2 +
>>>>  mm/slub.c | 124 
>>>> +++---
>>>>  2 files changed, 89 insertions(+), 37 deletions(-)
>>>
>>> We probably need to wrap the counters under CONFIG_SLUB_DEBUG because
>>> AFAICT all the code that uses them is also wrapped under it.
>>
>> /sys/kernel/slab/***/partial sysfs also uses it, I can wrap it with
>> CONFIG_SLUB_DEBUG or CONFIG_SYSFS for backward compatibility.
>>
>>>
>>> An alternative approach for this patch would be to somehow make the
>>> lock in count_partial() more granular, but I don't know how feasible
>>> that actually is.
>>>
>>> Anyway, I am OK with this approach:
>>>
>>> Reviewed-by: Pekka Enberg 
>>
>> Thanks!
>>
>>>
>>> You still need to convince Christoph, though, because he had
>>> objections over this approach.
>>
>> Christoph, what do you think, or any better suggestion to address this
>> *in production* issue?
>>
>>>
>>> - Pekka
>>>


[tip: sched/core] sched/fair: Fix wrong cpu selecting from isolated domain

2020-09-29 Thread tip-bot2 for Xunlei Pang
The following commit has been merged into the sched/core branch of tip:

Commit-ID: df3cb4ea1fb63ff326488efd671ba3c39034255e
Gitweb:
https://git.kernel.org/tip/df3cb4ea1fb63ff326488efd671ba3c39034255e
Author:Xunlei Pang 
AuthorDate:Thu, 24 Sep 2020 14:48:47 +08:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 25 Sep 2020 14:23:25 +02:00

sched/fair: Fix wrong cpu selecting from isolated domain

We've met problems where tasks with a full cpumask (e.g. because
they were put into a cpuset or set to full affinity) were
occasionally migrated to our isolated cpus in our production
environment.

After some analysis, we found that this is due to the current
select_idle_smt() not considering the sched_domain mask.

Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
   (thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang 
Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Jiang Biao 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlp...@linux.alibaba.com
---
 kernel/sched/fair.c |  9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a15deb2..9613e5d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6080,7 +6080,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int 
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
int cpu;
 
@@ -6088,7 +6088,8 @@ static int select_idle_smt(struct task_struct *p, int 
target)
return -1;
 
for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
@@ -6104,7 +6105,7 @@ static inline int select_idle_core(struct task_struct *p, 
struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
 }
@@ -6279,7 +6280,7 @@ symmetric:
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, target);
+   i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 


Re: [PATCH RESEND] sched/fair: Fix wrong cpu selecting from isolated domain

2020-09-24 Thread Xunlei Pang
On 9/24/20 3:18 PM, Vincent Guittot wrote:
> On Thu, 24 Sep 2020 at 08:48, Xunlei Pang  wrote:
>>
>> We've met problems that occasionally tasks with full cpumask
>> (e.g. by putting it into a cpuset or setting to full affinity)
>> were migrated to our isolated cpus in production environment.
>>
>> After some analysis, we found that it is due to the current
>> select_idle_smt() not considering the sched_domain mask.
>>
>> Steps to reproduce on my 31-CPU hyperthreads machine:
>> 1. with boot parameter: "isolcpus=domain,2-31"
>>(thread lists: 0,16 and 1,17)
>> 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
>> 3. some threads will be migrated to the isolated cpu16~17.
>>
>> Fix it by checking the valid domain mask in select_idle_smt().
>>
>> Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings())
>> Reported-by: Wetp Zhang 
>> Reviewed-by: Jiang Biao 
>> Signed-off-by: Xunlei Pang 
> 
> Reviewed-by: Vincent Guittot 
> 

Thanks, Vincent :-)


[PATCH RESEND] sched/fair: Fix wrong cpu selecting from isolated domain

2020-09-24 Thread Xunlei Pang
We've met problems where tasks with a full cpumask (e.g. because
they were put into a cpuset or set to full affinity) were
occasionally migrated to our isolated cpus in our production
environment.

After some analysis, we found that this is due to the current
select_idle_smt() not considering the sched_domain mask.

Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
   (thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang 
Reviewed-by: Jiang Biao 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a05..fa942c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
int cpu;
 
@@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int 
target)
return -1;
 
for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
@@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct *p, 
struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
 }
@@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, target);
+   i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-- 
1.8.3.1



Re: [PATCH] mm: memcg: yield cpu when we fail to charge pages

2020-09-13 Thread Xunlei Pang
On 2020/9/9 AM2:50, Julius Hemanth Pitti wrote:
> For non root CG, in try_charge(), we keep trying
> to charge until we succeed. On non-preemptive
> kernel, when we are OOM, this results in holding
> CPU forever.
> 
> On SMP systems, this doesn't create a big problem
> because oom_reaper get a change to kill victim
> and make some free pages. However on a single-core
> CPU (or cases where oom_reaper pinned to same CPU
> where try_charge is executing), oom_reaper shall
> never get scheduled and we stay in try_charge forever.
> 
> Steps to repo this on non-smp:
> 1. mount -t tmpfs none /sys/fs/cgroup
> 2. mkdir /sys/fs/cgroup/memory
> 3. mount -t cgroup none /sys/fs/cgroup/memory -o memory
> 4. mkdir /sys/fs/cgroup/memory/0
> 5. echo 40M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
> 6. echo $$ > /sys/fs/cgroup/memory/0/tasks
> 7. stress -m 5 --vm-bytes 10M --vm-hang 0
> 
> Signed-off-by: Julius Hemanth Pitti 
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0d6f3ea86738..4620d70267cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2652,6 +2652,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
>   if (fatal_signal_pending(current))
>   goto force;
>  
> + cond_resched();
> +
>   /*
>* keep retrying as long as the memcg oom killer is able to make
>* a forward progress or bypass the charge if the oom killer
> 

This should be fixed by:
https://lkml.org/lkml/2020/8/26/1440

Thanks,
Xunlei


Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain

2020-08-27 Thread Xunlei Pang
On 2020/8/24 PM8:30, Xunlei Pang wrote:
> We've met problems that occasionally tasks with full cpumask
> (e.g. by putting it into a cpuset or setting to full affinity)
> were migrated to our isolated cpus in production environment.
> 
> After some analysis, we found that it is due to the current
> select_idle_smt() not considering the sched_domain mask.
> 
> Fix it by checking the valid domain mask in select_idle_smt().
> 
> Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings())
> Reported-by: Wetp Zhang 
> Signed-off-by: Xunlei Pang 
> ---
>  kernel/sched/fair.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a05..fa942c4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, 
> struct sched_domain *sd, int
>  /*
>   * Scan the local SMT mask for idle CPUs.
>   */
> -static int select_idle_smt(struct task_struct *p, int target)
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, 
> int target)
>  {
>   int cpu;
>  
> @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int 
> target)
>   return -1;
>  
>   for_each_cpu(cpu, cpu_smt_mask(target)) {
> - if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> + !cpumask_test_cpu(cpu, sched_domain_span(sd)))
>   continue;
>   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>   return cpu;
> @@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct 
> *p, struct sched_domain *s
>   return -1;
>  }
>  
> -static inline int select_idle_smt(struct task_struct *p, int target)
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
> *sd, int target)
>  {
>   return -1;
>  }
> @@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, 
> int prev, int target)
>   if ((unsigned)i < nr_cpumask_bits)
>   return i;
>  
> - i = select_idle_smt(p, target);
> + i = select_idle_smt(p, sd, target);
>   if ((unsigned)i < nr_cpumask_bits)
>   return i;
>  
> 

Hi Peter, any other comments?


[PATCH v3] mm: memcg: Fix memcg reclaim soft lockup

2020-08-26 Thread Xunlei Pang
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:
 watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
 CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
 Call Trace:
  shrink_lruvec+0x49f/0x640
  shrink_node+0x2a6/0x6f0
  do_try_to_free_pages+0xe9/0x3e0
  try_to_free_mem_cgroup_pages+0xef/0x1f0
  try_charge+0x2c1/0x750
  mem_cgroup_charge+0xd7/0x240
  __add_to_page_cache_locked+0x2fd/0x370
  add_to_page_cache_lru+0x4a/0xc0
  pagecache_get_page+0x10b/0x2f0
  filemap_fault+0x661/0xad0
  ext4_filemap_fault+0x2c/0x40
  __do_fault+0x4d/0xf9
  handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there is no chance
for the oom reaper to run and reclaim the to-be-killed process.

Add a cond_resched() at the top of shrink_node_memcgs() to solve this
issue. This means we get a scheduling point for each memcg in the
reclaimed hierarchy, without any dependency on the amount of
reclaimable memory in that memcg, which makes it more predictable.

Acked-by: Chris Down 
Acked-by: Michal Hocko 
Suggested-by: Michal Hocko 
Signed-off-by: Xunlei Pang 
---
 mm/vmscan.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796..9727dd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2615,6 +2615,14 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct 
scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
 
+   /*
+* This loop can become CPU-bound when target memcgs
+* aren't eligible for reclaim - either because they
+* don't have any reclaimable pages, or because their
+* memory is explicitly protected. Avoid soft lockups.
+*/
+   cond_resched();
+
mem_cgroup_calculate_protection(target_memcg, memcg);
 
if (mem_cgroup_below_min(memcg)) {
-- 
1.8.3.1



[PATCH v2] mm: memcg: Fix memcg reclaim soft lockup

2020-08-26 Thread Xunlei Pang
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:
 watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
 CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
 Call Trace:
  shrink_lruvec+0x49f/0x640
  shrink_node+0x2a6/0x6f0
  do_try_to_free_pages+0xe9/0x3e0
  try_to_free_mem_cgroup_pages+0xef/0x1f0
  try_charge+0x2c1/0x750
  mem_cgroup_charge+0xd7/0x240
  __add_to_page_cache_locked+0x2fd/0x370
  add_to_page_cache_lru+0x4a/0xc0
  pagecache_get_page+0x10b/0x2f0
  filemap_fault+0x661/0xad0
  ext4_filemap_fault+0x2c/0x40
  __do_fault+0x4d/0xf9
  handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there is no chance
for the oom reaper to run and reclaim the to-be-killed process.

Add cond_resched() at the top of shrink_node_memcgs() to solve this
issue, and any other possible issue like memory.min protection.

Suggested-by: Michal Hocko 
Signed-off-by: Xunlei Pang 
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796..bbdc38b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct 
scan_control *sc)
 
mem_cgroup_calculate_protection(target_memcg, memcg);
 
+   cond_resched();
+
if (mem_cgroup_below_min(memcg)) {
/*
 * Hard protection.
-- 
1.8.3.1



Re: [PATCH] mm: memcg: Fix memcg reclaim soft lockup

2020-08-26 Thread Xunlei Pang
On 2020/8/26 PM 7:45, xunlei wrote:
> On 2020/8/26 PM 7:00, Michal Hocko wrote:
>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>> On 2020/8/26 PM 4:11, Michal Hocko wrote:
>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>> the target memcg doesn't have any reclaimable memory.
>>>>
>>>> Do you have any scenario when this happens or is this some sort of a
>>>> test case?
>>>
>>> It can happen on tiny guest scenarios.
>>
>> OK, you made me more curious. If this is a tiny guest and this is a hard
>> limit reclaim path then we should trigger an oom killer which should
>> kill the offender and that in turn bail out from the try_charge lopp
>> (see should_force_charge). So how come this repeats enough in your setup
>> that it causes soft lockups?
>>
> 
> oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>get_order(nr_pages * PAGE_SIZE));
> switch (oom_status) {
> case OOM_SUCCESS:
> nr_retries = MAX_RECLAIM_RETRIES;

Actually we could add a cond_resched() here, but I think it's better
to have one in the memcg reclaim path to avoid other unexpected
issues (see the sketch at the end of this mail).

> goto retry;
> 
> It retries here endlessly, because oom reaper has no cpu to schedule.
> 
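
For illustration, a rough sketch of the placement discussed above (not
the submitted fix; the actual change adds the scheduling point in the
memcg reclaim path instead):

/*
 * Simplified view of the try_charge() OOM retry path on a 1-cpu,
 * CONFIG_PREEMPT_NONE=y kernel.
 */
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
			    get_order(nr_pages * PAGE_SIZE));
if (oom_status == OOM_SUCCESS) {
	nr_retries = MAX_RECLAIM_RETRIES;
	/*
	 * Without a scheduling point somewhere in this loop the oom
	 * reaper never gets to run, so no memory is ever freed and
	 * the retry loop spins forever.
	 */
	cond_resched();
	goto retry;
}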


[PATCH] mm: memcg: Fix memcg reclaim soft lockup

2020-08-26 Thread Xunlei Pang
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:
 watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
 CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
 Call Trace:
  shrink_lruvec+0x49f/0x640
  shrink_node+0x2a6/0x6f0
  do_try_to_free_pages+0xe9/0x3e0
  try_to_free_mem_cgroup_pages+0xef/0x1f0
  try_charge+0x2c1/0x750
  mem_cgroup_charge+0xd7/0x240
  __add_to_page_cache_locked+0x2fd/0x370
  add_to_page_cache_lru+0x4a/0xc0
  pagecache_get_page+0x10b/0x2f0
  filemap_fault+0x661/0xad0
  ext4_filemap_fault+0x2c/0x40
  __do_fault+0x4d/0xf9
  handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there is no chance
for the oom reaper to run and reclaim the to-be-killed process.

Add cond_resched() in such cases at the beginning of shrink_lruvec()
to yield the cpu to others.

Signed-off-by: Xunlei Pang 
---
 mm/vmscan.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796..349a88e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2449,6 +2449,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct 
scan_control *sc)
scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 sc->priority == DEF_PRIORITY);
 
+   /* memcg reclaim may run into no reclaimable lru pages */
+   if (nr[LRU_ACTIVE_FILE] == 0 &&
+   nr[LRU_INACTIVE_FILE] == 0 &&
+   nr[LRU_INACTIVE_ANON] == 0)
+   cond_resched();
+
blk_start_plug();
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
-- 
1.8.3.1



[PATCH] sched/fair: Fix wrong cpu selecting from isolated domain

2020-08-24 Thread Xunlei Pang
We've met problems where tasks with a full cpumask (e.g. because
they were put into a cpuset or set to full affinity) were
occasionally migrated to our isolated cpus in our production
environment.

After some analysis, we found that this is due to the current
select_idle_smt() not considering the sched_domain mask.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang 
Signed-off-by: Xunlei Pang 
---
 kernel/sched/fair.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a05..fa942c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, struct 
sched_domain *sd, int
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int 
target)
 {
int cpu;
 
@@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int 
target)
return -1;
 
for_each_cpu(cpu, cpu_smt_mask(target)) {
-   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+   if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+   !cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
return cpu;
@@ -6099,7 +6100,7 @@ static inline int select_idle_core(struct task_struct *p, 
struct sched_domain *s
return -1;
 }
 
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain 
*sd, int target)
 {
return -1;
 }
@@ -6274,7 +6275,7 @@ static int select_idle_sibling(struct task_struct *p, int 
prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-   i = select_idle_smt(p, target);
+   i = select_idle_smt(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
 
-- 
1.8.3.1



[PATCH v2 2/3] mm/slub: Get rid of count_partial()

2020-08-10 Thread Xunlei Pang
Now the partial counters are ready, let's use them directly
and get rid of count_partial().

Co-developed-by: Wen Yang 
Signed-off-by: Xunlei Pang 
---
 mm/slub.c | 57 -
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 6708d96..25a4421 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2414,11 +2414,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-   return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
return atomic_long_read(>total_objects);
@@ -2426,19 +2421,27 @@ static inline unsigned long node_nr_objs(struct 
kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-   int (*get_count)(struct page *))
-{
-   unsigned long flags;
-   unsigned long x = 0;
-   struct page *page;
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+   enum partial_item item)
+{
+   unsigned long ret = 0;
+
+   if (item == PARTIAL_FREE) {
+   ret = atomic_long_read(>partial_free_objs);
+   } else if (item == PARTIAL_TOTAL) {
+   ret = atomic_long_read(>partial_total_objs);
+   } else if (item == PARTIAL_INUSE) {
+   ret = atomic_long_read(>partial_total_objs) -
+   atomic_long_read(>partial_free_objs);
+   if ((long)ret < 0)
+   ret = 0;
+   }
 
-   spin_lock_irqsave(>list_lock, flags);
-   list_for_each_entry(page, >partial, slab_list)
-   x += get_count(page);
-   spin_unlock_irqrestore(>list_lock, flags);
-   return x;
+   return ret;
 }
+
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
 static noinline void
@@ -2468,7 +2471,7 @@ static unsigned long count_partial(struct kmem_cache_node 
*n,
unsigned long nr_objs;
unsigned long nr_free;
 
-   nr_free  = count_partial(n, count_free);
+   nr_free  = partial_counter(n, PARTIAL_FREE);
nr_slabs = node_nr_slabs(n);
nr_objs  = node_nr_objs(n);
 
@@ -,18 +4447,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
gfpflags,
 }
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-   return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-   return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -4912,7 +4903,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
x = atomic_long_read(>total_objects);
else if (flags & SO_OBJECTS)
x = atomic_long_read(>total_objects) -
-   count_partial(n, count_free);
+   partial_counter(n, PARTIAL_FREE);
else
x = atomic_long_read(>nr_slabs);
total += x;
@@ -4926,9 +4917,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
for_each_kmem_cache_node(s, node, n) {
if (flags & SO_TOTAL)
-   x = count_partial(n, count_total);
+   x = partial_counter(n, PARTIAL_TOTAL);
else if (flags & SO_OBJECTS)
-   x = count_partial(n, count_inuse);
+   x = partial_counter(n, PARTIAL_INUSE);
else
x = n->nr_partial;
total += x;
@@ -5961,7 +5952,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo 
*sinfo)
for_each_kmem_cache_node(s, node, n) {
nr_slabs += node_nr_slabs(n);
nr_objs += node_nr_objs(n);
-   nr_free += count_partial(n, count_free);
+   nr_free += partial_counter(n, PARTIAL_FREE);
}
 
sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1



[PATCH v2 0/3] mm/slub: Fix count_partial() problem

2020-08-10 Thread Xunlei Pang
v1->v2:
- Improved changelog and variable naming for PATCH 1~2.
- PATCH3 adds per-cpu counter to avoid performance regression
  in concurrent __slab_free().

[Testing]
On my 32-cpu 2-socket physical machine:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
perf stat --null --repeat 10 -- hackbench 20 thread 2

== original, no patched
  19.211637055 seconds time elapsed 
 ( +-  0.57% )

== patched with patch1~2
 Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  21.731833146 seconds time elapsed 
 ( +-  0.17% )

== patched with patch1~3
 Performance counter stats for 'hackbench 20 thread 2' (10 runs):

  19.112106847 seconds time elapsed 
 ( +-  0.64% )


Xunlei Pang (3):
  mm/slub: Introduce two counters for partial objects
  mm/slub: Get rid of count_partial()
  mm/slub: Use percpu partial free counter

 mm/slab.h |   2 +
 mm/slub.c | 124 +++---
 2 files changed, 89 insertions(+), 37 deletions(-)

-- 
1.8.3.1



[PATCH v2 3/3] mm/slub: Use percpu partial free counter

2020-08-10 Thread Xunlei Pang
The only concern with introducing the partial counter is that
partial_free_objs may cause atomic-operation contention when the
same SLUB cache sees concurrent __slab_free().

This patch changes it to a percpu counter to avoid that.

Co-developed-by: Wen Yang 
Signed-off-by: Xunlei Pang 
---
 mm/slab.h |  2 +-
 mm/slub.c | 38 +++---
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index c85e2fa..a709a70 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -616,7 +616,7 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
-   atomic_long_t partial_free_objs;
+   atomic_long_t __percpu *partial_free_objs;
atomic_long_t partial_total_objs;
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
diff --git a/mm/slub.c b/mm/slub.c
index 25a4421..f6fc60b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1775,11 +1775,21 @@ static void discard_slab(struct kmem_cache *s, struct 
page *page)
 /*
  * Management of partially allocated slabs.
  */
+static inline long get_partial_free(struct kmem_cache_node *n)
+{
+   long nr = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   nr += atomic_long_read(per_cpu_ptr(n->partial_free_objs, cpu));
+
+   return nr;
+}
 
 static inline void
 __update_partial_free(struct kmem_cache_node *n, long delta)
 {
-   atomic_long_add(delta, >partial_free_objs);
+   atomic_long_add(delta, this_cpu_ptr(n->partial_free_objs));
 }
 
 static inline void
@@ -2429,12 +2439,12 @@ static unsigned long partial_counter(struct 
kmem_cache_node *n,
unsigned long ret = 0;
 
if (item == PARTIAL_FREE) {
-   ret = atomic_long_read(>partial_free_objs);
+   ret = get_partial_free(n);
} else if (item == PARTIAL_TOTAL) {
ret = atomic_long_read(>partial_total_objs);
} else if (item == PARTIAL_INUSE) {
ret = atomic_long_read(>partial_total_objs) -
-   atomic_long_read(>partial_free_objs);
+   get_partial_free(n);
if ((long)ret < 0)
ret = 0;
}
@@ -3390,19 +3400,28 @@ static inline int calculate_order(unsigned int size)
return -ENOSYS;
 }
 
-static void
+static int
 init_kmem_cache_node(struct kmem_cache_node *n)
 {
+   int cpu;
+
n->nr_partial = 0;
spin_lock_init(>list_lock);
INIT_LIST_HEAD(>partial);
-   atomic_long_set(>partial_free_objs, 0);
+
+   n->partial_free_objs = alloc_percpu(atomic_long_t);
+   if (!n->partial_free_objs)
+   return -ENOMEM;
+   for_each_possible_cpu(cpu)
+   atomic_long_set(per_cpu_ptr(n->partial_free_objs, cpu), 0);
atomic_long_set(>partial_total_objs, 0);
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_set(>nr_slabs, 0);
atomic_long_set(>total_objects, 0);
INIT_LIST_HEAD(>full);
 #endif
+
+   return 0;
 }
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
@@ -3463,7 +3482,7 @@ static void early_kmem_cache_node_alloc(int node)
page->inuse = 1;
page->frozen = 0;
kmem_cache_node->node[node] = n;
-   init_kmem_cache_node(n);
+   BUG_ON(init_kmem_cache_node(n) < 0);
inc_slabs_node(kmem_cache_node, node, page->objects);
 
/*
@@ -3481,6 +3500,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 
for_each_kmem_cache_node(s, node, n) {
s->node[node] = NULL;
+   free_percpu(n->partial_free_objs);
kmem_cache_free(kmem_cache_node, n);
}
 }
@@ -3511,7 +3531,11 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
return 0;
}
 
-   init_kmem_cache_node(n);
+   if (init_kmem_cache_node(n) < 0) {
+   free_kmem_cache_nodes(s);
+   return 0;
+   }
+
s->node[node] = n;
}
return 1;
-- 
1.8.3.1



[PATCH v2 1/3] mm/slub: Introduce two counters for partial objects

2020-08-10 Thread Xunlei Pang
count_partial() holds the node list_lock while iterating, and with a
large number of partial page lists this takes a long time, which can
cause a thundering-herd effect on the list_lock contention.

We have HSF RT (High-speed Service Framework Response-Time) monitors,
and the RT figures fluctuated randomly. We then deployed a tool that
detects "irq off" and "preempt off" periods and dumps the culprit's
calltrace; it captured the list_lock being held for nearly 100ms with
irqs off, triggered by "ss", which also caused network timeouts.

This patch introduces two counters to maintain the actual number
of partial objects dynamically instead of iterating the partial
page lists with list_lock held.

New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
The main updates happen under list_lock in the slow path, so the
performance impact should be minimal, except for the __slab_free()
path, which will be addressed later.

Acked-by: Pekka Enberg 
Co-developed-by: Wen Yang 
Signed-off-by: Xunlei Pang 
---
 mm/slab.h |  2 ++
 mm/slub.c | 37 -
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 7e94700..c85e2fa 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -616,6 +616,8 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
+   atomic_long_t partial_free_objs;
+   atomic_long_t partial_total_objs;
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 6589b41..6708d96 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1775,10 +1775,24 @@ static void discard_slab(struct kmem_cache *s, struct 
page *page)
 /*
  * Management of partially allocated slabs.
  */
+
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+   atomic_long_add(delta, >partial_free_objs);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+   atomic_long_add(delta, >partial_total_objs);
+}
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
n->nr_partial++;
+   __update_partial_total(n, page->objects);
if (tail == DEACTIVATE_TO_TAIL)
list_add_tail(>slab_list, >partial);
else
@@ -1798,6 +1812,7 @@ static inline void remove_partial(struct kmem_cache_node 
*n,
lockdep_assert_held(>list_lock);
list_del(>slab_list);
n->nr_partial--;
+   __update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1842,6 +1857,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
return NULL;
 
remove_partial(n, page);
+   __update_partial_free(n, -*objects);
WARN_ON(!freelist);
return freelist;
 }
@@ -2174,8 +2190,11 @@ static void deactivate_slab(struct kmem_cache *s, struct 
page *page,
"unfreezing slab"))
goto redo;
 
-   if (lock)
+   if (lock) {
+   if (m == M_PARTIAL)
+   __update_partial_free(n, page->objects - page->inuse);
spin_unlock(>list_lock);
+   }
 
if (m == M_PARTIAL)
stat(s, tail);
@@ -2241,6 +2260,7 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_page = page;
} else {
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, page->objects - page->inuse);
stat(s, FREE_ADD_PARTIAL);
}
}
@@ -2915,6 +2935,13 @@ static void __slab_free(struct kmem_cache *s, struct 
page *page,
head, new.counters,
"__slab_free"));
 
+   if (!was_frozen && prior) {
+   if (n)
+   __update_partial_free(n, cnt);
+   else
+   __update_partial_free(get_node(s, page_to_nid(page)), 
cnt);
+   }
+
if (likely(!n)) {
 
/*
@@ -2944,6 +2971,7 @@ static void __slab_free(struct kmem_cache *s, struct page 
*page,
if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
remove_full(s, n, page);
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, page->objects - page->inuse);
stat(s, FREE_ADD_PARTIAL);
}
spin_unlock_irqrestore(>list_lock, flags);
@@ -2955,6 +2983,7 @@ static void __slab_free(struct kmem_cache *s, struct page 
*page,
 * Slab on the partial list.
 */
remove_partial(n, page);
+   __update_partial_free(n, page->inuse - page->objects);
stat(s, FREE_REMOVE_PARTIAL);
} else {
/* Slab must be on the full list */
@@ -3364,6 +3393,8 @@ static inlin

[PATCH 2/2] mm/slub: Get rid of count_partial()

2020-07-02 Thread Xunlei Pang
Now the partial counters are ready, let's use them directly
and get rid of count_partial().

Co-developed-by: Wen Yang 
Signed-off-by: Xunlei Pang 
---
 mm/slub.c | 57 -
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 53890f3..f1946ed 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2414,11 +2414,6 @@ static inline int node_match(struct page *page, int node)
 }
 
 #ifdef CONFIG_SLUB_DEBUG
-static int count_free(struct page *page)
-{
-   return page->objects - page->inuse;
-}
-
 static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 {
return atomic_long_read(>total_objects);
@@ -2426,19 +2421,27 @@ static inline unsigned long node_nr_objs(struct 
kmem_cache_node *n)
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
-static unsigned long count_partial(struct kmem_cache_node *n,
-   int (*get_count)(struct page *))
-{
-   unsigned long flags;
-   unsigned long x = 0;
-   struct page *page;
+enum partial_item { PARTIAL_FREE, PARTIAL_INUSE, PARTIAL_TOTAL };
+
+static unsigned long partial_counter(struct kmem_cache_node *n,
+   enum partial_item item)
+{
+   unsigned long ret = 0;
+
+   if (item == PARTIAL_FREE) {
+   ret = atomic_long_read(>pfree_objects);
+   } else if (item == PARTIAL_TOTAL) {
+   ret = atomic_long_read(>total_objects);
+   } else if (item == PARTIAL_INUSE) {
+   ret = atomic_long_read(>total_objects) -
+   atomic_long_read(>pfree_objects);
+   if ((long)ret < 0)
+   ret = 0;
+   }
 
-   spin_lock_irqsave(>list_lock, flags);
-   list_for_each_entry(page, >partial, slab_list)
-   x += get_count(page);
-   spin_unlock_irqrestore(>list_lock, flags);
-   return x;
+   return ret;
 }
+
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
 
 static noinline void
@@ -2468,7 +2471,7 @@ static unsigned long count_partial(struct kmem_cache_node 
*n,
unsigned long nr_objs;
unsigned long nr_free;
 
-   nr_free  = count_partial(n, count_free);
+   nr_free  = partial_counter(n, PARTIAL_FREE);
nr_slabs = node_nr_slabs(n);
nr_objs  = node_nr_objs(n);
 
@@ -4445,18 +4448,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
gfpflags,
 }
 #endif
 
-#ifdef CONFIG_SYSFS
-static int count_inuse(struct page *page)
-{
-   return page->inuse;
-}
-
-static int count_total(struct page *page)
-{
-   return page->objects;
-}
-#endif
-
 #ifdef CONFIG_SLUB_DEBUG
 static void validate_slab(struct kmem_cache *s, struct page *page)
 {
@@ -4913,7 +4904,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
x = atomic_long_read(>total_objects);
else if (flags & SO_OBJECTS)
x = atomic_long_read(>total_objects) -
-   count_partial(n, count_free);
+   partial_counter(n, PARTIAL_FREE);
else
x = atomic_long_read(>nr_slabs);
total += x;
@@ -4927,9 +4918,9 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 
for_each_kmem_cache_node(s, node, n) {
if (flags & SO_TOTAL)
-   x = count_partial(n, count_total);
+   x = partial_counter(n, PARTIAL_TOTAL);
else if (flags & SO_OBJECTS)
-   x = count_partial(n, count_inuse);
+   x = partial_counter(n, PARTIAL_INUSE);
else
x = n->nr_partial;
total += x;
@@ -5962,7 +5953,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo 
*sinfo)
for_each_kmem_cache_node(s, node, n) {
nr_slabs += node_nr_slabs(n);
nr_objs += node_nr_objs(n);
-   nr_free += count_partial(n, count_free);
+   nr_free += partial_counter(n, PARTIAL_FREE);
}
 
sinfo->active_objs = nr_objs - nr_free;
-- 
1.8.3.1



[PATCH 1/2] mm/slub: Introduce two counters for the partial objects

2020-07-02 Thread Xunlei Pang
count_partial() holds the node list_lock while iterating, and with a
large number of partial page lists this takes a long time, which can
cause a thundering-herd effect on the list_lock contention; e.g. it
causes business response-time jitter when "/proc/slabinfo" is
accessed in our production environments.

This patch introduces two counters to maintain the actual number
of partial objects dynamically instead of iterating the partial
page lists with list_lock held.

New counters of kmem_cache_node are: pfree_objects, ptotal_objects.
The main updates happen under list_lock in the slow path, so the
performance impact is minimal.

Co-developed-by: Wen Yang 
Signed-off-by: Xunlei Pang 
---
 mm/slab.h |  2 ++
 mm/slub.c | 38 +-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 7e94700..5935749 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -616,6 +616,8 @@ struct kmem_cache_node {
 #ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
+   atomic_long_t pfree_objects; /* partial free objects */
+   atomic_long_t ptotal_objects; /* partial total objects */
 #ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 6589b41..53890f3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1775,10 +1775,24 @@ static void discard_slab(struct kmem_cache *s, struct 
page *page)
 /*
  * Management of partially allocated slabs.
  */
+
+static inline void
+__update_partial_free(struct kmem_cache_node *n, long delta)
+{
+   atomic_long_add(delta, >pfree_objects);
+}
+
+static inline void
+__update_partial_total(struct kmem_cache_node *n, long delta)
+{
+   atomic_long_add(delta, >ptotal_objects);
+}
+
 static inline void
 __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
n->nr_partial++;
+   __update_partial_total(n, page->objects);
if (tail == DEACTIVATE_TO_TAIL)
list_add_tail(>slab_list, >partial);
else
@@ -1798,6 +1812,7 @@ static inline void remove_partial(struct kmem_cache_node 
*n,
lockdep_assert_held(>list_lock);
list_del(>slab_list);
n->nr_partial--;
+   __update_partial_total(n, -page->objects);
 }
 
 /*
@@ -1842,6 +1857,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
return NULL;
 
remove_partial(n, page);
+   __update_partial_free(n, -*objects);
WARN_ON(!freelist);
return freelist;
 }
@@ -2174,8 +2190,11 @@ static void deactivate_slab(struct kmem_cache *s, struct 
page *page,
"unfreezing slab"))
goto redo;
 
-   if (lock)
+   if (lock) {
+   if (m == M_PARTIAL)
+   __update_partial_free(n, page->objects - page->inuse);
spin_unlock(>list_lock);
+   }
 
if (m == M_PARTIAL)
stat(s, tail);
@@ -2241,6 +2260,7 @@ static void unfreeze_partials(struct kmem_cache *s,
discard_page = page;
} else {
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, page->objects - page->inuse);
stat(s, FREE_ADD_PARTIAL);
}
}
@@ -2915,6 +2935,14 @@ static void __slab_free(struct kmem_cache *s, struct 
page *page,
head, new.counters,
"__slab_free"));
 
+   if (!was_frozen && prior) {
+   if (n)
+   __update_partial_free(n, cnt);
+   else
+   __update_partial_free(get_node(s, page_to_nid(page)),
+   cnt);
+   }
+
if (likely(!n)) {
 
/*
@@ -2944,6 +2972,7 @@ static void __slab_free(struct kmem_cache *s, struct page 
*page,
if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
remove_full(s, n, page);
add_partial(n, page, DEACTIVATE_TO_TAIL);
+   __update_partial_free(n, page->objects - page->inuse);
stat(s, FREE_ADD_PARTIAL);
}
spin_unlock_irqrestore(>list_lock, flags);
@@ -2955,6 +2984,7 @@ static void __slab_free(struct kmem_cache *s, struct page 
*page,
 * Slab on the partial list.
 */
remove_partial(n, page);
+   __update_partial_free(n, page->inuse - page->objects);
stat(s, FREE_REMOVE_PARTIAL);
} else {
/* Slab must be on the full list */
@@ -3364,6 +3394,8 @@ static inline int calculate_order(unsigned int size)
n->nr_partial = 0;
spin_lock_init(>list_lock);
INIT_LIST_HEAD(>partial);
+   atomic_long_set(>pfree_objects, 0);
+   atomic_long_set(>ptotal_objects, 0)

Re: [PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()

2019-06-16 Thread Xunlei Pang
Hi Chris,

On 2019/6/16 PM 6:37, Chris Down wrote:
> Hi Xunlei,
> 
> Xunlei Pang writes:
>> docker and various types(different memory capacity) of containers
>> are managed by k8s, it's a burden for k8s to maintain those dynamic
>> figures, simply set "max" to key containers is always welcome.
> 
> Right, setting "max" is generally a fine way of going about it.
> 
>> Set "max" to docker also protects docker cgroup memory(as docker
>> itself has tasks) unnecessarily.
> 
> That's not correct -- leaf memcgs have to _explicitly_ request memory
> protection. From the documentation:
> 
>    memory.low
> 
>    [...]
> 
>    Best-effort memory protection.  If the memory usages of a
>    cgroup and all its ancestors are below their low boundaries,
>    the cgroup's memory won't be reclaimed unless memory can be
>    reclaimed from unprotected cgroups.
> 
> Note the part that the cgroup itself also must be within its low
> boundary, which is not implied simply by having ancestors that would
> permit propagation of protections.
> 
> In this case, Docker just shouldn't request it for those Docker-related
> tasks, and they won't get any. That seems a lot simpler and more
> intuitive than special casing "0" in ancestors.
> 
>> This patch doesn't take effect on any intermediate layer with
>> positive memory.min set, it requires all the ancestors having
>> 0 memory.min to work.
>>
>> Nothing special change, but more flexible to business deployment...
> 
> Not so, this change is extremely "special". It violates the basic
> expectation that 0 means no possibility of propagation of protection,
> and I still don't see a compelling argument why Docker can't just set
> "max" in the intermediate cgroup and not accept any protection in leaf
> memcgs that it doesn't want protection for.

I see the reason now: I'm using cgroup v1 (with a memory.min
backport), which permits tasks to exist in the "docker" cgroup.procs.

For cgroup v2, it's not a problem.

Thanks,
Xunlei


Re: [PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()

2019-06-16 Thread Xunlei Pang
Hi Chris,

On 2019/6/16 AM 12:08, Chris Down wrote:
> Hi Xunlei,
> 
> Xunlei Pang writes:
>> Currently memory.min|low implementation requires the whole
>> hierarchy has the settings, otherwise the protection will
>> be broken.
>>
>> Our hierarchy is kind of like(memory.min value in brackets),
>>
>>   root
>>    |
>>     docker(0)
>>  /    \
>>     c1(max)   c2(0)
>>
>> Note that "docker" doesn't set memory.min. When kswapd runs,
>> mem_cgroup_protected() returns "0" emin for "c1" due to "0"
>> @parent_emin of "docker", as a result "c1" gets reclaimed.
>>
>> But it's hard to maintain parent's "memory.min" when there're
>> uncertain protected children because only some important types
>> of containers need the protection.  Further, control tasks
>> belonging to parent constantly reproduce trivial memory which
>> should not be protected at all.  It makes sense to ignore
>> unprotected parent in this scenario to achieve the flexibility.
> 
> I'm really confused by this, why don't you just set memory.{min,low} in
> the docker cgroup and only propagate it to the children that want it?
> 
> If you only want some children to have the protection, only request it
> in those children, or create an additional intermediate layer of the
> cgroup hierarchy with protections further limited if you don't trust the
> task to request the right amount.
> 
> Breaking the requirement for hierarchical propagation of protections
> seems like a really questionable API change, not least because it makes
> it harder to set systemwide policies about the constraints of
> protections within a subtree.

docker and various types of containers (with different memory
capacities) are managed by k8s; it's a burden for k8s to maintain
those dynamic figures, so simply setting "max" on the key containers
is always welcome.

Setting "max" on docker also protects the docker cgroup's own memory
(as docker itself has tasks) unnecessarily.

This patch doesn't take effect on any intermediate layer with a
positive memory.min set; it requires all the ancestors to have a
memory.min of 0 in order to work.

It's not a big change, but it is more flexible for business
deployment...


Re: [PATCH] psi: Don't account force reclaim as memory pressure

2019-06-15 Thread Xunlei Pang
Hi Chris,

On 2019/6/15 PM 11:58, Chris Down wrote:
> Hi Xunlei,
> 
> Xunlei Pang writes:
>> There're several cases like resize and force_empty that don't
>> need to account to psi, otherwise is misleading.
> 
> I'm afraid I'm quite confused by this patch. Why do you think accounting
> for force reclaim in PSI is misleading? I completely expect that force
> reclaim should still be accounted for as memory pressure, can you
> present some reason why it shouldn't be?

We expect psi to represent negative factors that affect
applications' response time, but force reclaims are behaviours
triggered on purpose, like "/proc/sys/vm/drop_caches", not real
negative pressure.

For example, my module force-reclaims dead memcgs: there is no
application attached to them, and their memory (page caches) is
usually useless. Force-reclaiming them doesn't mean the system or
the parent memcg is under memory pressure; in fact the whole system
or the parent memcg may have plenty of free memory. If the force
reclaim causes further memory pressure, like hot page cache misses,
then the workingset refault psi will catch that.
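
To make the intent concrete, the accounting side presumably ends up
gated on the new flag, roughly like this (a sketch only; the relevant
hunk of the patch below is truncated in this archive):

/*
 * try_to_free_mem_cgroup_pages(), sketch: skip psi accounting for
 * reclaim that was triggered on purpose (resize, force_empty, ...).
 */
unsigned long pflags;

if (!force_reclaim)
	psi_memstall_enter(&pflags);

nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

if (!force_reclaim)
	psi_memstall_leave(&pflags);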


[PATCH] psi: Don't account force reclaim as memory pressure

2019-06-15 Thread Xunlei Pang
There are several cases, like resize and force_empty, that don't
need to be accounted to psi; doing so is misleading.

We also have a module that reclaims dying memcgs in the background
to avoid accumulating too many dead memcgs, which can cause lots of
trouble; without this patch it makes the psi inaccuracy even worse.

Signed-off-by: Xunlei Pang 
---
 include/linux/swap.h |  3 ++-
 mm/memcontrol.c  | 13 +++--
 mm/vmscan.c  |  9 ++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..74b5443877d4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -354,7 +354,8 @@ extern int __isolate_lru_page(struct page *page, 
isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
  unsigned long nr_pages,
  gfp_t gfp_mask,
- bool may_swap);
+ bool may_swap,
+ bool force_reclaim);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1dfa651f55d..f4ec57876ada 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2237,7 +2237,8 @@ static void reclaim_high(struct mem_cgroup *memcg,
if (page_counter_read(>memory) <= memcg->high)
continue;
memcg_memory_event(memcg, MEMCG_HIGH);
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+   try_to_free_mem_cgroup_pages(memcg, nr_pages,
+   gfp_mask, true, false);
} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -2330,7 +2331,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
memcg_memory_event(mem_over_limit, MEMCG_MAX);
 
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-   gfp_mask, may_swap);
+gfp_mask, may_swap, false);
 
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
@@ -2860,7 +2861,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}
 
if (!try_to_free_mem_cgroup_pages(memcg, 1,
-   GFP_KERNEL, !memsw)) {
+   GFP_KERNEL, !memsw, true)) {
ret = -EBUSY;
break;
}
@@ -2993,7 +2994,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup 
*memcg)
return -EINTR;
 
progress = try_to_free_mem_cgroup_pages(memcg, 1,
-   GFP_KERNEL, true);
+   GFP_KERNEL, true, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -5549,7 +5550,7 @@ static ssize_t memory_high_write(struct kernfs_open_file 
*of,
nr_pages = page_counter_read(&memcg->memory);
if (nr_pages > high)
try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-GFP_KERNEL, true);
+GFP_KERNEL, true, true);
 
memcg_wb_domain_size_changed(memcg);
return nbytes;
@@ -5596,7 +5597,7 @@ static ssize_t memory_max_write(struct kernfs_open_file 
*of,
 
if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, true))
+   GFP_KERNEL, true, true))
nr_reclaims--;
continue;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7acd0afdfc2a..3831848fca5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3212,7 +3212,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
*memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
   unsigned long nr_pages,
   gfp_t gfp_mask,
-  bool may_swap)
+  bool may_swap,
+  bool force_reclaim)
 {
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -3243,13 +3244,15 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
mem_cgroup *memcg,
 
trace_mm_vmscan_memcg_reclaim_be

[PATCH] memcg: Ignore unprotected parent in mem_cgroup_protected()

2019-06-15 Thread Xunlei Pang
Currently the memory.min|low implementation requires the whole
hierarchy to have the settings, otherwise the protection is
broken.

Our hierarchy looks like this (memory.min value in brackets):

          root
           |
        docker(0)
        /       \
   c1(max)     c2(0)

Note that "docker" doesn't set memory.min. When kswapd runs,
mem_cgroup_protected() returns a "0" emin for "c1" due to the "0"
@parent_emin of "docker"; as a result "c1" gets reclaimed.

But it's hard to maintain the parent's "memory.min" when the set of
protected children is uncertain, because only some important types
of containers need the protection.  Further, control tasks
belonging to the parent constantly produce trivial memory which
should not be protected at all.  It makes sense to ignore an
unprotected parent in this scenario to achieve the flexibility.

In order not to break the previous hierarchical behaviour, the
parent is only ignored when there's no protected ancestor further
up the hierarchy.

Signed-off-by: Xunlei Pang 
---
 include/linux/page_counter.h |  2 ++
 mm/memcontrol.c  |  5 +
 mm/page_counter.c| 24 
 3 files changed, 31 insertions(+)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57f659b..aed7ed28b458 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -55,6 +55,8 @@ bool page_counter_try_charge(struct page_counter *counter,
 void page_counter_uncharge(struct page_counter *counter, unsigned long 
nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long 
nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long 
nr_pages);
+bool page_counter_has_min(struct page_counter *counter);
+bool page_counter_has_low(struct page_counter *counter);
 int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_memparse(const char *buf, const char *max,
  unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ca0bc6e6be13..f1dfa651f55d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5917,6 +5917,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
if (parent == root)
goto exit;
 
+   if (!page_counter_has_min(&parent->memory))
+   goto elow;
parent_emin = READ_ONCE(parent->memory.emin);
emin = min(emin, parent_emin);
if (emin && parent_emin) {
@@ -5931,6 +5933,9 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
   siblings_min_usage);
}
 
+elow:
+   if (!page_counter_has_low(&parent->memory))
+   goto exit;
parent_elow = READ_ONCE(parent->memory.elow);
elow = min(elow, parent_elow);
if (elow && parent_elow) {
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..8c668eae2af5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -202,6 +202,30 @@ int page_counter_set_max(struct page_counter *counter, 
unsigned long nr_pages)
}
 }
 
+bool page_counter_has_min(struct page_counter *counter)
+{
+   struct page_counter *c;
+
+   for (c = counter; c; c = c->parent) {
+   if (c->min)
+   return true;
+   }
+
+   return false;
+}
+
+bool page_counter_has_low(struct page_counter *counter)
+{
+   struct page_counter *c;
+
+   for (c = counter; c; c = c->parent) {
+   if (c->low)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  * page_counter_set_min - set the amount of protected memory
  * @counter: counter
-- 
2.14.4.44.g2045bb6
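
For illustration only (a toy userspace model, not kernel code): the
effect of the new check on the docker(0)/c1(max) example above,
collapsing the ancestor walk of page_counter_has_min() into a single
parent value:

	#include <stdio.h>

	static unsigned long effective_min(unsigned long own_min,
					   unsigned long parent_emin,
					   int ignore_unprotected_parent)
	{
		unsigned long emin = own_min;

		/* new rule: an ancestor chain with no memory.min set is skipped */
		if (ignore_unprotected_parent && parent_emin == 0)
			return emin;

		/* old behaviour: always clamp by the parent's effective min */
		if (emin > parent_emin)
			emin = parent_emin;
		return emin;
	}

	int main(void)
	{
		unsigned long c1_min = ~0UL;	/* c1 has memory.min = "max" */
		unsigned long docker_emin = 0;	/* docker has no memory.min */

		/* old: prints 0, so c1 loses its protection and gets reclaimed */
		printf("old: c1 emin = %lu\n", effective_min(c1_min, docker_emin, 0));
		/* new: prints ULONG_MAX, i.e. "max": c1 stays protected */
		printf("new: c1 emin = %lu\n", effective_min(c1_min, docker_emin, 1));
		return 0;
	}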



Re: [PATCH] sched/fair: don't push cfs_bandwith slack timers forward

2019-06-06 Thread Xunlei Pang
On 2019/6/6 AM 4:06, bseg...@google.com wrote:
> When a cfs_rq sleeps and returns its quota, we delay for 5ms before
> waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
> sleep, as this has has to be done outside of the rq lock we hold.

two "has".

> 
> The current code waits for 5ms without any sleeps, instead of waiting
> for 5ms from the first sleep, which can delay the unthrottle more than
> we want. Switch this around so that we can't push this forward forever.
> 
> This requires an extra flag rather than using hrtimer_active, since we
> need to start a new timer if the current one is in the process of
> finishing.
> 
> Signed-off-by: Ben Segall 
> ---

We've also suffered from this performance issue recently:
Reviewed-by: Xunlei Pang 

>  kernel/sched/fair.c  | 7 +++
>  kernel/sched/sched.h | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8213ff6e365d..2ead252cfa32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct 
> cfs_bandwidth *cfs_b)
>   if (runtime_refresh_within(cfs_b, min_left))
>   return;
>  
> + /* don't push forwards an existing deferred unthrottle */
> + if (cfs_b->slack_started)
> + return;
> + cfs_b->slack_started = true;
> +
>   hrtimer_start(&cfs_b->slack_timer,
>   ns_to_ktime(cfs_bandwidth_slack_period),
>   HRTIMER_MODE_REL);
> @@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct 
> cfs_bandwidth *cfs_b)
>  
>   /* confirm we're still not at a refresh boundary */
>   raw_spin_lock_irqsave(&cfs_b->lock, flags);
> + cfs_b->slack_started = false;
>   if (cfs_b->distribute_running) {
>   raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>   return;
> @@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>   hrtimer_init(_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>   cfs_b->slack_timer.function = sched_cfs_slack_timer;
>   cfs_b->distribute_running = 0;
> + cfs_b->slack_started = false;
>  }
>  
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efa686eeff26..60219acda94b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -356,6 +356,7 @@ struct cfs_bandwidth {
>   u64 throttled_time;
>  
>   booldistribute_running;
> + boolslack_started;
>  #endif
>  };
>  
> 


Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-05 Thread Xunlei Pang
Hi Roman,

On 2018/12/4 AM 2:00, Roman Gushchin wrote:
> On Mon, Dec 03, 2018 at 04:01:17PM +0800, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min other than 0.
>> Apply the same for low.
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/page_counter.c | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  return;
>>  
>> if (c->min || atomic_long_read(&c->min_usage)) {
>> -if (usage <= c->min)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->min);
> 
> This change makes sense in the combination with the patch 3, but not as a
> standlone "fix". It's not a bug, it's a required thing unless you start 
> scanning
> proportionally to memory.low/min excess.
> 
> Please, reflect this in the commit message. Or, even better, merge it into
> the patch 3.

The more I looked at it, the more I think it's a bug, but anyway I'm
fine with merging it into patch 3 :-)

> 
> Also, please, make sure that cgroup kselftests are passing after your changes.

Sure, will do and send v2. Thanks for your inputs.


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-04 Thread Xunlei Pang
On 2018/12/4 PM 3:25, Michal Hocko wrote:
> On Tue 04-12-18 10:40:29, Xunlei Pang wrote:
>> On 2018/12/4 AM 1:22, Michal Hocko wrote:
>>> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>>>> On 2018/12/3 下午7:56, Michal Hocko wrote:
>>>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>>>> There may be cgroup memory overcommitment, it will become
>>>>>> even common in the future.
>>>>>>
>>>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>>>> of memory pressure, to mitigate the global direct reclaim
>>>>>> pressures which could cause jitters to the response time of
>>>>>> lantency-sensitive groups.
>>>>>
>>>>> Please be more descriptive about the problem you are trying to handle
>>>>> here. I haven't actually read the patch but let me emphasise that the
>>>>> low limit protection is important isolation tool. And allowing kswapd to
>>>>> reclaim protected memcgs is going to break the semantic as it has been
>>>>> introduced and designed.
>>>>
>>>> We have two types of memcgs: online groups(important business)
>>>> and offline groups(unimportant business). Online groups are
>>>> all configured with MAX low protection, while offline groups
>>>> are not at all protected(with default 0 low).
>>>>
>>>> When offline groups are overcommitted, the global memory pressure
>>>> suffers. This will cause the memory allocations from online groups
>>>> constantly go to the slow global direct reclaim in order to reclaim
>>>> online's page caches, as kswap is not able to reclaim low-protection
>>>> memory. low is not hard limit, it's reasonable to be reclaimed by
>>>> kswapd if there's no other reclaimable memory.
>>>
>>> I am sorry I still do not follow. What role do offline cgroups play.
>>> Those are certainly not low mem protected because mem_cgroup_css_offline
>>> will reset them to 0.
>>>
>>
>> Oh, I meant "offline groups" to be "offline-business groups", memcgs
>> refered to here are all "online state" from kernel's perspective.
> 
> What is offline-business group? Please try to explain the actual problem
> in much more details and do not let us guess.
> 

Maybe I chose the wrong word, let me rephrase it; here is an
example.

                   root (200GB)
                  /            \
    important (100GB)      unimportant (100GB+DYNAMIC)
     /    |    \             /               \
 docker0 docker1 ...    normal (100GB)   oversold (DYNAMIC)
                         /  |  \            /  |  \
                        j0  j1 ...         w0  w1 ...

"DYNAMIC" is controlled by the cluster job scheduler: it
periodically samples the available system memory (/proc/meminfo
"MemAvailable") and uses part of that to launch oversold jobs
under some special conditions. When "oversold" is active, the
whole system is put under heavy global memory pressure even
though the individual memcgs are not.

IOW "DYNAMIC" is primarily memory borrowed temporarily from the
"dockers"; oversold workers will be killed in a timely fashion if
the "dockers" suddenly need their memory back, which is rare.

If kswapd doesn't reclaim the low-protected memory configured for
the "important" dockers, memory allocations from the dockers will
constantly fall into global direct reclaim, which harms their
performance and response time. The inactive caches of the dockers
are allowed to be reclaimed even though they are low-protected (we
used a simple MAX setting); we allow the inactive low-protected
memory to be reclaimed immediately and asynchronously as long as
there's no unprotected reclaimable memory. It's also friendly to
disk IO.

For a really latency-sensitive docker, memory.min is supposed to
be used to guarantee its memory QoS.

Thanks


Re: [PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection

2018-12-03 Thread Xunlei Pang
On 2018/12/3 PM 7:57, Michal Hocko wrote:
> On Mon 03-12-18 16:01:19, Xunlei Pang wrote:
>> When memcgs get reclaimed after its usage exceeds min, some
>> usages below the min may also be reclaimed in the current
>> implementation, the amount is considerably large during kswapd
>> reclaim according to my ftrace results.
> 
> And here again. Describe the setup and the behavior please?
> 

step 1
mkdir -p /sys/fs/cgroup/memory/online
cd /sys/fs/cgroup/memory/online
echo 512M > memory.max
echo 40960 > memory.min
echo $$ > tasks
dd if=/dev/sda of=/dev/null


while true; do sleep 1; cat memory.current ; cat memory.min; done


step 2
create global memory pressure by allocating anonymous and cached
pages to constantly trigger kswapd: dd if=/dev/sdb of=/dev/null

step 3
Then observe the "online" group: being just hundreds of kilobytes over
memory.min can cause tens of MiB to be reclaimed by kswapd.

Here is one of the test results I got:
cat memory.current; cat memory.min; echo;
409485312   // current
40960   // min

385052672   // current got over-reclaimed by about 23MB
40960   // min

Its corresponding ftrace output I monitored:
kswapd_0-281   [000]    304.706632: shrink_node_memcg:
min_excess=24, nr_reclaimed=6013, sc->nr_to_reclaim=147, exceeds
5989pages
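
As a sanity check on those numbers (worked arithmetic, assuming 4 KiB
pages): 6013 - 24 = 5989 pages reclaimed beyond the allowed excess, and

	5989 * 4096           = 24,530,944 bytes  ~= 23.4 MiB
	409485312 - 385052672 = 24,432,640 bytes  ~= 23.3 MiB

which roughly matches the ~23MB drop in memory.current above.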


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang
On 2018/12/4 AM 1:22, Michal Hocko wrote:
> On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
>> On 2018/12/3 下午7:56, Michal Hocko wrote:
>>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>>>> There may be cgroup memory overcommitment, it will become
>>>> even common in the future.
>>>>
>>>> Let's enable kswapd to reclaim low-protected memory in case
>>>> of memory pressure, to mitigate the global direct reclaim
>>>> pressures which could cause jitters to the response time of
>>>> lantency-sensitive groups.
>>>
>>> Please be more descriptive about the problem you are trying to handle
>>> here. I haven't actually read the patch but let me emphasise that the
>>> low limit protection is important isolation tool. And allowing kswapd to
>>> reclaim protected memcgs is going to break the semantic as it has been
>>> introduced and designed.
>>
>> We have two types of memcgs: online groups(important business)
>> and offline groups(unimportant business). Online groups are
>> all configured with MAX low protection, while offline groups
>> are not at all protected(with default 0 low).
>>
>> When offline groups are overcommitted, the global memory pressure
>> suffers. This will cause the memory allocations from online groups
>> constantly go to the slow global direct reclaim in order to reclaim
>> online's page caches, as kswap is not able to reclaim low-protection
>> memory. low is not hard limit, it's reasonable to be reclaimed by
>> kswapd if there's no other reclaimable memory.
> 
> I am sorry I still do not follow. What role do offline cgroups play.
> Those are certainly not low mem protected because mem_cgroup_css_offline
> will reset them to 0.
> 

Oh, by "offline groups" I meant "offline-business groups"; the memcgs
referred to here are all in the "online" state from the kernel's
perspective.


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang
On 2018/12/3 下午7:56, Michal Hocko wrote:
> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
>> There may be cgroup memory overcommitment, it will become
>> even common in the future.
>>
>> Let's enable kswapd to reclaim low-protected memory in case
>> of memory pressure, to mitigate the global direct reclaim
>> pressures which could cause jitters to the response time of
>> lantency-sensitive groups.
> 
> Please be more descriptive about the problem you are trying to handle
> here. I haven't actually read the patch but let me emphasise that the
> low limit protection is important isolation tool. And allowing kswapd to
> reclaim protected memcgs is going to break the semantic as it has been
> introduced and designed.

We have two types of memcgs: online groups (important business)
and offline groups (unimportant business). Online groups are
all configured with MAX low protection, while offline groups
are not protected at all (with the default low of 0).

When offline groups are overcommitted, the global memory pressure
suffers. This causes memory allocations from the online groups to
constantly fall into the slow global direct reclaim in order to
reclaim online's page caches, as kswapd is not able to reclaim
low-protected memory. low is not a hard limit, so it's reasonable
for it to be reclaimed by kswapd if there's no other reclaimable
memory.

> 
>>
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/vmscan.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 62ac0c488624..3d412eb91f73 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
>> int classzone_idx)
>>  
>>  count_vm_event(PAGEOUTRUN);
>>  
>> +retry:
>>  do {
>>  unsigned long nr_reclaimed = sc.nr_reclaimed;
>>  bool raise_priority = true;
>> @@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
>> int classzone_idx)
>>  sc.priority--;
>>  } while (sc.priority >= 1);
>>  
>> +if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
>> +sc.priority = DEF_PRIORITY;
>> +sc.memcg_low_reclaim = 1;
>> +sc.memcg_low_skipped = 0;
>> +goto retry;
>> +}
>> +
>>  if (!sc.nr_reclaimed)
>>  pgdat->kswapd_failures++;
>>  
>> -- 
>> 2.13.5 (Apple Git-94)
>>
> 


Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-03 Thread Xunlei Pang
On 2018/12/3 下午7:54, Michal Hocko wrote:
> On Mon 03-12-18 16:01:17, Xunlei Pang wrote:
>> When usage exceeds min, min usage should be min other than 0.
>> Apply the same for low.
> 
> Why? What is the actual problem.

children_min_usage tracks the total children usage under min;
it's natural that min should be added into children_min_usage
when the usage is above min. I can't imagine why 0 is added, is
there some special history I missed?

See mem_cgroup_protected(): when usage exceeds min, emin is
calculated as "parent_emin*min/children_min_usage".
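
To make the difference concrete, a worked example with made-up numbers
(not taken from the thread), following the formula above:

	/*
	 * parent_emin = 100 pages; two children, each with min = 60 and
	 * usage = 80, i.e. both above their own min.
	 *
	 * old propagate_protected_usage(): usage > min => protected = 0,
	 * so children_min_usage = 0, the proportional division is skipped
	 * and each child keeps emin = min(60, parent_emin) = 60, claiming
	 * 120 pages of protection in total against a parent emin of 100.
	 *
	 * patched: protected = min(usage, min) = 60 per child, so
	 * children_min_usage = 120 and the parent's protection is split
	 * proportionally:
	 */
	unsigned long parent_emin = 100, min_usage = 60, siblings_min_usage = 120;
	unsigned long emin = parent_emin * min_usage / siblings_min_usage;	/* = 50 each */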

> 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  mm/page_counter.c | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/page_counter.c b/mm/page_counter.c
>> index de31470655f6..75d53f15f040 100644
>> --- a/mm/page_counter.c
>> +++ b/mm/page_counter.c
>> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  return;
>>  
>>  if (c->min || atomic_long_read(&c->min_usage)) {
>> -if (usage <= c->min)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->min);
>>  old_protected = atomic_long_xchg(&c->min_usage, protected);
>>  delta = protected - old_protected;
>>  if (delta)
>> @@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter 
>> *c,
>>  }
>>  
>>  if (c->low || atomic_long_read(&c->low_usage)) {
>> -if (usage <= c->low)
>> -protected = usage;
>> -else
>> -protected = 0;
>> -
>> +protected = min(usage, c->low);
>>  old_protected = atomic_long_xchg(&c->low_usage, protected);
>>  delta = protected - old_protected;
>>  if (delta)
>> -- 
>> 2.13.5 (Apple Git-94)
>>
> 


[PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Xunlei Pang
There may be cgroup memory overcommitment, and it will become
even more common in the future.

Let's enable kswapd to reclaim low-protected memory in case
of memory pressure, to mitigate the global direct reclaim
pressure which could cause jitters in the response time of
latency-sensitive groups.

Signed-off-by: Xunlei Pang 
---
 mm/vmscan.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..3d412eb91f73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
 
count_vm_event(PAGEOUTRUN);
 
+retry:
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
@@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
sc.priority--;
} while (sc.priority >= 1);
 
+   if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
+   sc.priority = DEF_PRIORITY;
+   sc.memcg_low_reclaim = 1;
+   sc.memcg_low_skipped = 0;
+   goto retry;
+   }
+
if (!sc.nr_reclaimed)
pgdat->kswapd_failures++;
 
-- 
2.13.5 (Apple Git-94)
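
For context, the retry added above only triggers when shrink_node() has
skipped low-protected memcgs; that flag is set in the memcg iteration,
roughly like this (paraphrased from the kernels of that era, details may
differ):

	/* inside shrink_node()'s memcg loop */
	switch (mem_cgroup_protected(root, memcg)) {
	case MEMCG_PROT_MIN:
		/* hard protection: never reclaim from this memcg */
		continue;
	case MEMCG_PROT_LOW:
		if (!sc->memcg_low_reclaim) {
			/* remember we skipped someone; kswapd can retry */
			sc->memcg_low_skipped = 1;
			continue;
		}
		memcg_memory_event(memcg, MEMCG_LOW);
		break;
	case MEMCG_PROT_NONE:
		break;
	}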



[PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection

2018-12-03 Thread Xunlei Pang
When a memcg gets reclaimed after its usage exceeds min, some
usage below the min may also be reclaimed in the current
implementation; the amount is considerably large during kswapd
reclaim according to my ftrace results.

This patch calculates the part over the hard protection limit,
and allows only this part of the usage to be reclaimed.

Signed-off-by: Xunlei Pang 
---
 include/linux/memcontrol.h |  7 +--
 mm/memcontrol.c|  9 +++--
 mm/vmscan.c| 17 +++--
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab2120155a4..637ef975792f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -334,7 +334,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-   struct mem_cgroup *memcg);
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -818,7 +819,9 @@ static inline void memcg_memory_event_mm(struct mm_struct 
*mm,
 }
 
 static inline enum mem_cgroup_protection mem_cgroup_protected(
-   struct mem_cgroup *root, struct mem_cgroup *memcg)
+   struct mem_cgroup *root,
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess)
 {
return MEMCG_PROT_NONE;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..ca96f68e07a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5694,6 +5694,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * mem_cgroup_protected - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @min_excess: store the number of pages exceeding hard protection
  *
  * WARNING: This function is not stateless! It can only be used as part
  *  of a top-down tree iteration, not for isolated queries.
@@ -5761,7 +5762,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
-   struct mem_cgroup *memcg)
+   struct mem_cgroup *memcg,
+   unsigned long *min_excess)
 {
struct mem_cgroup *parent;
unsigned long emin, parent_emin;
@@ -5827,8 +5829,11 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
return MEMCG_PROT_MIN;
else if (usage <= elow)
return MEMCG_PROT_LOW;
-   else
+   else {
+   if (emin)
+   *min_excess = usage - emin;
return MEMCG_PROT_NONE;
+   }
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3d412eb91f73..e4fa7a2a63d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -66,6 +66,9 @@ struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
 
+   /* How many pages hard protection allows */
+   unsigned long min_excess;
+
/*
 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 * are scanned.
@@ -2503,10 +2506,14 @@ static void shrink_node_memcg(struct pglist_data 
*pgdat, struct mem_cgroup *memc
unsigned long nr_to_scan;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
-   unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+   unsigned long nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;
 
+   nr_to_reclaim = sc->nr_to_reclaim;
+   if (sc->min_excess)
+   nr_to_reclaim = min(nr_to_reclaim, sc->min_excess);
+
get_scan_count(lruvec, memcg, sc, nr, lru_pages);
 
/* Record the original scan target for proportional adjustments later */
@@ -2544,6 +2551,10 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
 
cond_resched();
 
+   /* Abort proportional reclaim when hard protection applies */
+   if (sc->min_excess && nr_reclaimed >= sc->min_excess)
+   break;
+
if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;
 
@@ -2725,8 +2736,9 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;
+   unsigned long excess = 0;
 
- 
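
The shrink_node() hunk is truncated here; presumably the per-memcg
excess is collected from mem_cgroup_protected() and handed down to
shrink_node_memcg() via the new sc->min_excess field, roughly like this
sketch (an assumption, not the posted code):

	/* inside shrink_node()'s memcg loop */
	excess = 0;
	switch (mem_cgroup_protected(root, memcg, &excess)) {
	case MEMCG_PROT_MIN:
		continue;		/* fully protected: skip entirely */
	case MEMCG_PROT_LOW:
	case MEMCG_PROT_NONE:
		break;			/* low handling as before */
	}

	sc->min_excess = excess;	/* 0 means no hard protection set */
	shrink_node_memcg(pgdat, memcg, sc, &lru_pages);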


[PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-03 Thread Xunlei Pang
When usage exceeds min, the propagated min usage should be min
rather than 0. Apply the same to low.

Signed-off-by: Xunlei Pang 
---
 mm/page_counter.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
return;
 
if (c->min || atomic_long_read(&c->min_usage)) {
-   if (usage <= c->min)
-   protected = usage;
-   else
-   protected = 0;
-
+   protected = min(usage, c->min);
old_protected = atomic_long_xchg(&c->min_usage, protected);
delta = protected - old_protected;
if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
}
 
if (c->low || atomic_long_read(&c->low_usage)) {
-   if (usage <= c->low)
-   protected = usage;
-   else
-   protected = 0;
-
+   protected = min(usage, c->low);
old_protected = atomic_long_xchg(&c->low_usage, protected);
delta = protected - old_protected;
if (delta)
-- 
2.13.5 (Apple Git-94)



Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-31 Thread Xunlei Pang
On 8/1/18 4:55 AM, Cong Wang wrote:
> On Tue, Jul 31, 2018 at 10:13 AM  wrote:
>>
>> Xunlei Pang  writes:
>>
>>> On 7/31/18 1:55 AM, Cong Wang wrote:
>>>> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang  
>>>> wrote:
>>>>>
>>>>> Hi Cong,
>>>>>
>>>>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>>>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>>>>> we should sync its ->expires_seq too. However it is missing
>>>>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>>>>
>>>>> I don't think it's a problem, as expires_seq will get synced in
>>>>> assign_cfs_rq_runtime().
>>>>
>>>> Sure, but there is a small window during which they are not synced.
>>>> Why do you want to wait until the next assign_cfs_rq_runtime() when
>>>> you already know runtime_expires is synced?
>>>>
>>>> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
>>>> inside __account_cfs_rq_runtime(), which means the check of
>>>> cfs_rq->expires_seq is not accurate for unthrottling case if the clock
>>>> drift happens soon enough?
>>>>
>>>
>>> expire_cfs_rq_runtime():
>>> if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>> /* extend local deadline, drift is bounded above by 2 ticks */
>>> cfs_rq->runtime_expires += TICK_NSEC;
>>> } else {
>>> /* global deadline is ahead, expiration has passed */
>>> cfs_rq->runtime_remaining = 0;
>>> }
>>>
>>> So if clock drift happens soon, then expires_seq decides the correct
>>> thing we should do: if cfs_b->expires_seq advanced, then clear the stale
>>> cfs_rq->runtime_remaining from the slack timer of the past period, then
>>> assign_cfs_rq_runtime() will refresh them afterwards, otherwise it is a
>>> real clock drift. I am still not getting where the race is?
> 
> But expires_seq is supposed to be the same here, after
> distribute_cfs_runtime(), therefore runtime_remaining is not supposed
> to be cleared.
> 
> Which part do I misunderstand? expires_seq should not be same here?
> Or you are saying a wrongly clear of runtime_remaning is fine?
> 

Let's see the unthrottle cases.
1. for the periodic timer
distribute_cfs_runtime updates the throttled cfs_rq->runtime_expires to
be a new value, so expire_cfs_rq_runtime does nothing because of:
  rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires < 0

Afterwards assign_cfs_rq_runtime() will sync its expires_seq.

2. for the slack timer
the two expires_seq should be the same, so if clock drift happens soon,
expire_cfs_rq_runtime regards it as true clock drift:
  cfs_rq->runtime_expires += TICK_NSEC
If it happens that the global expires_seq has advanced, it also doesn't
matter: expire_cfs_rq_runtime will clear the stale
cfs_rq->runtime_remaining as expected.

> 
>>
>> Nothing /important/ goes wrong because distribute_cfs_runtime only fills
>> runtime_remaining up to 1, not a real amount.
> 
> No, runtime_remaining is updated right before expire_cfs_rq_runtime():
> 
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> expire_cfs_rq_runtime(cfs_rq);
> 
> so almost certainly it can't be 1.

I think Ben means that it first gets a distribution of 1 to run after
unthrottling; soon it will have a negative runtime_remaining and go
to assign_cfs_rq_runtime().
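
For reference, this is roughly the distribution logic being discussed
(paraphrased from the kernels of that era, details may differ):

	/* inside distribute_cfs_runtime(), for each throttled cfs_rq */
	runtime = -cfs_rq->runtime_remaining + 1;	/* just enough to reach 1 */
	if (runtime > remaining)
		runtime = remaining;
	remaining -= runtime;

	cfs_rq->runtime_remaining += runtime;		/* at most 1 afterwards */
	cfs_rq->runtime_expires = expires;

	/*
	 * The next __account_cfs_rq_runtime() drives runtime_remaining
	 * negative again and calls assign_cfs_rq_runtime(), which syncs
	 * expires_seq.
	 */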

Thanks,
Xunlei

> 
> Which means the following check could be passed:
> 
>  4655 if (cfs_rq->runtime_remaining < 0)
>  4656 return;
> 
> therefore we are reaching the clock drift logic code inside
> expire_cfs_rq_runtime()
> where expires_seq is supposed to be same as they should be sync'ed.
> Therefore without patch, we wrongly clear the runtime_remainng?
> 
> Thanks.
> 


Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-31 Thread Xunlei Pang
On 7/31/18 1:55 AM, Cong Wang wrote:
> On Sun, Jul 29, 2018 at 10:29 PM Xunlei Pang  wrote:
>>
>> Hi Cong,
>>
>> On 7/28/18 8:24 AM, Cong Wang wrote:
>>> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
>>> we should sync its ->expires_seq too. However it is missing
>>> for distribute_cfs_runtime(), especially the slack timer call path.
>>
>> I don't think it's a problem, as expires_seq will get synced in
>> assign_cfs_rq_runtime().
> 
> Sure, but there is a small window during which they are not synced.
> Why do you want to wait until the next assign_cfs_rq_runtime() when
> you already know runtime_expires is synced?
> 
> Also, expire_cfs_rq_runtime() is called before assign_cfs_rq_runtime()
> inside __account_cfs_rq_runtime(), which means the check of
> cfs_rq->expires_seq is not accurate for unthrottling case if the clock
> drift happens soon enough?
> 

expire_cfs_rq_runtime():
if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
/* global deadline is ahead, expiration has passed */
cfs_rq->runtime_remaining = 0;
}

So if clock drift happens soon, expires_seq decides the correct thing
to do: if cfs_b->expires_seq has advanced, clear the stale
cfs_rq->runtime_remaining left over from the slack timer of the past
period and let assign_cfs_rq_runtime() refresh it afterwards; otherwise
it is a real clock drift. I am still not getting where the race is?


Re: [PATCH] sched/fair: sync expires_seq in distribute_cfs_runtime()

2018-07-29 Thread Xunlei Pang
Hi Cong,

On 7/28/18 8:24 AM, Cong Wang wrote:
> Each time we sync cfs_rq->runtime_expires with cfs_b->runtime_expires,
> we should sync its ->expires_seq too. However it is missing
> for distribute_cfs_runtime(), especially the slack timer call path.

I don't think it's a problem, as expires_seq will get synced in
assign_cfs_rq_runtime().

Thanks,
Xunlei

> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Cc: Xunlei Pang 
> Cc: Ben Segall 
> Cc: Linus Torvalds 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Signed-off-by: Cong Wang 
> ---
>  kernel/sched/fair.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f0a0be4d344..910c50db3d74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4857,7 +4857,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  }
>  
>  static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +		u64 remaining, u64 expires, int expires_seq)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4880,6 +4880,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  
>  		cfs_rq->runtime_remaining += runtime;
>  		cfs_rq->runtime_expires = expires;
> +		cfs_rq->expires_seq = expires_seq;
>  
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4905,7 +4906,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
>  	u64 runtime, runtime_expires;
> -	int throttled;
> +	int throttled, expires_seq;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
>  	if (cfs_b->quota == RUNTIME_INF)
> @@ -4933,6 +4934,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  	cfs_b->nr_throttled += overrun;
>  
>  	runtime_expires = cfs_b->runtime_expires;
> +	expires_seq = cfs_b->expires_seq;
>  
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
> @@ -4946,7 +4948,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  		raw_spin_unlock(&cfs_b->lock);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
>  		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +						 runtime_expires, expires_seq);
>  		raw_spin_lock(&cfs_b->lock);
>  
>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> @@ -5055,6 +5057,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> +	int expires_seq;
>  	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
> @@ -5068,12 +5071,13 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  		runtime = cfs_b->runtime;
>  
>  	expires = cfs_b->runtime_expires;
> +	expires_seq = cfs_b->expires_seq;
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime, expires, expires_seq);
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	if (expires == cfs_b->runtime_expires)
> 


Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-24 Thread Xunlei Pang
On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>>  cat-20755 [022] d...  1370.106496: cputime_adjust: task
>> tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
>>  cat-20755 [022] d...  1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>>  cat-20773 [005] d...  1371.109825: cputime_adjust: task
>> tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
>>  cat-20773 [005] d...  1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries(all the data below is in ns):
>> task tick-based utime: 36256000->36256700 increased 700
>> task tick-based stime: 255100  ->354700   increased 99600
>> scheduler rtime:   333060702626->334063718912 increased 1003016286
>>
>> The application actually runs almost 100%sys at the moment, we can
>> use the task tick-based utime and stime increased to double check:
>> 99600/(700+99600) > 99%sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> But for the current cputime_adjust(), we get the following adjusted
>> utime and stime increase in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254  ->3236489210   increased 909421956
>>
>> so 909421956/(93594330+909421956)=91%sys as the shell script shows above.
>>
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust() always
>> passes the whole times to scale_stime() to split the whole utime and
>> stime. In this patch, we pass all the increased deltas in 1) within
>> user's sample period to scale_stime() instead and accumulate the
>> corresponding results to the previous saved adjusted utime and stime,
>> so guarantee the accurate usr and sys increase within the user sample
>> period.
> 
> But why it this a problem?
> 
> Since its sample based there's really nothing much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
> 
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
> 
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
> 

Basically I am OK with the current implementation, except for one
scenario we've met: when the kernel goes wrong for some reason and
suddenly runs 100% sys for seconds (even triggering a softlockup), the
statistics monitor doesn't reflect that fact, which confuses people.
One example with our per-cgroup top: we once saw "20% usr, 80% sys"
displayed while the kernel was actually in a busy loop (100% sys) at
that moment, and the tick-based samples in such a case are of course
all sys.


Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-16 Thread Xunlei Pang
On 7/17/18 1:41 AM, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
>> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
>>> Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Gitweb: 
>>> https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Author: Xunlei Pang 
>>> AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
>>> Committer:  Ingo Molnar 
>>> CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
>>>
>>> sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
>>>
>>> If users access "/proc/pid/stat", the utime and stime ratio in the
>>> current SAMPLE period are excepted, but currently cputime_adjust()
>>> always calculates with the ratio of the WHOLE lifetime of the process.
>>>
>>> This results in inaccurate utime and stime in "/proc/pid/stat". For
>>> example,  a process runs for a while with "50% usr, 0% sys", then
>>> followed by "100% sys". For later while, the following is excepted:
>>>
>>>   0.0 usr,  100.0 sys
>>>
>>> but we get:
>>>
>>>   10.0 usr,  90.0 sys
>>>
>>> This patch uses the accurate ratio in cputime_adjust() to address the
>>> issue. A new 'task_cputime' type field is added in prev_cputime to record
>>> previous 'task_cputime' so that we can get the elapsed times as the accurate
>>> ratio.
>>
>> Ingo, please make this one go away.
> 
> Sure, I've removed it from sched/core.
> 

Hi Ingo, Peter, Frederic,

I captured some runtime data using trace to explain it; I hope this
illustrates the motivation behind my patch. Any help improving the
changelog after reading this would be appreciated.


Here are the simple trace_printk calls I added to capture the real data:

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..b65c1f250941 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -611,6 +611,9 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 	stime = curr->stime;
 	utime = curr->utime;
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("task tick-based utime %lld stime %lld, scheduler rtime %lld\n", utime, stime, rtime);
+
 	/*
 	 * If either stime or utime are 0, assume all runtime is userspace.
 	 * Once a task gets some ticks, the monotonicy code at 'update:'
@@ -651,9 +654,14 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		stime = rtime - utime;
 	}
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("result: old utime %lld stime %lld, new utime %lld stime %lld\n",
+			prev->utime, prev->stime, utime, stime);
+
 	prev->stime = stime;
 	prev->utime = utime;
 out:
 	*ut = prev->utime;
 	*st = prev->stime;
 	raw_spin_unlock_irqrestore(&prev->lock, flags);


Using the reproducer I described in the changelog: it runs for a while
at "50% usr, 0% sys" and then switches to "100% sys". A shell script
reads its /proc/pid/stat at one-second intervals and got:
50.0 usr,  0.0 sys
51.0 usr,  0.0 sys
50.0 usr,  0.0 sys
...
9.0 usr,  91.0 sys
9.0 usr,  91.0 sys


The trace data corresponds to the last sample period:
trace entry 1:
 cat-20755 [022] d...  1370.106496: cputime_adjust: task
tick-based utime 36256000 stime 255100, scheduler rtime 333060702626
 cat-20755 [022] d...  1370.106497: cputime_adjust: result:
old utime 330729718142 stime 2306983867, new utime 330733635372 stime
2327067254

trace entry 2:
 cat-20773 [005] d...  1371.109825: cputime_adjust: task
tick-based utime 36256700 stime 354700, scheduler rtime 334063718912
 cat-20773 [005] d...  1371.109826: cputime_adjust: result:
old utime 330733635372 stime 2327067254, new utime 330827229702 stime
3236489210

1) expected behaviour
Let's compare the last two trace entries (all the data below is in ns):
task tick-based utime: 36256000->36256700 increased 700
task tick-based stime: 255100  ->354700   increased 99600
scheduler rtime:   333060702626->334063718912 increased 1003016286

The application actually runs at almost 100% sys at that moment; we can
use the increases in the task's tick-based utime and stime to double-check:
99600/(700+99600) > 99% sys

2) the current cputime_adjust() inaccurate result
For the current cputime_adjust(), however, we get the following adjusted
utime and stime increases in this sample period:
adjusted utime: 330733635372->330827229702 increased 93594330
adjusted sti
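
As a quick sanity check of the arithmetic in the (truncated) analysis above,
the two sys shares can be recomputed from the quoted deltas with a few lines
of plain userspace C:

#include <stdio.h>

int main(void)
{
	/* deltas between the two trace entries above, in nanoseconds */
	double tick_utime_delta = 700;
	double tick_stime_delta = 99600;
	double adj_utime_delta  = 93594330;
	double adj_stime_delta  = 909421956;

	printf("tick-based sys share: %.1f%%\n",
	       100.0 * tick_stime_delta / (tick_utime_delta + tick_stime_delta));
	printf("adjusted sys share:   %.1f%%\n",
	       100.0 * adj_stime_delta / (adj_utime_delta + adj_stime_delta));
	/* prints roughly 99.3% and 90.7%, consistent with the >99% and ~91%
	 * figures quoted above */
	return 0;
}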

[tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-15 Thread tip-bot for Xunlei Pang
Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Gitweb: https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Author: Xunlei Pang 
AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 16 Jul 2018 00:28:31 +0200

sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

If users access "/proc/pid/stat", the utime and stime ratio in the
current SAMPLE period are excepted, but currently cputime_adjust()
always calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example,  a process runs for a while with "50% usr, 0% sys", then
followed by "100% sys". For later while, the following is excepted:

  0.0 usr,  100.0 sys

but we get:

  10.0 usr,  90.0 sys

This patch uses the accurate ratio in cputime_adjust() to address the
issue. A new 'task_cputime' type field is added in prev_cputime to record
previous 'task_cputime' so that we can get the elapsed times as the accurate
ratio.

Signed-off-by: Xunlei Pang 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Luiz Capitulino 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: baoyou@gmail.com
Link: http://lkml.kernel.org/r/20180709145843.126583-1-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar 
---
 include/linux/sched.h | 34 
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c| 61 ---
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..fedc69d4a425 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime: time spent in user mode, in nanoseconds
+ * @stime: time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+   u64 utime;
+   u64 stime;
+   unsigned long long  sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
  * @stime: time spent in system mode
+ * @cputime: previous task_cputime to calculate utime/stime
  * @lock: protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
u64 utime;
u64 stime;
+   struct task_cputime cputime;
raw_spinlock_t  lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in nanoseconds
- * @stime: time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-   u64 utime;
-   u64 stime;
-   unsigned long long  sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp   utime
 #define prof_exp   stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
prev->utime = prev->stime = 0;
+   prev->cputime.utime = 0;
+   prev->cputime.stime = 0;
+   prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-09 Thread Xunlei Pang
Hi Peter,

On 7/9/18 6:48 PM, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 01:52:38PM +0800, Xunlei Pang wrote:
>> Please see the enclosure for the reproducer cputime_adjust.tgz
> 
> No, I'm not going to reverse engineer something if you cannot even
> explain what the problem is.
> 

I rewrote the whole changelog and sent v2, please have a look.

Thanks,
Xunlei


[PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

2018-07-09 Thread Xunlei Pang
If users access "/proc/pid/stat", the utime and stime ratio in the
current SAMPLE period are excepted, but currently cputime_adjust()
always calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example,  a process runs for a while with "50% usr, 0% sys", then
followed by "100% sys". For later while, the following is excepted:
  0.0 usr,  100.0 sys
but we got:
  10.0 usr,  90.0 sys

This patch uses the accurate ratio in cputime_adjust() to address the
issue. A new task_cputime type field is added in prev_cputime to record
previous task_cputime so that we can get the elapsed times as the accurate
ratio.

Signed-off-by: Xunlei Pang 
---
v1->v2:
 - Rewrite the changelog.

 include/linux/sched.h | 34 
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c| 61 ---
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..9cb76005b638 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime: time spent in user mode, in nanoseconds
+ * @stime: time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+   u64 utime;
+   u64 stime;
+   unsigned long long  sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
  * @stime: time spent in system mode
+ * @cputime: previous task_cputime to calculate utime/stime
  * @lock: protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
u64 utime;
u64 stime;
+   struct task_cputime cputime;
raw_spinlock_t  lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in nanoseconds
- * @stime: time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-   u64 utime;
-   u64 stime;
-   unsigned long long  sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp   utime
 #define prof_exp   stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
prev->utime = prev->stime = 0;
+   prev->cputime.utime = 0;
+   prev->cputime.stime = 0;
+   prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
u64 *ut, u64 *st)
 {
-   u64 rtime, stime, utime;
+   u64 rtime_delta, stime_delta, utime_delta;
unsigned long flags;
 
 	/* Serialize concurrent callers such that we can honour our guarantees */
 	raw_spin_lock_irqsave(&prev->lock, flags);
-   rtime = curr->sum_exec_run
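
The diff body above is cut off by the archive. As a rough illustration of
the delta-based idea (a hypothetical sketch, not the submitted patch; the
helper name is made up, and the real code would reuse scale_stime() instead
of the plain multiply/divide shown here, which can overflow for large
deltas):

/*
 * Hypothetical sketch only: scale just the utime/stime/rtime increments
 * since the previous snapshot and accumulate them into prev, so each
 * sample period gets its own usr/sys ratio.
 */
static void cputime_adjust_delta_sketch(struct task_cputime *curr,
					struct prev_cputime *prev,
					u64 *ut, u64 *st)
{
	u64 rtime_delta = curr->sum_exec_runtime - prev->cputime.sum_exec_runtime;
	u64 utime_delta = curr->utime - prev->cputime.utime;
	u64 stime_delta = curr->stime - prev->cputime.stime;
	u64 total_delta = utime_delta + stime_delta;
	u64 stime_adj = 0;

	if (total_delta)
		/* the real code would use scale_stime() here to avoid overflow */
		stime_adj = div64_u64(rtime_delta * stime_delta, total_delta);

	prev->utime += rtime_delta - stime_adj;
	prev->stime += stime_adj;
	prev->cputime = *curr;	/* snapshot for the next sample period */

	*ut = prev->utime;
	*st = prev->stime;
}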

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-08 Thread Xunlei Pang
Hi Peter,

On 7/5/18 9:21 PM, Xunlei Pang wrote:
> On 7/5/18 6:46 PM, Peter Zijlstra wrote:
>> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>>> tick-based whole utime is utime_0, tick-based whole stime
>>> is stime_0, scheduler time is rtime_0. 
>>
>>> For a long time, the process runs mainly in userspace with
>>> run-sleep patterns, and because two different clocks, it
>>> is possible to have the following condition:
>>>   rtime_0 < utime_0 (as with little stime_0)
>>
>> I don't follow... what?
>>
>> Why are you, and why do you think it makes sense to, compare rtime_0
>> against utime_0 ?
>>
>> The [us]time_0 are, per your earlier definition, ticks. They're not an
>> actual measure of time. Do not compare the two, that makes no bloody
>> sense.
>>
> 
> [us]time_0 is task_struct:utime{stime}, I cited directly from
> cputime_adjust(), both in nanoseconds. I assumed "rtime_0 < utime_0"
> here to simple the following proof to help explain the problem we met.
> 

Please see the enclosure for the reproducer cputime_adjust.tgz
(process_top.sh, usr_sys.c):
gcc usr_sys.c -o usr_sys
First, the function consume_sys() in usr_sys.c yields 100% sys,
which can be verified as follows:
$ taskset -c 0 ./usr_sys 1
$ ./process_top.sh $(pidof usr_sys)
0.0 usr,  100.0 sys
0.0 usr,  100.0 sys
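
The attached usr_sys.c is not reproduced in this archive; a hypothetical
stand-in for a consume_sys()-style loop, just to show the kind of load
meant by "100% sys", could be:

/* Hypothetical stand-in for consume_sys(): burn CPU almost entirely in
 * kernel mode by issuing cheap syscalls in a tight loop.
 */
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	for (;;)
		syscall(SYS_getpid);	/* time is accounted as system time */
	return 0;
}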


Tested on my local box on 4.17.0 by executing "taskset -c 0 ./usr_sys",
then executing "./process_top.sh $(pidof usr_sys)" to watch.
1) Before this patch
50.0 usr,  0.0 sys
50.0 usr,  1.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
49.0 usr,  4.0 sys //switch to consume 100% sys, ignore this line
12.0 usr,  88.0 sys
11.0 usr,  89.0 sys
10.0 usr,  90.0 sys
10.0 usr,  90.0 sys
9.0 usr,  91.0 sys
8.0 usr,  91.0 sys

Obviously around 10% sys was wrongly attributed to usr.

2) After this patch
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
50.0 usr,  0.0 sys
11.0 usr,  76.0 sys //switch to consume 100% sys, ignore this line
1.0 usr,  100.0 sys
0.0 usr,  100.0 sys
1.0 usr,  100.0 sys
0.0 usr,  100.0 sys
0.0 usr,  100.0 sys

So after this patch it displays the correct result, as we expected.

Thanks,
Xunlei


cputime_adjust.tgz
Description: GNU Zip compressed data


Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-05 Thread Xunlei Pang
On 7/5/18 6:46 PM, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 08:22:42PM +0800, Xunlei Pang wrote:
>> tick-based whole utime is utime_0, tick-based whole stime
>> is stime_0, scheduler time is rtime_0. 
> 
>> For a long time, the process runs mainly in userspace with
>> run-sleep patterns, and because two different clocks, it
>> is possible to have the following condition:
>>   rtime_0 < utime_0 (as with little stime_0)
> 
> I don't follow... what?
> 
> Why are you, and why do you think it makes sense to, compare rtime_0
> against utime_0 ?
> 
> The [us]time_0 are, per your earlier definition, ticks. They're not an
> actual measure of time. Do not compare the two, that makes no bloody
> sense.
> 

[us]time_0 is task_struct::utime/stime, which I cited directly from
cputime_adjust(); both are in nanoseconds. I assumed "rtime_0 < utime_0"
here to simplify the following proof and help explain the problem we met.


Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-07-04 Thread Xunlei Pang
On 7/2/18 11:21 PM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Jun 26, 2018 at 05:49:08PM +0200, Peter Zijlstra wrote:
>> Well, no, because the Changelog is incomprehensible and the patch
>> doesn't really have useful comments, so I'll have to reverse engineer
>> the entire thing, and I've just not had time for that.
> 
> Just as an additional data point, we also sometimes see artifacts from
> cpu_adjust_time() in the form of per-task user or sys time getting
> stuck for some period (in extreme cases for over a minute) while the
> application isn't doing anything differently.  We're telling the users
> that it's an inherent sampling artifact but it'd be nice to improve it
> if possible without adding noticeable overhead.  No idea whether this
> patch's approach is a good one tho.

The patch has no noticeable overhead except for the extra cputime fields
added to task_struct. We've been running this patch on our servers for
months, and it has looked good so far.

Thanks!


[tip:sched/core] sched/fair: Advance global expiration when period timer is restarted

2018-07-03 Thread tip-bot for Xunlei Pang
Commit-ID:  f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Gitweb: https://git.kernel.org/tip/f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Author: Xunlei Pang 
AuthorDate: Wed, 20 Jun 2018 18:18:34 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Advance global expiration when period timer is restarted

When the period timer gets restarted after some idle time,
start_cfs_bandwidth() doesn't update the expiration information, so
expire_cfs_rq_runtime() will see cfs_rq->runtime_expires smaller than the
rq clock and fall into the clock drift logic, wasting needless CPU cycles
on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.

Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 791707c56886..840b92ee6f89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+   u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
-	if (!cfs_b->period_active) {
-		cfs_b->period_active = 1;
-		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
-	}
+	if (cfs_b->period_active)
+		return;
+
+	cfs_b->period_active = 1;
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)


[tip:sched/core] sched/fair: Fix bandwidth timer clock drift condition

2018-07-03 Thread tip-bot for Xunlei Pang
Commit-ID:  512ac999d2755d2b7109e996a76b6fb8b888631d
Gitweb: https://git.kernel.org/tip/512ac999d2755d2b7109e996a76b6fb8b888631d
Author: Xunlei Pang 
AuthorDate: Wed, 20 Jun 2018 18:18:33 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Fix bandwidth timer clock drift condition

I noticed that cgroup task groups constantly get throttled even
if they have low CPU usage; this causes some jitter in the response
time of some of our business containers when CPU quotas are enabled.

It's very simple to reproduce:

  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 10 > cpu.cfs_quota_us
  echo $$ > tasks

then repeat:

  cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after the period timer is re-armed.

The current condition used to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires are actually the same when clock
drift happens, so this condition can never hit. The original design was
correctly done by this commit:

  a9cf55b28610 ("sched: Expire invalid runtime")

... but was changed to be the current implementation due to its locking bug.

This patch introduces another way: it adds a new field to both structures
cfs_rq and cfs_bandwidth to record the expiration update sequence, and
uses them to figure out whether clock drift has happened (true if they are
equal).

Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlp...@linux.alibaba.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c  | 14 --
 kernel/sched/sched.h |  6 --
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1866e64792a7..791707c56886 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+   cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
u64 amount = 0, min_amount, expires;
+   int expires_seq;
 
/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
+   expires_seq = cfs_b->expires_seq;
expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * spread between our sched_clock and the one on which runtime was
 * issued.
 */
-   if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+   if (cfs_rq->expires_seq != expires_seq) {
+   cfs_rq->expires_seq = expires_seq;
cfs_rq->runtime_expires = expires;
+   }
 
return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 * has not truly expired.
 *
 * Fortunately we can check determine whether this the case by checking
-* whether the global deadline has advanced. It is valid to compare
-* cfs_b->runtime_expires without any locks since we only care about
-* exact equality, so a partial write will still work.
+* whether the global deadline(cfs_b->expires_seq) has advanced.
 */
-
-   if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+   if (cfs_rq->expires_seq == cfs_b->expires_seq) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27ddec334601..c7742dcc136c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
u64 runtime;
s64 hierarchical_quota;
u64 runtime_expires;
+   int expires_seq;
 
-   int idle;
-   int   

Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-27 Thread Xunlei Pang
On 6/26/18 11:49 PM, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 08:19:49PM +0800, Xunlei Pang wrote:
>> On 6/22/18 3:15 PM, Xunlei Pang wrote:
>>> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
>>> and encountered a problem that user and sys usages are wrongly
>>> split sometimes.
>>>
>>> Run tasks with some random run-sleep pattern for a long time, and
>>> when tick-based time and scheduler sum_exec_runtime hugely drifts
>>> apart(scheduler sum_exec_runtime is less than tick-based time),
>>> the current implementation of cputime_adjust() will produce less
>>> sys usage than the actual use after changing to run a different
>>> workload pattern with high sys. This is because total tick-based
>>> utime and stime are used to split the total sum_exec_runtime.
>>>
>>> Same problem exists on utime and stime from "/proc//stat".
>>>
>>> [Example]
>>> Run some random run-sleep patterns for minutes, then change to run
>>> high sys pattern, and watch.
>>> 1) standard "top"(which is the correct one):
>>>4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>> 2) our tool parsing utime and stime from "/proc//stat":
>>>20.5 usr, 78.4 sys
>>> We can see "20.5 usr" displayed in 2) was incorrect, it recovers
>>> gradually with time: 9.7 usr, 89.5 sys
>>>
>> High sys probably means there's something abnormal on the kernel
>> path, it may hide issues, so we should make it fairly reliable.
>> It can easily hit this problem with our per-cgroup statistics.
>>
>> Hi Peter, any comment on this patch?
> 
> Well, no, because the Changelog is incomprehensible and the patch
> doesn't really have useful comments, so I'll have to reverse engineer
> the entire thing, and I've just not had time for that.
> 

Let me try my best to describe it again.

There are two types of run time for a process:
1) task_struct::utime, task_struct::stime in ticks.
2) scheduler task_struct::se.sum_exec_runtime(rtime) in ns.

When vtime accounting is not in use, the utime/stime fields of
/proc/pid/stat are calculated by cputime_adjust(), which
splits the precise rtime in proportion to the tick-based
utime and stime.
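
Put differently, ignoring integer rounding and the monotonicity clamping,
the split cputime_adjust() performs is essentially the following
(illustrative pseudo-C, not the exact kernel code):

	/* proportional split of the precise rtime by the tick ratio */
	stime_out = rtime * stime / (utime + stime);
	utime_out = rtime - stime_out;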

However, cputime_adjust() always does the split using the
total utime/stime of the process, which may cause a wrong
split in some cases, e.g.

A typical statistic collector accesses "/proc/pid/stat".
1) moment t0
After /proc/pid/stat is read at t0:
the tick-based whole utime is utime_0, the tick-based whole stime
is stime_0, and the scheduler time is rtime_0. The adjusted utime
calculated by cputime_adjust() is autime_0, and the adjusted stime
is astime_0, so astime_0 = rtime_0*stime_0/(utime_0+stime_0).

For a long time the process runs mainly in userspace with
run-sleep patterns, and because the two clocks differ it
is possible to end up with the following condition:
  rtime_0 < utime_0 (with stime_0 being small)

2) moment t1 (after dt, i.e. t0+dt)
Then the process suddenly runs a 100% sys workload lasting "dt".
When /proc/pid/stat is accessed at t1 = t0+dt, both rtime and
stime have increased by "dt", so cputime_adjust() calculates the
new adjusted astime_1 as follows:
  (rtime_0+dt)*(stime_0+dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt+stime_0*dt+dt*dt)/(utime_0+stime_0+dt)
= (rtime_0*stime_0+rtime_0*dt-utime_0*dt)/(utime_0+stime_0+dt) + dt
< rtime_0*stime_0/(utime_0+stime_0+dt) + dt   (since rtime_0 < utime_0)
< rtime_0*stime_0/(utime_0+stime_0) + dt
= astime_0 + dt

The actual astime_1 should be "astime_0+dt"(as it runs 100%
sys during dt), but the caculated figure by cputime_adjust()
becomes much smaller, as a result the statistics collector
shows less cpu sys usage than the actual one.
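
To make the inequality concrete, here is a toy calculation with made-up
numbers (plain floating point for readability, not the kernel's integer
arithmetic):

#include <stdio.h>

int main(void)
{
	/* made-up sample at t0: mostly user time, and rtime_0 < utime_0 */
	double utime0 = 1000.0, stime0 = 10.0, rtime0 = 800.0;
	double dt = 500.0;			/* then a pure-sys interval */

	double astime0 = rtime0 * stime0 / (utime0 + stime0);
	/* what splitting the new *totals* yields at t1 = t0 + dt */
	double astime1 = (rtime0 + dt) * (stime0 + dt) / (utime0 + stime0 + dt);

	printf("astime_0          = %6.2f\n", astime0);		/* ~  7.92 */
	printf("expected astime_1 = %6.2f\n", astime0 + dt);	/* ~507.92 */
	printf("computed astime_1 = %6.2f\n", astime1);		/* ~439.07 */
	return 0;
}

With these numbers the whole-sum split yields an adjusted stime of about
439 instead of the expected ~508, which matches the undershoot described
above.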

That's why we occasionally observed the incorrect cpu usage
described in the changelog:
[Example]
Run some random run-sleep patterns for minutes, then change
to run high sys pattern, and watch.
1) standard "top"(which is the correct one):
   4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) Parsing utime and stime from "/proc//stat":
   20.5 usr, 78.4 sys
We can see "20.5 usr" displayed in 2) was incorrect, it recovers
gradually with time: 9.7 usr, 89.5 sys


Thanks,
Xunlei


Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-26 Thread Xunlei Pang
On 6/22/18 3:15 PM, Xunlei Pang wrote:
> We use per-cgroup cpu usage statistics similar to "cgroup rstat",
> and encountered a problem that user and sys usages are wrongly
> split sometimes.
> 
> Run tasks with some random run-sleep pattern for a long time, and
> when tick-based time and scheduler sum_exec_runtime hugely drifts
> apart(scheduler sum_exec_runtime is less than tick-based time),
> the current implementation of cputime_adjust() will produce less
> sys usage than the actual use after changing to run a different
> workload pattern with high sys. This is because total tick-based
> utime and stime are used to split the total sum_exec_runtime.
> 
> Same problem exists on utime and stime from "/proc//stat".
> 
> [Example]
> Run some random run-sleep patterns for minutes, then change to run
> high sys pattern, and watch.
> 1) standard "top"(which is the correct one):
>4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> 2) our tool parsing utime and stime from "/proc//stat":
>20.5 usr, 78.4 sys
> We can see "20.5 usr" displayed in 2) was incorrect, it recovers
> gradually with time: 9.7 usr, 89.5 sys
> 
High sys usage usually means something abnormal is happening on the
kernel path and may hide real issues, so the accounting should be
fairly reliable. Our per-cgroup statistics can easily hit this problem.

Hi Peter, any comment on this patch?


Re: [PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-25 Thread Xunlei Pang
On 6/22/18 6:35 PM, kbuild test robot wrote:
> Hi Xunlei,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on v4.18-rc1 next-20180622]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Xunlei-Pang/sched-cputime-Ensure-correct-utime-and-stime-proportion/20180622-153720
> reproduce: make htmldocs
> 
> All warnings (new ones prefixed by >>):
> 
>WARNING: convert(1) not found, for SVG to PDF conversion install 
> ImageMagick (https://www.imagemagick.org)
>mm/mempool.c:228: warning: Function parameter or member 'pool' not 
> described in 'mempool_init'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ibss' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.connect' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.keys' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ie' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ie_len' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.bssid' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.ssid' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.default_key' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.default_mgmt_key' not described in 'wireless_dev'
>include/net/cfg80211.h:4279: warning: Function parameter or member 
> 'wext.prev_bssid_valid' not described in 'wireless_dev'
>include/net/mac80211.h:2282: warning: Function parameter or member 
> 'radiotap_timestamp.units_pos' not described in 'ieee80211_hw'
>include/net/mac80211.h:2282: warning: Function parameter or member 
> 'radiotap_timestamp.accuracy' not described in 'ieee80211_hw'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.rates' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.rts_cts_rate_idx' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.use_rts' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.use_cts_prot' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.short_preamble' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.skip_table' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.jiffies' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.vif' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.hw_key' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.flags' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'control.enqueue_time' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 'ack' 
> not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'ack.cookie' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.rates' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.ack_signal' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.ampdu_ack_len' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.ampdu_len' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.antenna' not described in 'ieee80211_tx_info'
>include/net/mac80211.h:955: warning: Function parameter or member 
> 'status.tx_time' not described in 'ieee80211_tx_info'
>include/net/mac80211

[PATCH] sched/cputime: Ensure correct utime and stime proportion

2018-06-22 Thread Xunlei Pang
We use per-cgroup cpu usage statistics similar to "cgroup rstat",
and encountered a problem that user and sys usages are wrongly
split sometimes.

Run tasks with some random run-sleep pattern for a long time, so that
the tick-based time and the scheduler sum_exec_runtime drift hugely
apart (with scheduler sum_exec_runtime less than the tick-based time).
After switching to a workload pattern with high sys, the current
implementation of cputime_adjust() will report less sys usage than is
actually used. This is because the total tick-based utime and stime
are used to split the total sum_exec_runtime.

Same problem exists on utime and stime from "/proc//stat".

[Example]
Run some random run-sleep patterns for minutes, then change to run
high sys pattern, and watch.
1) standard "top"(which is the correct one):
   4.6 us, 94.5 sy,  0.0 ni,  0.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
2) our tool parsing utime and stime from "/proc//stat":
   20.5 usr, 78.4 sys
We can see "20.5 usr" displayed in 2) was incorrect, it recovers
gradually with time: 9.7 usr, 89.5 sys

This patch fixes the issue by doing the calculation in cputime_adjust()
on the time values elapsed since the last call, and accumulating the
corresponding results into prev_cputime. A new field of type
task_cputime is added to struct prev_cputime to record the previous
task_cputime so that the elapsed time deltas can be obtained.
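
A rough, self-contained sketch of the delta-based splitting described
above; the structure and helper names are illustrative only and do not
match the actual patch (which also has to keep the results monotonic and
work in integer arithmetic):

#include <stdio.h>

/* Illustrative model: split only the *delta* since the last sample and
 * accumulate it, instead of re-splitting the whole totals every time. */
struct sample { double utime, stime, rtime; };

struct adjusted {
	struct sample prev;		/* snapshot taken at the last call */
	double out_utime, out_stime;	/* accumulated adjusted results */
};

static void adjust_delta(struct adjusted *a, const struct sample *cur)
{
	double du = cur->utime - a->prev.utime;
	double ds = cur->stime - a->prev.stime;
	double dr = cur->rtime - a->prev.rtime;

	if (du + ds > 0) {
		/* split only the newly elapsed rtime by the new tick ratio */
		a->out_stime += dr * ds / (du + ds);
		a->out_utime += dr * du / (du + ds);
	}
	a->prev = *cur;
}

int main(void)
{
	struct adjusted a = { { 0, 0, 0 }, 0, 0 };
	struct sample t0 = { 1000, 10, 800 };	/* mostly user so far */
	struct sample t1 = { 1000, 510, 1300 };	/* then 500 of pure sys */

	adjust_delta(&a, &t0);
	adjust_delta(&a, &t1);
	printf("utime=%.2f stime=%.2f\n", a.out_utime, a.out_stime);
	return 0;
}

Running this prints utime=792.08 stime=507.92, i.e. the pure-sys interval
now shows up fully in stime, unlike the whole-sum split shown earlier in
the thread.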

Signed-off-by: Xunlei Pang 
---
 include/linux/sched.h | 33 +++
 include/linux/sched/cputime.h | 12 -
 kernel/sched/cputime.c| 61 ---
 3 files changed, 51 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..5108ac8414e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,22 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime: time spent in user mode, in nanoseconds
+ * @stime: time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+   u64 utime;
+   u64 stime;
+   unsigned long long  sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -236,26 +252,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
u64 utime;
u64 stime;
+   struct task_cputime cputime;
raw_spinlock_t  lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in nanoseconds
- * @stime: time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:  total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-   u64 utime;
-   u64 stime;
-   unsigned long long  sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp   utime
 #define prof_exp   stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct 
task_struct *tsk,
atomic64_add(ns, >cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
prev->utime = prev->stime = 0;
+   prev->cputime.utime = 0;
+   prev->cputime.stime = 0;
+   prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   prev_cputime_clear(prev);
raw_spin_lock_init(>lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 s

Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-21 Thread Xunlei Pang
On 6/21/18 4:08 PM, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 11:56:56AM +0800, Xunlei Pang wrote:
>>>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
>>>> cfs_b->quota/period")
>>>> Cc: Ben Segall 
>>>
>>> Reviewed-By: Ben Segall 
>>
>> Thanks Ben :-)
>>
>> Hi Peter, could you please have a look at them?
> 
> I grabbed them, thanks guys!
> 
>>
>> Cc sta...@vger.kernel.org
> 
> For both or just the first?
> 

Both, thanks!


Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition

2018-06-20 Thread Xunlei Pang
On 6/21/18 1:01 AM, bseg...@google.com wrote:
> Xunlei Pang  writes:
> 
>> I noticed the group constantly got throttled even it consumed
>> low cpu usage, this caused some jitters on the response time
>> to some of our business containers enabling cpu quota.
>>
>> It's very simple to reproduce:
>>   mkdir /sys/fs/cgroup/cpu/test
>>   cd /sys/fs/cgroup/cpu/test
>>   echo 10 > cpu.cfs_quota_us
>>   echo $$ > tasks
>> then repeat:
>>   cat cpu.stat |grep nr_throttled  // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" after period timer is re-armed.
>>
>> The current condition to judge clock drift in expire_cfs_rq_runtime()
>> is wrong, the two runtime_expires are actually the same when clock
>> drift happens, so this condtion can never hit. The orginal design was
>> correctly done by commit a9cf55b28610 ("sched: Expire invalid runtime"),
>> but was changed to be the current one due to its locking issue.
>>
>> This patch introduces another way, it adds a new field in both structures
>> cfs_rq and cfs_bandwidth to record the expiration update sequence, and
>> use them to figure out if clock drift happens(true if they equal).
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some 
>> cfs_b->quota/period")
>> Cc: Ben Segall 
> 
> Reviewed-By: Ben Segall 

Thanks Ben :-)

Hi Peter, could you please have a look at them?

Cc sta...@vger.kernel.org

> 
>> Signed-off-by: Xunlei Pang 
>> ---
>>  kernel/sched/fair.c  | 14 --
>>  kernel/sched/sched.h |  6 --
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..e6bb68d52962 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct 
>> cfs_bandwidth *cfs_b)
>>  now = sched_clock_cpu(smp_processor_id());
>>  cfs_b->runtime = cfs_b->quota;
>>  cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> +cfs_b->expires_seq++;
>>  }
>>  
>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  struct task_group *tg = cfs_rq->tg;
>>  struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>  u64 amount = 0, min_amount, expires;
>> +int expires_seq;
>>  
>>  /* note: this is a positive sum as runtime_remaining <= 0 */
>>  min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  cfs_b->idle = 0;
>>  }
>>  }
>> +expires_seq = cfs_b->expires_seq;
>>  expires = cfs_b->runtime_expires;
>>  raw_spin_unlock(_b->lock);
>>  
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * spread between our sched_clock and the one on which runtime was
>>   * issued.
>>   */
>> -if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +if (cfs_rq->expires_seq != expires_seq) {
>> +cfs_rq->expires_seq = expires_seq;
>>  cfs_rq->runtime_expires = expires;
>> +}
>>  
>>  return cfs_rq->runtime_remaining > 0;
>>  }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq 
>> *cfs_rq)
>>   * has not truly expired.
>>   *
>>   * Fortunately we can check determine whether this the case by checking
>> - * whether the global deadline has advanced. It is valid to compare
>> - * cfs_b->runtime_expires without any locks since we only care about
>> - * exact equality, so a partial write will still work.
>> + * whether the global deadline(cfs_b->expires_seq) has advanced.
>>   */
>> -
>> -if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> +if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>  /* extend local deadline, drift is bounded above by 2 ticks */
>>  cfs_rq->runtime_expires += TICK_NSEC;
>>  } else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e977e04f8daf
