Re: [PATCH][RFC v3] PM / Hibernate: Feed the watchdog when creating snapshot

2017-08-19 Thread Chen Yu
On Thu, Aug 17, 2017 at 01:28:06PM +0200, Michal Hocko wrote:
> On Thu 17-08-17 12:04:34, Chen Yu wrote:
> [...]
> >  #ifdef CONFIG_HIBERNATION
> >  
> > +/* Touch watchdog for every WD_INTERVAL_PAGE pages. */
> > +#define WD_INTERVAL_PAGE   1000
> 
> traversing 1000 pages should never take too much time, so this could be
> overly aggressive. 100k pages could be acceptable as well. But I haven't
> measured that, so I might easily be wrong here. So this is just my 2c.
>
After checking the log:
[ 1144.690405] done (allocated 6590003 pages)
[ 1144.694971] PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)

The default NMI watchdog timeout is 10 seconds AFAIK. In case the user
lowers it to 1 second, a safe interval would be roughly 6590003/20 ~= 320k
pages (about one watchdog touch per second at the rate above), but other
machines might run at a lower frequency, so 100k should be more robust.
I'll change it to 100k.
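
(Back-of-the-envelope check based on the log above; the constants and
helper below are illustrative only and not part of the patch:)

	/*
	 * Illustrative sketch: worst-case gap between two
	 * touch_nmi_watchdog() calls, using the throughput seen in the
	 * log above (~6.59M pages in ~19.89s, i.e. roughly 330k pages/s).
	 */
	#define OBSERVED_PAGES_PER_SEC	330000UL	/* from the 1.5TB machine */
	#define WD_INTERVAL_PAGE	100000UL	/* value proposed for v4 */

	static unsigned long wd_worst_case_gap_ms(void)
	{
		/* ~300 ms, comfortably below even a 1-second watchdog timeout */
		return WD_INTERVAL_PAGE * 1000UL / OBSERVED_PAGES_PER_SEC;
	}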
> > +
> >  void mark_free_pages(struct zone *zone)
> >  {
> > unsigned long pfn, max_zone_pfn;
> > unsigned long flags;
> > -   unsigned int order, t;
> > +   unsigned int order, t, page_num = 0;
> > struct page *page;
> >  
> > if (zone_is_empty(zone))
> > @@ -2548,6 +2552,9 @@ void mark_free_pages(struct zone *zone)
> > if (pfn_valid(pfn)) {
> > page = pfn_to_page(pfn);
> >  
> > +   if (!((page_num++) % WD_INTERVAL_PAGE))
> > +   touch_nmi_watchdog();
> > +
> > if (page_zone(page) != zone)
> > continue;
> >  
> > @@ -2555,14 +2562,19 @@ void mark_free_pages(struct zone *zone)
> > swsusp_unset_page_free(page);
> > }
> >  
> > +   page_num = 0;
> > +
> 
> this part doesn't make much sense to me. You are still inside the same
> IRQ-disabled section, so why would you want to start counting from 0
> again? Not that this would make any difference in real life, but the code
> is not logical.
> 
Ok, will delete this.
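(For clarity, a minimal sketch of what the two loops are expected to look
like once the reset is dropped - one running counter across both passes in
the same IRQ-off section; illustrative only, not the final v4 hunk:)

	unsigned int page_num = 0;

	/* First pass: walk the zone's pfn range. */
	for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
		if (!(page_num++ % WD_INTERVAL_PAGE))
			touch_nmi_watchdog();
		/* ... swsusp_unset_page_free() for pages in this zone ... */
	}

	/* Second pass: walk the free lists; keep counting, no reset. */
	for_each_migratetype_order(order, t) {
		list_for_each_entry(page, &zone->free_area[order].free_list[t], lru) {
			unsigned long i;

			pfn = page_to_pfn(page);
			for (i = 0; i < (1UL << order); i++) {
				if (!(page_num++ % WD_INTERVAL_PAGE))
					touch_nmi_watchdog();
				swsusp_set_page_free(pfn_to_page(pfn + i));
			}
		}
	}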
> > for_each_migratetype_order(order, t) {
> > list_for_each_entry(page,
> > &zone->free_area[order].free_list[t], lru) {
> > unsigned long i;
> >  
> > pfn = page_to_pfn(page);
> > -   for (i = 0; i < (1UL << order); i++)
> > +   for (i = 0; i < (1UL << order); i++) {
> > +   if (!((page_num++) % WD_INTERVAL_PAGE))
> > +   touch_nmi_watchdog();
> > swsusp_set_page_free(pfn_to_page(pfn + i));
> > +   }
> > }
> > }
> > spin_unlock_irqrestore(&zone->lock, flags);
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs
Thanks,
Yu


Re: [PATCH][RFC v3] PM / Hibernate: Feed the watchdog when creating snapshot

2017-08-17 Thread Michal Hocko
On Thu 17-08-17 12:04:34, Chen Yu wrote:
[...]
>  #ifdef CONFIG_HIBERNATION
>  
> +/* Touch watchdog for every WD_INTERVAL_PAGE pages. */
> +#define WD_INTERVAL_PAGE 1000

traversing 1000 pages should never take too much time, so this could be
overly aggressive. 100k pages could be acceptable as well. But I haven't
measured that, so I might easily be wrong here. So this is just my 2c.

> +
>  void mark_free_pages(struct zone *zone)
>  {
>   unsigned long pfn, max_zone_pfn;
>   unsigned long flags;
> - unsigned int order, t;
> + unsigned int order, t, page_num = 0;
>   struct page *page;
>  
>   if (zone_is_empty(zone))
> @@ -2548,6 +2552,9 @@ void mark_free_pages(struct zone *zone)
>   if (pfn_valid(pfn)) {
>   page = pfn_to_page(pfn);
>  
> + if (!((page_num++) % WD_INTERVAL_PAGE))
> + touch_nmi_watchdog();
> +
>   if (page_zone(page) != zone)
>   continue;
>  
> @@ -2555,14 +2562,19 @@ void mark_free_pages(struct zone *zone)
>   swsusp_unset_page_free(page);
>   }
>  
> + page_num = 0;
> +

this part doesn't make much sense to me. You are still inside the same
IRQ-disabled section, so why would you want to start counting from 0
again? Not that this would make any difference in real life, but the code
is not logical.

>   for_each_migratetype_order(order, t) {
>   list_for_each_entry(page,
>   &zone->free_area[order].free_list[t], lru) {
>   unsigned long i;
>  
>   pfn = page_to_pfn(page);
> - for (i = 0; i < (1UL << order); i++)
> + for (i = 0; i < (1UL << order); i++) {
> + if (!((page_num++) % WD_INTERVAL_PAGE))
> + touch_nmi_watchdog();
>   swsusp_set_page_free(pfn_to_page(pfn + i));
> + }
>   }
>   }
>   spin_unlock_irqrestore(&zone->lock, flags);
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


[PATCH][RFC v3] PM / Hibernate: Feed the watchdog when creating snapshot

2017-08-16 Thread Chen Yu
There is a problem that counting the pages for creating the
hibernation snapshot can take a significant amount of time,
especially on systems with large memory. Since the counting
job is performed with IRQs disabled, this might lead to an NMI lockup.
The following warning was found on a system with 1.5 TB of DRAM:

[ 1124.758184] Freezing user space processes ... (elapsed 0.002 seconds) done.
[ 1124.768721] OOM killer disabled.
[ 1124.847009] PM: Preallocating image memory...
[ 1139.392042] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
[ 1139.392076] CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 
4.13.0-0.rc2.git0.1.fc27.x86_64 #1
[ 1139.392077] task: 9f01971ac000 task.stack: b1a3f325c000
[ 1139.392083] RIP: 0010:memory_bm_find_bit+0xf4/0x100
[ 1139.392084] RSP: 0018:b1a3f325fc20 EFLAGS: 0006
[ 1139.392084] RAX:  RBX: 13b83000 RCX: 9fbe89caf000
[ 1139.392085] RDX: b1a3f325fc30 RSI: 3200 RDI: 9fbeae80
[ 1139.392085] RBP: b1a3f325fc40 R08: 13b8 R09: 9fbe89c54878
[ 1139.392085] R10: b1a3f325fc2c R11: 13b83200 R12: 0400
[ 1139.392086] R13: fd552e0c R14: 9fc1bffd31e0 R15: 0202
[ 1139.392086] FS:  7f3189704180() GS:9fbec8ec() 
knlGS:
[ 1139.392087] CS:  0010 DS:  ES:  CR0: 80050033
[ 1139.392087] CR2: 0085da0f7398 CR3: 01771cf9a000 CR4: 007406e0
[ 1139.392088] DR0:  DR1:  DR2: 
[ 1139.392088] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1139.392088] PKRU: 5554
[ 1139.392089] Call Trace:
[ 1139.392092]  ? memory_bm_set_bit+0x29/0x60
[ 1139.392094]  swsusp_set_page_free+0x2b/0x30
[ 1139.392098]  mark_free_pages+0x147/0x1c0
[ 1139.392099]  count_data_pages+0x41/0xa0
[ 1139.392101]  hibernate_preallocate_memory+0x80/0x450
[ 1139.392102]  hibernation_snapshot+0x58/0x410
[ 1139.392103]  hibernate+0x17c/0x310
[ 1139.392104]  state_store+0xdf/0xf0
[ 1139.392107]  kobj_attr_store+0xf/0x20
[ 1139.392111]  sysfs_kf_write+0x37/0x40
[ 1139.392113]  kernfs_fop_write+0x11c/0x1a0
[ 1139.392117]  __vfs_write+0x37/0x170
[ 1139.392121]  ? handle_mm_fault+0xd8/0x230
[ 1139.392122]  vfs_write+0xb1/0x1a0
[ 1139.392123]  SyS_write+0x55/0xc0
[ 1139.392126]  entry_SYSCALL_64_fastpath+0x1a/0xa5

So avoid the NMI lockup by feeding the watchdog every 1000 pages.

Reported-by: Jan Filipcewicz 
Suggested-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Chen Yu 
---
 mm/page_alloc.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d00f74..0266eb6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include <linux/nmi.h>
 
 #include 
 #include 
@@ -2531,11 +2532,14 @@ void drain_all_pages(struct zone *zone)
 
 #ifdef CONFIG_HIBERNATION
 
+/* Touch watchdog for every WD_INTERVAL_PAGE pages. */
+#define WD_INTERVAL_PAGE   1000
+
 void mark_free_pages(struct zone *zone)
 {
unsigned long pfn, max_zone_pfn;
unsigned long flags;
-   unsigned int order, t;
+   unsigned int order, t, page_num = 0;
struct page *page;
 
if (zone_is_empty(zone))
@@ -2548,6 +2552,9 @@ void mark_free_pages(struct zone *zone)
if (pfn_valid(pfn)) {
page = pfn_to_page(pfn);
 
+   if (!((page_num++) % WD_INTERVAL_PAGE))
+   touch_nmi_watchdog();
+
if (page_zone(page) != zone)
continue;
 
@@ -2555,14 +2562,19 @@ void mark_free_pages(struct zone *zone)
swsusp_unset_page_free(page);
}
 
+   page_num = 0;
+
for_each_migratetype_order(order, t) {
list_for_each_entry(page,
&zone->free_area[order].free_list[t], lru) {
unsigned long i;
 
pfn = page_to_pfn(page);
-   for (i = 0; i < (1UL << order); i++)
+   for (i = 0; i < (1UL << order); i++) {
+   if (!((page_num++) % WD_INTERVAL_PAGE))
+   touch_nmi_watchdog();
swsusp_set_page_free(pfn_to_page(pfn + i));
+   }
}
}
spin_unlock_irqrestore(&zone->lock, flags);
-- 
2.7.4


