Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
>>> Hi Simon, >>> >>> If we use "/sys/devices/system/memory/soft_offline_page" to offline a >>> free page, the value of mce_bad_pages will be added. Then the page is marked >>> HWPoison, but it is still managed by page buddy allocator. >>> >>> So if we offline it again, the value of mce_bad_pages w

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/11 11:48, Simon Jeons wrote: > On Tue, 2012-12-11 at 04:19 +0100, Andi Kleen wrote: >> On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote: >>> On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote: > Oh, it will be putback to lru list during migration. So does your "some

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 04:19 +0100, Andi Kleen wrote: > On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote: > > On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote: > > > > Oh, it will be putback to lru list during migration. So does your "some > > > > time" mean before call check_new_page?

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> "There are not so many free pages in a typical server system", sorry I don't > quite understand it. Linux tries to keep most memory in caches. As Linus says "free memory is bad memory" > > buffered_rmqueue() > prep_new_page() > check_new_page() > bad_pa

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/11 10:58, Andi Kleen wrote: >> That sounds like overkill. There are not so many free pages in a >> typical server system. > > As Fengguang said -- memory error handling is tricky. Lots of things > could be done in theory, but they all have a cost in testing and > maintenance. > > In

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote: > On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote: > > > Oh, it will be putback to lru list during migration. So does your "some > > > time" mean before call check_new_page? > > > > Yes until the next check_new_page() whenever that i

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote: > > Oh, it will be putback to lru list during migration. So does your "some > > time" mean before call check_new_page? > > Yes until the next check_new_page() whenever that is. If the migration > works it will be earlier, otherwise later. But I

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> Oh, it will be putback to lru list during migration. So does your "some > time" mean before call check_new_page? Yes until the next check_new_page() whenever that is. If the migration works it will be earlier, otherwise later. -andi

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> That sounds like overkill. There are not so many free pages in a > typical server system. As Fengguang said -- memory error handling is tricky. Lots of things could be done in theory, but they all have a cost in testing and maintenance. In general they are only worth doing if the situation is

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Fengguang Wu
On Tue, Dec 11, 2012 at 10:25:00AM +0800, Xishi Qiu wrote: > On 2012/12/10 23:38, Andi Kleen wrote: > > >> It is another topic, I mean since the page is poisoned, so why not isolate > >> it > >> from page buddy allocator in soft_offline_page() rather than in > >> check_new_page(). > >> I find sof

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/10 23:38, Andi Kleen wrote: >> It is another topic, I mean since the page is poisoned, so why not isolate it >> from page buddy allocator in soft_offline_page() rather than in >> check_new_page(). >> I find soft_offline_page() only migrate the page and mark HWPoison, the >> poisoned >>

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 03:03 +0100, Andi Kleen wrote: > > IIUC, soft offlining will isolate and migrate hwpoisoned page, and this > > page will not be accessed by memory management subsystem until unpoison, > > correct? > > No, soft offlining can still allow accesses for some time. It'll never kill

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> IIUC, soft offlining will isolate and migrate hwpoisoned page, and this > page will not be accessed by memory management subsystem until unpoison, > correct? No, soft offlining can still allow accesses for some time. It'll never kill anything. Hard tries much harder and will kill. In some case

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Mon, 2012-12-10 at 16:38 +0100, Andi Kleen wrote: > > It is another topic, I mean since the page is poisoned, so why not isolate > > it > > from page buddy allocator in soft_offline_page() rather than in > > check_new_page(). > > I find soft_offline_page() only migrate the page and mark HWPoiso

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> HWPoison delays any action on buddy allocator pages, handling can be safely > postponed > until a later time when the page might be referenced. By delaying, some > transient errors > may not reoccur or may be irrelevant. That's not true for soft offlining, only for hard. -Andi

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> It is another topic, I mean since the page is poisoned, so why not isolate it > from page buddy allocator in soft_offline_page() rather than in > check_new_page(). > I find soft_offline_page() only migrate the page and mark HWPoison, the > poisoned > page is still managed by page buddy allocator.

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
Cc other guys. On Mon, 2012-12-10 at 20:40 +0800, Xishi Qiu wrote: > On 2012/12/10 19:56, Simon Jeons wrote: > > > On Mon, 2012-12-10 at 19:16 +0800, Xishi Qiu wrote: > >> On 2012/12/10 18:47, Simon Jeons wrote: > >> > >>> On Mon, 2012-12-10 at 17:06 +0800, Xishi Qiu wrote: > On 2012/12/10 1

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Borislav Petkov
On Mon, Dec 10, 2012 at 07:54:53PM +0800, Xishi Qiu wrote: > One more question, can we add a list_head to manage the poisoned pages? What would you need that list for? Also, a list is not the most optimal data structure for when you need to traverse it often. Thanks.

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Mon, 2012-12-10 at 19:16 +0800, Xishi Qiu wrote: > On 2012/12/10 18:47, Simon Jeons wrote: > > > On Mon, 2012-12-10 at 17:06 +0800, Xishi Qiu wrote: > >> On 2012/12/10 16:33, Wanpeng Li wrote: > >> > >>> On Fri, Dec 07, 2012 at 02:11:02PM -0800, Andrew Morton wrote: > On Fri, 7 Dec 2012 16

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/10 19:39, Wanpeng Li wrote: > On Mon, Dec 10, 2012 at 07:16:50PM +0800, Xishi Qiu wrote: >> On 2012/12/10 18:47, Simon Jeons wrote: >> >>> On Mon, 2012-12-10 at 17:06 +0800, Xishi Qiu wrote: On 2012/12/10 16:33, Wanpeng Li wrote: > On Fri, Dec 07, 2012 at 02:11:02PM -0800,

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/10 18:47, Simon Jeons wrote: > On Mon, 2012-12-10 at 17:06 +0800, Xishi Qiu wrote: >> On 2012/12/10 16:33, Wanpeng Li wrote: >> >>> On Fri, Dec 07, 2012 at 02:11:02PM -0800, Andrew Morton wrote: On Fri, 7 Dec 2012 16:48:45 +0800 Xishi Qiu wrote: > On x86 platform, if

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Mon, 2012-12-10 at 17:06 +0800, Xishi Qiu wrote: > On 2012/12/10 16:33, Wanpeng Li wrote: > > > On Fri, Dec 07, 2012 at 02:11:02PM -0800, Andrew Morton wrote: > >> On Fri, 7 Dec 2012 16:48:45 +0800 > >> Xishi Qiu wrote: > >> > >>> On x86 platform, if we use "/sys/devices/system/memory/soft_off

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/10 16:33, Wanpeng Li wrote: > On Fri, Dec 07, 2012 at 02:11:02PM -0800, Andrew Morton wrote: >> On Fri, 7 Dec 2012 16:48:45 +0800 >> Xishi Qiu wrote: >> >>> On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" >>> to offline a >>> free page twice, the value of mce

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-09 Thread Xishi Qiu
On 2012/12/8 6:11, Andrew Morton wrote: > On Fri, 7 Dec 2012 16:48:45 +0800 > Xishi Qiu wrote: > >> On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" to >> offline a >> free page twice, the value of mce_bad_pages will be added twice. So this is >> an error, >> since the

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-07 Thread Borislav Petkov
On Fri, Dec 07, 2012 at 02:11:02PM -0800, Andrew Morton wrote: > A few things: > > - soft_offline_page() already checks for this case: > > if (PageHWPoison(page)) { > unlock_page(page); > put_page(page); > pr_info("soft offline: %#lx page already po

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-07 Thread Andrew Morton
On Fri, 7 Dec 2012 16:48:45 +0800 Xishi Qiu wrote: > On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" to > offline a > free page twice, the value of mce_bad_pages will be added twice. So this is > an error, > since the page was already marked HWPoison, we should skip th

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-07 Thread Borislav Petkov
On Fri, Dec 07, 2012 at 04:48:45PM +0800, Xishi Qiu wrote: > On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" to > offline a > free page twice, the value of mce_bad_pages will be added twice. So this is > an error, > since the page was already marked HWPoison, we should s

[PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-07 Thread Xishi Qiu
On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" to offline a free page twice, the value of mce_bad_pages will be added twice. So this is an error, since the page was already marked HWPoison, we should skip the page and don't add the value of mce_bad_pages. $ cat /proc/