turn panic into data corruption, dangerous patch was Re: [PATCH] [v3] PM / hibernate: Fix hibernation panic caused by inconsistent e820 map

2015-09-16 Thread Pavel Machek
On Wed 2015-09-02 20:06:28, Chen Yu wrote:
> On some platforms, there is occasional panic triggered when trying to
> resume from hibernation, a typical panic looks like:
> 
> BUG: unable to handle kernel paging request at 880085894000
> IP: [] load_image_lzo+0x8c2/0xe70
> 
> This is because e820 map has been changed by BIOS before/after
> hibernation, and one of the page frames from first kernel
> is right located in second kernel's unmapped region, so panic
> comes out when accessing unmapped kernel address.

> After this patch applied, the panic will be replaced with the warning:

and with data corruption.

> PM: Loading and decompressing image data (96092 pages)...
> PM: Image loading progress:   0%
> PM: Image loading progress:  10%
> PM: Image loading progress:  20%
> PM: Image loading progress:  30%
> PM: Image loading progress:  40%
> PM:  0x849dd000 to restored not in valid memory region
> 
> Signed-off-by: Chen Yu 
> ---
> 
> v3:
>  - Adjust the logic to exclude the end_pfn boundary in pfn_mapped
>when invoking mark_valid_pages, because the end_pfn is not
>a mapped page frame, we should not regard it as a valid page.
> 
>Move the sanity check of valid pages to a early stage in resuming
>process(moved to mark_unsafe_pages), in this way, we can avoid
>unnecessarily accessing these invalid pages in later stage(yes,
>move to the original position Joey once introduced in:
>Commit 84c91b7ae07c ("PM / hibernate: avoid unsafe pages in e820
>reserved regions")
> 
>With v3 patch applied, I did 30 cycles on my problematic platform,
>no panic triggered anymore(50% reproducible before patched, by
>plugging/unplugging memory peripheral during hibernation), and it
>just warns of invalid pages.

"Just warns of invalid pages". Do you want to say that you "just cause
data corruption"?

If you don't have enough memory, YOU DON'T RESTORE. Disks were synced,
so not restoring is safe. Running with memory corruption is NOT.

> + if (!swsusp_page_is_valid(pfn_to_page(pfn))) {
> + pr_err(
> + "PM:  %#010llx to restored not in valid memory 
> region\n",
> + (unsigned long long) pfn << PAGE_SHIFT);

And you'd need to fix english here in any case.

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] [v3] PM / hibernate: Fix hibernation panic caused by inconsistent e820 map

2015-09-04 Thread Chen, Yu C

> -Original Message-
> From: Rafael J. Wysocki [mailto:r...@rjwysocki.net]
> Sent: Thursday, September 03, 2015 9:19 AM
> To: Chen, Yu C
> Cc: Brown, Len; pa...@ucw.cz; mi...@redhat.com;
> joeyli.ker...@gmail.com; ying...@kernel.org; Zhang, Rui; linux-
> p...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] [v3] PM / hibernate: Fix hibernation panic caused by
> inconsistent e820 map
> 
> On Wednesday, September 02, 2015 08:06:28 PM Chen Yu wrote:
> 
> Well, looks like an improvement, but I wouldn't be comfortable with pushing
> it to Linus before it spent a fair amount of time in linux-next.
> 
> For this reason, I can queue it up for the next merge window when 4.3-rc1 is
> out.

OK, thanks!

Best Regards,
Yu
> 
> Thanks,
> Rafael

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

Re: [PATCH] [v3] PM / hibernate: Fix hibernation panic caused by inconsistent e820 map

2015-09-02 Thread Rafael J. Wysocki
On Wednesday, September 02, 2015 08:06:28 PM Chen Yu wrote:
> On some platforms, there is occasional panic triggered when trying to
> resume from hibernation, a typical panic looks like:
> 
> BUG: unable to handle kernel paging request at 880085894000
> IP: [] load_image_lzo+0x8c2/0xe70
> 
> This is because e820 map has been changed by BIOS before/after
> hibernation, and one of the page frames from first kernel
> is right located in second kernel's unmapped region, so panic
> comes out when accessing unmapped kernel address.
> 
> Commit 84c91b7ae07c ("PM / hibernate: avoid unsafe pages in e820 reserved
> regions") was once introduced to fix this problem: to warn on the change
> on BIOS e820 and deny the resuming process, thus avoid the panic
> afterwards. However, this patch makes resuming from hibernation on Lenovo
> x230 failed, and the reason for it is that, this patch can not deal with
> unaligned E820_RESERVED_KERN regions and fails to resume from hibernation:
> https://bugzilla.kernel.org/show_bug.cgi?id=96111
> As a result, this patch is reverted.
> 
> To solve this hibernation panic issue fundamentally, we need to get rid of
> the impact of E820_RESERVED_KERN, so Yinghai,Lu proposes a patch to kill
> E820_RESERVED_KERN and based on his patch we can re-apply
> Commit 84c91b7ae07c ("PM / hibernate: avoid unsafe pages in e820 reserved
> regions"), and stress testing has been performed on problematic platform
> with above two patches applied, it works as expected, no panic anymore.
> 
> However, there is still one thing left, hibernation might fail even after
> above two patches applied, with the following warnning in log:
> 
> PM: Image mismatch: memory size
> 
> This is also because BIOS provides different e820 memory map before/after
> hibernation, thus different memory pages, and linux regards different
> number of memory pages as invalid process and refuses to resume, in order
> to protect against data corruption. However, this check might be too
> strict, consider the following scenario:
> The hibernating system has a smaller memory capacity than the resuming
> system, and the former memory region is a subset of the latter, it should
> be allowed to resume. Here is a case for this situation:
> 
> before hibernation:
> 
> BIOS-e820: [mem 0x2020-0x77517fff] usable
> BIOS-e820: [mem 0x77518000-0x77567fff] reserved
> Memory: 3871356K/4058428K available (7595K kernel code, 1202K rwdata,
> 3492K rodata, 1400K init, 1308K bss, 187072K reserved, 0K cma-reserved)
> 
> after hibernation:
> BIOS-e820: [mem 0x2020-0x7753] usable
> BIOS-e820: [mem 0x7754-0x77567fff] reserved
> Memory: 3871516K/4058588K available (7595K kernel code, 1202K rwdata,
> 3492K rodata, 1400K init, 1308K bss, 187072K reserved, 0K cma-reserved)
> 
> According to above data, the number of present_pages has increased by
> 40(thus 160K), linux will terminate the resuming process. But since
> [0x2020-0x77517fff] is a subset of
> [0x2020-0x7753], we should let system resume.
> 
> Since above two patches can not deal with the hibernation failor, another
> solution to fix both hibernation panic and hibernation failor is proposed
> as follows:
> We simply check that, if each non-highmem page frame to be restored is a
> valid mapped kernel page(by checking if this page is in pfn_mapped
> array in arch/x86/mm/init.c), if it is, resuming process will continue.
> In this way we do not have to touch E820_RESERVED_KERN, and we can:
> 1.prevent the hibernation panic caused by unmapped-page address
> accessing
> 2.remove the code that requires the same memory size before/after
> hibernation.
> 
> Note: for point 2, this patch only works on x86_64 platforms
> (with no highmem), because the highmem page frames on x86_32
> are not directly-mapped by kernel, which is out of the scope
> of pfn_mapped, this patch will not guarantee that whether the
> higmem region is legal for restore. A further work might include
> a logic to check if each page frame to be restored is in E820_RAM
> region, but it might require quite neat checkings in the code.
> For now, just solve the problem reported on x86_64.
> 
> After this patch applied, the panic will be replaced with the warning:
> 
> PM: Loading and decompressing image data (96092 pages)...
> PM: Image loading progress:   0%
> PM: Image loading progress:  10%
> PM: Image loading progress:  20%
> PM: Image loading progress:  30%
> PM: Image loading progress:  40%
> PM:  0x849dd000 to restored not in valid memory region
> 
> Signed-off-by: Chen Yu 

Well, looks like an improvement, but I wouldn't be comfortable with
pushing it to Linus before it spent a fair amount of time in linux-next.

For this reason, I can queue it up for the next merge window when 4.3-rc1
is out.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ma