[PATCH] kexec: remove useless code in kimage_alloc_page

2020-02-04 Thread Liu Song
From: Liu Song 

"addr = old_addr" has no effect, so remove it.

Signed-off-by: Liu Song 
---
 kernel/kexec_core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 15d70a90b50d..09c60c9347b1 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -761,7 +761,6 @@ static struct page *kimage_alloc_page(struct kimage *image,
 				kimage_free_pages(old_page);
 				continue;
 			}
-			addr = old_addr;
 			page = old_page;
 			break;
 		}
-- 
2.20.1
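
Why the assignment is dead can be seen in a stripped-down model of the loop.
The sketch below is not the kernel code: `pick_page`, `struct fake_page` and
the `keep_dead_store` flag are hypothetical stand-ins. The only property
assumed from kimage_alloc_page() is the one the patch description relies on:
the loop is left via `break` and the function returns `page` without reading
`addr` again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: it only remembers an address. */
struct fake_page {
	unsigned long addr;
};

/*
 * Simplified model of the end of the allocation loop.
 * keep_dead_store selects whether the removed assignment is executed.
 */
static struct fake_page *pick_page(struct fake_page *fresh,
				   struct fake_page *old_page,
				   int keep_dead_store)
{
	struct fake_page *page = fresh;
	unsigned long addr = fresh->addr;

	while (1) {
		if (old_page) {
			if (keep_dead_store)
				addr = old_page->addr;	/* the removed line */
			page = old_page;
			break;
		}
		break;	/* nothing to swap: keep the fresh page */
	}

	(void)addr;	/* addr never influences the returned page */
	return page;
}
```

Calling `pick_page()` with `keep_dead_store` set to 0 or 1 returns the same
pointer in every case, which is all the patch claims.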


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] makedumpfile: cope with not-present mem section

2020-02-04 Thread Thadeu Lima de Souza Cascardo
On Tue, Feb 04, 2020 at 02:24:17PM +0800, piliu wrote:
> Hi,
> 
> Sorry to reply late due to a long festival.
> 
> I have tested this patch against v4.15 and the latest kernel, with small
> modifications to reproduce the situation we discussed here. Both work fine.
> 
> The modifications to the two kernels are listed below.
> 
> test1. latest kernel with two extra modifications to expose the problem
> -1.1. revert commit 1f503443e7df8dc8366608b4d810ce2d6669827c
> (mm/sparse.c: reset section's mem_map when fully deactivated); this
> commit works around this bug
> -1.2. revert commit a0b1280368d1e91ab72f849ef095b4f07a39bbf1 ("kdump:
> write correct address of mem_section into vmcoreinfo"). This will create
> the buggy situation we discussed here.
> -1.3. fix the build breakage caused by reverting
> a0b1280368d1e91ab72f849ef095b4f07a39bbf1
> 
> test2. v4.15, which includes both commit 83e3c48729d9 and a0b1280368d1.
> -2.1. revert commit a0b1280368d1e91ab72f849ef095b4f07a39bbf1 ("kdump:
> write correct address of mem_section into vmcoreinfo")
> 
> So I cannot see any problem with my patch.
> Maybe I misunderstand the discussion, but I cannot see how my original
> patch would break a kernel which has 83e3c48729d9 but not a0b1280368d1.
> 
> Thanks,
> Pingfan
> 

You also need to test the case where 83e3c48729d9 is not present at all. Can
you test on a 4.4 kernel, for example? As far as I understand, a vanilla 4.4
kernel would not be dumpable with your patch.

Thanks.
Cascardo.

> On 01/29/2020 03:33 AM, Thadeu Lima de Souza Cascardo wrote:
> > On Tue, Jan 28, 2020 at 05:03:12PM +, HAGIO KAZUHITO(萩尾 一仁) wrote:
> >> Hi Cascardo,
> >>
> >>> -Original Message-
> >>> On Mon, Jan 27, 2020 at 02:04:54PM -0300, Thadeu Lima de Souza Cascardo wrote:
> >>>> Sorry for taking too long to respond, as I was on vacation.
> >>>>
> >>>> The kernels that had commit 83e3c48729d9, but not commit a0b1280368d1, are
> >>>> not supported anymore. In a way that it's even hard for me to test them.
> >>>>
> >>>> However, I managed to test it, and those two lines are definitely needed
> >>>> to dump a VM running such a kernel. Is removing them really needed to fix
> >>>> this issue?
> >>>>
> >>>> Otherwise, I would rather keep them.
> >>>>
> >>>> Thanks.
> >>>> Cascardo.
> >>>
> >>> By the way, I was too fast in sending this. We really need to keep those
> >>> lines, as makedumpfile will fail to dump a 4.4 kernel with this patch as is.
> >>
> >> Is that Ubuntu 4.4 kernel which has 83e3c48729d9 and not a0b1280368d1?
> >> Could you elaborate on how it fails?
> > 
> > No, it doesn't have either, so my guess is it would fail on upstream 4.4 as
> > well, so anything that doesn't have 83e3c48729d9.
> > 
> > That's what I get on that 4.4 kernel (4.4.0-171-generic):
> > 
> > # ./makedumpfile /proc/vmcore ../dump
> > get_mem_section: Could not validate mem_section.
> > get_mm_sparsemem: Can't get the address of mem_section.
> > 
> > makedumpfile Failed.
> > #
> > 
> > So, now, I have a better grasp of the whole logic, and understand why it
> > fails with this patch.
> > 
> > So, we need to interpret mem_section either as a pointer to the array or as
> > a pointer to the pointer to the array. The second option is only valid when
> > sparse_extreme is on, so we don't even need to check the second case when
> > it's off.
> > 
> > Then, we check that interpreting it either way is valid. If it's valid in
> > both interpretations, we can't decide which to use, and will fail. So far,
> > we haven't seen any case in the field where that would accidentally happen.
> > But in case it does, we should rather fail to dump and fall back to copying
> > than create a bogus compressed dump.
> > 
> > When this patch is applied to a kernel which does not have 83e3c48729d9,
> > and thus has mem_section as a direct pointer to the array, it so happens
> > that we don't detect the pointer-to-pointer-to-the-array case as invalid,
> > thus failing to dump.
> > 
> > The way we validate is that the mem_maps should either have the PRESENT bit
> > or be NULL. Now, that assumption may be invalid, and we would need to do the
> > validation some other way. I can test the cases where that assumption is
> > invalid in a 4.4 kernel and see how to fix this in a satisfactory way.
> > 
> > Going through the code once again, I don't see how the second section of the
> > patch would be correct by itself either. I will think it through a little
> > more and see if I can come up with a solution.
> > 
> > Regards.
> > Cascardo.
> > 
> >>
> >> I'm thinking that Pingfan's thought may help:
> >>>> I think it could be if/else, no need to call twice.
> >>
> >> Thanks,
> >> Kazu
> >>
> >>>
> >>> Cascardo.
> > 
> 
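
The validation rule and the "valid both ways" failure described in the quoted
explanation can be modelled in a few lines of C. This is only a sketch under
stated assumptions, not makedumpfile's implementation: `PRESENT_BIT`,
`entries_plausible()` and `choose_layout()` are hypothetical names, bit 0 is
assumed to be the PRESENT marker, and the two input arrays stand for the
mem_map values obtained under each interpretation of mem_section.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: bit 0 of an encoded mem_map marks a present section. */
#define PRESENT_BIT 0x1UL

enum layout { LAYOUT_UNKNOWN, LAYOUT_DIRECT, LAYOUT_POINTER };

/* An entry is plausible if it is NULL or carries the PRESENT bit. */
static int entries_plausible(const unsigned long *mem_maps, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (mem_maps[i] != 0 && !(mem_maps[i] & PRESENT_BIT))
			return 0;
	return 1;
}

/*
 * Decide between the two readings of mem_section.
 * as_direct:  entries read as "pointer to the array"
 * as_pointer: entries read as "pointer to the pointer to the array";
 *             only looked at when sparse_extreme is set.
 */
static enum layout choose_layout(const unsigned long *as_direct,
				 const unsigned long *as_pointer,
				 size_t n, int sparse_extreme)
{
	int direct_ok = entries_plausible(as_direct, n);
	int pointer_ok = sparse_extreme && entries_plausible(as_pointer, n);

	if (direct_ok == pointer_ok)
		return LAYOUT_UNKNOWN;	/* both or neither: refuse to guess */
	return direct_ok ? LAYOUT_DIRECT : LAYOUT_POINTER;
}
```

In this model the reported 4.4 failure corresponds to `choose_layout()`
returning `LAYOUT_UNKNOWN` because the wrong reading also happens to look
plausible once the check is loosened.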



RE: [PATCH] makedumpfile: cope with not-present mem section

2020-02-04 Thread 萩尾 一仁
> -Original Message-
> On Tue, Feb 04, 2020 at 02:24:17PM +0800, piliu wrote:
> > Hi,
> >
> > Sorry to reply late due to a long festival.
> >
> > I have tested this patch against v4.15 and the latest kernel, with small
> > modifications to reproduce the situation we discussed here. Both work fine.
> >
> > The modifications to the two kernels are listed below.
> >
> > test1. latest kernel with two extra modifications to expose the problem
> > -1.1. revert commit 1f503443e7df8dc8366608b4d810ce2d6669827c
> > (mm/sparse.c: reset section's mem_map when fully deactivated); this
> > commit works around this bug
> > -1.2. revert commit a0b1280368d1e91ab72f849ef095b4f07a39bbf1 ("kdump:
> > write correct address of mem_section into vmcoreinfo"). This will create
> > the buggy situation we discussed here.
> > -1.3. fix the build breakage caused by reverting
> > a0b1280368d1e91ab72f849ef095b4f07a39bbf1
> >
> > test2. v4.15, which includes both commit 83e3c48729d9 and a0b1280368d1.
> > -2.1. revert commit a0b1280368d1e91ab72f849ef095b4f07a39bbf1 ("kdump:
> > write correct address of mem_section into vmcoreinfo")
> >
> > So I cannot see any problem with my patch.
> > Maybe I misunderstand the discussion, but I cannot see how my original
> > patch would break a kernel which has 83e3c48729d9 but not a0b1280368d1.
> >
> > Thanks,
> > Pingfan
> >
> 
> You also need to test the case where 83e3c48729d9 is not present at all. Can
> you test on a 4.4 kernel, for example? As far as I understand, a vanilla 4.4
> kernel would not be dumpable with your patch.

As far as I've tested this patch with SPARSEMEM_EXTREME vmcores below, it's OK:
  - 51 vmcores of vanilla kernels (each from 2.6.36 through 5.5) on hand
  - one more vanilla 4.4.0 kernel with a different config from the above

So apparently not all vanilla 4.4 kernels are affected by the patch.

> 
> Thanks.
> Cascardo.
> 
> > On 01/29/2020 03:33 AM, Thadeu Lima de Souza Cascardo wrote:
> > > On Tue, Jan 28, 2020 at 05:03:12PM +, HAGIO KAZUHITO wrote:
> > >> Hi Cascardo,
> > >>
> > >>> -Original Message-
> > >>> On Mon, Jan 27, 2020 at 02:04:54PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > >>>> Sorry for taking too long to respond, as I was on vacation.
> > >>>>
> > >>>> The kernels that had commit 83e3c48729d9, but not commit a0b1280368d1, are
> > >>>> not supported anymore. In a way that it's even hard for me to test them.
> > >>>>
> > >>>> However, I managed to test it, and those two lines are definitely needed
> > >>>> to dump a VM running such a kernel. Is removing them really needed to fix
> > >>>> this issue?
> > >>>>
> > >>>> Otherwise, I would rather keep them.
> > >>>>
> > >>>> Thanks.
> > >>>> Cascardo.
> > >>>
> > >>> By the way, I was too fast in sending this. We really need to keep those
> > >>> lines, as makedumpfile will fail to dump a 4.4 kernel with this patch as is.
> > >>
> > >> Is that Ubuntu 4.4 kernel which has 83e3c48729d9 and not a0b1280368d1?
> > >> Could you elaborate on how it fails?
> > >
> > > No, it doesn't have either, so my guess is it would fail on upstream 4.4
> > > as well, so anything that doesn't have 83e3c48729d9.
> > >
> > > That's what I get on that 4.4 kernel (4.4.0-171-generic):
> > >
> > > # ./makedumpfile /proc/vmcore ../dump
> > > get_mem_section: Could not validate mem_section.
> > > get_mm_sparsemem: Can't get the address of mem_section.
> > >
> > > makedumpfile Failed.
> > > #

Thanks for the information.
I guess that your 4.4 kernel and machine get a false-positive result (TRUE)
from the second validate_mem_section() with this patch, right?

If we don't have a way to exactly determine whether a mem_section is real
or not, we might have to accept some tradeoff here.  For example, a workaround
which I think of is something like this:

ret = validate_mem_section(SYMBOL(mem_section));
if (!ret && is_sparsemem_extreme()) {
  ...
  ret = validate_mem_section(mem_section_ptr);
  if (!ret)
    ERRMSG("Could not determine the valid mem_section.\n");
}

with Pingfan's patch.  This will work for the false-positive failure you hit (if
so), but may affect some downstream kernels which have 83e3c48729d9 and do not
have a0b1280368d1.  But at least there is no upstream kernel like that.

Any other solution?

Thanks,
Kazu
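
The ordering proposed in the workaround, and the tradeoff it accepts, can be
written out as a small decision function. This is a sketch only:
`pick_mem_section()` and `enum pick` are hypothetical names, and the two
`*_valid` arguments stand for the results that validate_mem_section() would
give for each reading of mem_section.

```c
#include <assert.h>

enum pick { PICK_NONE, PICK_DIRECT, PICK_POINTER };

/*
 * Try the direct reading first; fall back to the dereferenced pointer
 * only if the direct reading fails and SPARSEMEM_EXTREME is in use.
 */
static enum pick pick_mem_section(int direct_valid, int pointer_valid,
				  int sparse_extreme)
{
	if (direct_valid)
		return PICK_DIRECT;
	if (sparse_extreme && pointer_valid)
		return PICK_POINTER;
	return PICK_NONE;	/* could not determine a valid mem_section */
}
```

When both readings validate, this order always picks the direct one. That
avoids the ambiguity failure, at the cost of choosing wrongly for a kernel
whose mem_section is really a pointer but whose direct reading also happens
to validate.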

> > >
> > > So, now, I have a better grasp of the whole logic, and understand why it
> > > fails with this patch.
> > >
> > > So, we need to interpret mem_section either as a pointer to the array or
> > > as a pointer to the pointer to the array. The second option is only valid
> > > when sparse_extreme is on, so we don't even need to check the second case
> > > when it's off.
> > >
> > > Then, we check that interpreting it either way is valid. If it's valid in
> > > both interpretations, we can't decide which to use, and will fail. So far,
> > > we haven't seen any case in the field where that would accidentally happen.
> > 

Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread Sergey Senozhatsky
On (20/02/05 12:25), lijiang wrote:
> Hi, John Ogness
> 
> Thank you for improving the patch series and making great efforts.
> 
> I'm not sure if I missed anything else. Or are there any other related 
> patches to be applied?
> 
> After applying this patch series, NMI watchdog detected a hard lockup, which 
> caused that kernel can not boot, please refer to
> the following call trace. And I put the complete kernel log in the attachment.

I'm also having some problems running the code on my laptop. But maybe
I did something wrong while applying patch 0002 (which didn't apply
cleanly). Will look more.

-ss



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread Sergey Senozhatsky
On (20/02/05 12:25), lijiang wrote:
[..]
> [   42.111004] Kernel Offset: 0x1f00 from 0x8100 (relocation 
> range: 0x8000-0xbfff)
> [   42.111005] general protection fault:  [#1] SMP PTI
> [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ 
> #4
> [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
> SE5C610.86B.01.01.6024.071720181717 07/17/2018
> [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
> [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 
> cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 
> 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
> [   42.111007] RSP: 0018:bbe207a7bd80 EFLAGS: 00010002
> [   42.111007] RAX: a075d44ca000 RBX: 00a8 RCX: 
> fff000b0
> [   42.111008] RDX: 00a8 RSI: 0f01 RDI: 
> a1456e00
> [   42.111008] RBP: 0801364600307073 R08: 2000 R09: 
> 0801364600307073
> [   42.111008] R10: fff0 R11: 00a8 R12: 
> a1e98330
> [   42.111009] R13: d7efbe00 R14: 00a8 R15: 
> c000
> [   42.111009] FS:  7f7c5642a980() GS:a075df5c() 
> knlGS:
> [   42.111010] CS:  0010 DS:  ES:  CR0: 80050033
> [   42.111010] CR2: 7ffe95f4c4c0 CR3: 00084fbfc004 CR4: 
> 003606e0
> [   42.111011] DR0:  DR1:  DR2: 
> 
> [   42.111011] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   42.111012] Call Trace:
> [   42.111012]  _prb_read_valid+0xd8/0x190
> [   42.111012]  prb_read_valid+0x15/0x20
> [   42.111013]  devkmsg_read+0x9d/0x2a0
> [   42.111013]  vfs_read+0x91/0x140
> [   42.111013]  ksys_read+0x59/0xd0
> [   42.111014]  do_syscall_64+0x55/0x1b0
> [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   42.111014] RIP: 0033:0x7f7c55740b62
> [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 
> 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 
> 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
> [   42.111015] RSP: 002b:7ffe95f4c4a8 EFLAGS: 0246 ORIG_RAX: 
> 
> [   42.111016] RAX: ffda RBX: 7ffe95f4e500 RCX: 
> 7f7c55740b62
> [   42.111016] RDX: 2000 RSI: 7ffe95f4c4b0 RDI: 
> 0008
> [   42.111017] RBP:  R08: 0100 R09: 
> 0003
> [   42.111017] R10: 0100 R11: 0246 R12: 
> 7ffe95f4c4b0

So there is a General protection fault. That's the type of a problem that
kills the boot for me as well (different backtrace, tho).

-ss



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread Sergey Senozhatsky
On (20/02/05 13:48), Sergey Senozhatsky wrote:
> On (20/02/05 12:25), lijiang wrote:
> [..]
> > [   42.111004] Kernel Offset: 0x1f00 from 0x8100 
> > (relocation range: 0x8000-0xbfff)
> > [   42.111005] general protection fault:  [#1] SMP PTI
> > [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 
> > 5.5.0-rc7+ #4
> > [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
> > SE5C610.86B.01.01.6024.071720181717 07/17/2018
> > [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
> > [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 
> > 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 
> > <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
> > [   42.111007] RSP: 0018:bbe207a7bd80 EFLAGS: 00010002
> > [   42.111007] RAX: a075d44ca000 RBX: 00a8 RCX: 
> > fff000b0
> > [   42.111008] RDX: 00a8 RSI: 0f01 RDI: 
> > a1456e00
> > [   42.111008] RBP: 0801364600307073 R08: 2000 R09: 
> > 0801364600307073
> > [   42.111008] R10: fff0 R11: 00a8 R12: 
> > a1e98330
> > [   42.111009] R13: d7efbe00 R14: 00a8 R15: 
> > c000
> > [   42.111009] FS:  7f7c5642a980() GS:a075df5c() 
> > knlGS:
> > [   42.111010] CS:  0010 DS:  ES:  CR0: 80050033
> > [   42.111010] CR2: 7ffe95f4c4c0 CR3: 00084fbfc004 CR4: 
> > 003606e0
> > [   42.111011] DR0:  DR1:  DR2: 
> > 
> > [   42.111011] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [   42.111012] Call Trace:
> > [   42.111012]  _prb_read_valid+0xd8/0x190
> > [   42.111012]  prb_read_valid+0x15/0x20
> > [   42.111013]  devkmsg_read+0x9d/0x2a0
> > [   42.111013]  vfs_read+0x91/0x140
> > [   42.111013]  ksys_read+0x59/0xd0
> > [   42.111014]  do_syscall_64+0x55/0x1b0
> > [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [   42.111014] RIP: 0033:0x7f7c55740b62
> > [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 
> > 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 
> > <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
> > [   42.111015] RSP: 002b:7ffe95f4c4a8 EFLAGS: 0246 ORIG_RAX: 
> > 
> > [   42.111016] RAX: ffda RBX: 7ffe95f4e500 RCX: 
> > 7f7c55740b62
> > [   42.111016] RDX: 2000 RSI: 7ffe95f4c4b0 RDI: 
> > 0008
> > [   42.111017] RBP:  R08: 0100 R09: 
> > 0003
> > [   42.111017] R10: 0100 R11: 0246 R12: 
> > 7ffe95f4c4b0
> 
> So there is a General protection fault. That's the type of a problem that
> kills the boot for me as well (different backtrace, tho).

Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?

-ss



Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread lijiang


> On (20/02/05 13:48), Sergey Senozhatsky wrote:
>> On (20/02/05 12:25), lijiang wrote:
>> [..]
>>> [   42.111004] Kernel Offset: 0x1f00 from 0x8100 
>>> (relocation range: 0x8000-0xbfff)
>>> [   42.111005] general protection fault:  [#1] SMP PTI
>>> [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 
>>> 5.5.0-rc7+ #4
>>> [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
>>> SE5C610.86B.01.01.6024.071720181717 07/17/2018
>>> [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
>>> [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 
>>> 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 
>>> <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
>>> [   42.111007] RSP: 0018:bbe207a7bd80 EFLAGS: 00010002
>>> [   42.111007] RAX: a075d44ca000 RBX: 00a8 RCX: 
>>> fff000b0
>>> [   42.111008] RDX: 00a8 RSI: 0f01 RDI: 
>>> a1456e00
>>> [   42.111008] RBP: 0801364600307073 R08: 2000 R09: 
>>> 0801364600307073
>>> [   42.111008] R10: fff0 R11: 00a8 R12: 
>>> a1e98330
>>> [   42.111009] R13: d7efbe00 R14: 00a8 R15: 
>>> c000
>>> [   42.111009] FS:  7f7c5642a980() GS:a075df5c() 
>>> knlGS:
>>> [   42.111010] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   42.111010] CR2: 7ffe95f4c4c0 CR3: 00084fbfc004 CR4: 
>>> 003606e0
>>> [   42.111011] DR0:  DR1:  DR2: 
>>> 
>>> [   42.111011] DR3:  DR6: fffe0ff0 DR7: 
>>> 0400
>>> [   42.111012] Call Trace:
>>> [   42.111012]  _prb_read_valid+0xd8/0x190
>>> [   42.111012]  prb_read_valid+0x15/0x20
>>> [   42.111013]  devkmsg_read+0x9d/0x2a0
>>> [   42.111013]  vfs_read+0x91/0x140
>>> [   42.111013]  ksys_read+0x59/0xd0
>>> [   42.111014]  do_syscall_64+0x55/0x1b0
>>> [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [   42.111014] RIP: 0033:0x7f7c55740b62
>>> [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 
>>> 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 
>>> <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
>>> [   42.111015] RSP: 002b:7ffe95f4c4a8 EFLAGS: 0246 ORIG_RAX: 
>>> 
>>> [   42.111016] RAX: ffda RBX: 7ffe95f4e500 RCX: 
>>> 7f7c55740b62
>>> [   42.111016] RDX: 2000 RSI: 7ffe95f4c4b0 RDI: 
>>> 0008
>>> [   42.111017] RBP:  R08: 0100 R09: 
>>> 0003
>>> [   42.111017] R10: 0100 R11: 0246 R12: 
>>> 7ffe95f4c4b0
>>
>> So there is a General protection fault. That's the type of a problem that
>> kills the boot for me as well (different backtrace, tho).
> 
> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
> 

Yes. These two options are enabled.

CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y

Thanks.

>   -ss
> 




Re: [PATCH 0/2] printk: replace ringbuffer

2020-02-04 Thread Sergey Senozhatsky
On (20/02/05 13:38), lijiang wrote:
> > On (20/02/05 13:48), Sergey Senozhatsky wrote:
> >> On (20/02/05 12:25), lijiang wrote:

[..]

> >>
> >> So there is a General protection fault. That's the type of a problem that
> >> kills the boot for me as well (different backtrace, tho).
> > 
> > Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
> > 
> 
> Yes. These two options are enabled.
> 
> CONFIG_RELOCATABLE=y
> CONFIG_RANDOMIZE_BASE=y

So KASLR kills the boot for me. So does KASAN.

John, do you see any of these problems on your test machine?

-ss



Re: mgag200 fails kdump kernel booting

2020-02-04 Thread Baoquan He
Hi Dave, Lyude,

On 07/02/19 at 06:51am, David Airlie wrote:
> On Wed, Jun 26, 2019 at 6:29 PM Baoquan He  wrote:
> >
> > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > Hi Dave,
> > >
> > > We hit a kdump kernel boot failure on a Lenovo system. The kdump kernel
> > > failed to boot and simply reset to firmware, rebooting the system. Nothing
> > > was printed out.
> > >
> > > The machine is a big server with 6T of memory and many CPUs; its graphics
> > > driver module is mgag200.
> > >
> > > When added 'earlyprintk=ttyS0' into kernel command line, it printed
> > > out only one line to console during kdump kernel booting:
> > >  KASLR disabled: 'nokaslr' on cmdline.
> > >
> > > Then reset to firmware to reboot system.
> > >
> > > By further code debugging, the failure happened in
> > > arch/x86/boot/compressed/misc.c, during kernel decompressing stage. It's
> > > triggered by the vga printing. As you can see, in __putstr() of
> > > arch/x86/boot/compressed/misc.c, the code checks if earlyprintk= is
> > > specified, and print out to the target. And no matter if earlyprintk= is
> > > added or not, it will print to VGA. And printing to VGA caused it to
> > > reset to firmware. That's why we see nothing when didn't specify
> > > earlyprintk=, but see only one line of printing about the 'KASLR
> > > disabled'.
> >
> > Here I mean:
> > That's why we see nothing when didn't specify earlyprintk=, but see only
> > one line of printing about the 'KASLR disabled' message when
> > earlyprintk=ttyS0 added.
> 
> Just to clarify, the original kernel is booted with mgag200 turned
> off, then kexec works, but if the original kernel loads mgag200, the
> kexec kernels resets hard when the VGA is used to write stuff out.
> 
> This *might* be fixable in the controlled kexec case, by having an
> mgag200 shutdown path that tries to put the gpu back into a state
> where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> problem, since once the gpu is up and running and VGA is disabled, it
> doesn't expect to see any more VGA transactions.

Now we have got two more bug reports on different systems and, after
debugging, finally figured out that they are the same issue as this one.
Adding 'nomodeset' works around it.

With help from our QA, we tried more systems with mgag200. It seems not
all of them have this issue; some of them can jump to the kdump kernel
fine after a panic.

Any suggestions on how to proceed? I can experiment. Or, if you would
like to have a look when convenient, I can get one system to you to
check. Or can we just use 'nomodeset' as a workaround and put this
issue on hold for the time being?

I would appreciate any suggestions or ideas.

> 
> Dave.
> >
> > >
> > > To confirm it's caused by VGA printing, I blacklisted mgag200 by
> > > writing it into /etc/modprobe.d/blacklist.conf. The kdump kernel can
> > > then boot up successfully. Adding 'nomodeset' also makes it work. So the
> > > mgag200 driver or related code surely does something wrong when the
> > > booting code tries to re-init it.
> > >
> > > This is the only case we have ever seen, so we tend to pursue a fix on the
> > > mgag200 driver side. Any idea or suggestion? We have two machines that can
> > > reproduce it stably.
> > >
> > > Thanks
> > > Baoquan

