Re: [PATCH 0/3] Enable legacy irq mode before jump to kexec/kdump kernel
Hi Jiangang, On 07/20/16 at 03:54am, Wei, Jiangang wrote: > Hi Baoquan He, > > Well, Indeed there‘s a relationship between the dump-capture hangs in > calibrate_delay_converge() and the interrupt mode. > > but there‘s no essential difference between your patches and mine that > calls disable_IO_APIC() again. > Actually, disable_IO_APIC will set APIC to virtual wire mode. You didn't read my patch log carefully. disable_IO_APIC not only set APIC to PIC mode or virtual wire mode, but call clear_IO_APIC. > > In fact, > Eric and Ingo suggested that "it should be fixed in the bootup path of > the dump kernel, not the crash kernel reboot path", which is convincing > and reasonable. Well this patch doesn't do differently with Eric's original implemention in kexec/kdump path. By taking out clear_IO_APIC from disable_IO_APIC, the left code of disable_IO_APIC will only do the virtual wire setting. So for kexec/kdump path, code basically is the same as Eric's method. But for poweroff/halt/reboot, it's enough to call clear_IO_APIC to disable IO-APIC. > > And i find a better method can fix the problem. > It's better to set virtual wire mode for apic in init_bsp_APIC(), which > in the bootup path of dump kernel. > But now, init_bsp_APIC doesn't initialize the apic to vitual wire mode > when smp_found_config is non-zero. And virtual wire mode have two kinds, IO-APIC virtual wire mode and LAPIC virtual wire mode. Please read code comments in init_bsp_APIC, IO-APIC virtual wire mode could be active or need be set, you can't detect IO-APIC pin connected to i8259 equvialent PIC. > > FYI, I'm working on this point. later i will send patches to mail list. > > Wei > > On Wed, 2016-07-20 at 10:58 +0800, Baoquan He wrote: > > Wei Jiangang reported kdump kernel always hang when "notsc" is specified > > in boot parameter. After debugging I found there's no timer interrupt > > in the current kexec/kdump kernel. This is caused by commit 522e66464467 > > ("x86/apic: Disable I/O APIC before shutdown of the local APIC"). Originally > > Eric posted below patch to make system be virtual wire mode in which 8259- > > equivalent PIC fields all interrupts and the LAPIC becomes a virtual wire. > > Like this interrupts can be delivered from PIC to CPU via the LAPIC's local > > interrupt 0 (LINTIN0). In virtual wire APIC mode is disabled while LAPIC > > is software enabled and its LINT0 and LINT1 need be programmed specifically. > > > > https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11/2.6.11-mm1/broken-out/x86_64-apic-virtwire-on-shutdown.patch > > > > But with commit 522e66464 you can see after disable_IO_APIC had setting > > virtual wire mode, lapic_shutdown disabled LAPIC again. Now virtual wire > > mode doesn't work, then it cause no timer interrupt during kdump kernel > > initialization stage until system enter into APIC mode. > > > > So people may be wondering why only kdump kernel hang, the normal kernel > > with "notsc" can still work. This is because BIOS has already built PIC mode > > or virtual wire mode while kexec/kdump kernel doesn't go through BIOS > > initialization. That is why we have to change system to be PIC mode or > > virtual wire mode before jump to kexec/kdump kernel. > > > > Then why kdump kernel didn't hang when "notsc" is not specified. This is > > because tsc_init will assign the already calibrated value to lpj_fine. > > Then kernel doesn't need to count cpu loops between jiffies with the help > > of timer interrupt. So "notsc" is not victim, but a informer. > > > > In patch 1/3 disable_IO_APIC is changed to only contain code of changeing > > system to be PIC mode or virtual wire mode and is renamed as > > switch_to_legacy_irq_mode. Now only call clear_IO_APIC where IO-APIC need > > be disabled, and call switch_to_legacy_irq_mode before jump to kexe/kdump > > kernel. > > > > Patch 2/3 and 3/3 are clean up patch. > > > > Baoquan He (3): > > x86/apic/kexec: Enable legacy irq mode before jump to kexec/kdump > > kernel > > x86/apic: Clean up the names of legacy irq mode setting related > > functions > > x86/apic: Clean up the apic delivery mode macro definition > > > > arch/x86/include/asm/apic.h| 2 +- > > arch/x86/include/asm/apicdef.h | 1 - > > arch/x86/include/asm/io_apic.h | 6 +++--- > > arch/x86/kernel/apic/apic.c| 19 +++ > > arch/x86/kernel/apic/io_apic.c | 32 +--- > > arch/x86/kernel/crash.c| 2 +- > > arch/x86/kernel/machine_kexec_32.c | 15 +-- > > arch/x86/kernel/machine_kexec_64.c | 15 +-- > > arch/x86/kernel/reboot.c | 2 +- > > arch/x86/kernel/x86_init.c | 2 +- > > drivers/iommu/irq_remapping.c | 2 +- > > 11 files changed, 46 insertions(+), 52 deletions(-) > > > > > > ___ > kexec mailing list > ke...@lists.infradead.org > http://lists.infradead.org/mailman/li
Re: [PATCH 0/3] Enable legacy irq mode before jump to kexec/kdump kernel
On 07/20/16 at 03:54am, Wei, Jiangang wrote: > Hi Baoquan He, > > Well, Indeed there‘s a relationship between the dump-capture hangs in > calibrate_delay_converge() and the interrupt mode. > > but there‘s no essential difference between your patches and mine that > calls disable_IO_APIC() again. > Actually, disable_IO_APIC will set APIC to virtual wire mode. Well, about this I want to explain a little bit. Usually people posted a patch, reviewers can give comments, suggestions or other ideas. During reviewing stage patch author need answer questions from people interested. So this is an interactive action, reviewers can also learn knowledge, meanwhile give comments. When reviewing your patch, I have many questions, and only get two pieces of information, kdump kernel hang with notsc, and disable_IO_APIC can save it. You even can't answer people's question why disable_IO_APIC need be called twice with your patch. I have to dig code and read intel arch manual and MP spec and Eric's original patch thread, and add debug bug to verify all. How can it be like you said "no essential difference between your patches and mine"? > > In fact, > Eric and Ingo suggested that "it should be fixed in the bootup path of > the dump kernel, not the crash kernel reboot path", which is convincing > and reasonable. > > And i find a better method can fix the problem. > It's better to set virtual wire mode for apic in init_bsp_APIC(), which > in the bootup path of dump kernel. > But now, init_bsp_APIC doesn't initialize the apic to vitual wire mode > when smp_found_config is non-zero. This may be do-able, let's see what Eric and Ingo will say. > > FYI, I'm working on this point. later i will send patches to mail list. > > Wei > > On Wed, 2016-07-20 at 10:58 +0800, Baoquan He wrote: > > Wei Jiangang reported kdump kernel always hang when "notsc" is specified > > in boot parameter. After debugging I found there's no timer interrupt > > in the current kexec/kdump kernel. This is caused by commit 522e66464467 > > ("x86/apic: Disable I/O APIC before shutdown of the local APIC"). Originally > > Eric posted below patch to make system be virtual wire mode in which 8259- > > equivalent PIC fields all interrupts and the LAPIC becomes a virtual wire. > > Like this interrupts can be delivered from PIC to CPU via the LAPIC's local > > interrupt 0 (LINTIN0). In virtual wire APIC mode is disabled while LAPIC > > is software enabled and its LINT0 and LINT1 need be programmed specifically. > > > > https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11/2.6.11-mm1/broken-out/x86_64-apic-virtwire-on-shutdown.patch > > > > But with commit 522e66464 you can see after disable_IO_APIC had setting > > virtual wire mode, lapic_shutdown disabled LAPIC again. Now virtual wire > > mode doesn't work, then it cause no timer interrupt during kdump kernel > > initialization stage until system enter into APIC mode. > > > > So people may be wondering why only kdump kernel hang, the normal kernel > > with "notsc" can still work. This is because BIOS has already built PIC mode > > or virtual wire mode while kexec/kdump kernel doesn't go through BIOS > > initialization. That is why we have to change system to be PIC mode or > > virtual wire mode before jump to kexec/kdump kernel. > > > > Then why kdump kernel didn't hang when "notsc" is not specified. This is > > because tsc_init will assign the already calibrated value to lpj_fine. > > Then kernel doesn't need to count cpu loops between jiffies with the help > > of timer interrupt. So "notsc" is not victim, but a informer. > > > > In patch 1/3 disable_IO_APIC is changed to only contain code of changeing > > system to be PIC mode or virtual wire mode and is renamed as > > switch_to_legacy_irq_mode. Now only call clear_IO_APIC where IO-APIC need > > be disabled, and call switch_to_legacy_irq_mode before jump to kexe/kdump > > kernel. > > > > Patch 2/3 and 3/3 are clean up patch. > > > > Baoquan He (3): > > x86/apic/kexec: Enable legacy irq mode before jump to kexec/kdump > > kernel > > x86/apic: Clean up the names of legacy irq mode setting related > > functions > > x86/apic: Clean up the apic delivery mode macro definition > > > > arch/x86/include/asm/apic.h| 2 +- > > arch/x86/include/asm/apicdef.h | 1 - > > arch/x86/include/asm/io_apic.h | 6 +++--- > > arch/x86/kernel/apic/apic.c| 19 +++ > > arch/x86/kernel/apic/io_apic.c | 32 +--- > > arch/x86/kernel/crash.c| 2 +- > > arch/x86/kernel/machine_kexec_32.c | 15 +-- > > arch/x86/kernel/machine_kexec_64.c | 15 +-- > > arch/x86/kernel/reboot.c | 2 +- > > arch/x86/kernel/x86_init.c | 2 +- > > drivers/iommu/irq_remapping.c | 2 +- > > 11 files changed, 46 insertions(+), 52 deletions(-) > > > > > > ___ > kexec mailing list > ke..
Re: [PATCH 0/3] Enable legacy irq mode before jump to kexec/kdump kernel
On 07/20/16 at 08:32am, Thomas Gleixner wrote: > On Wed, 20 Jul 2016, b...@redhat.com wrote: > > On 07/20/16 at 03:54am, Wei, Jiangang wrote: > > > > > In fact, Eric and Ingo suggested that "it should be fixed in the bootup > > > path of the dump kernel, not the crash kernel reboot path", which is > > > convincing and reasonable. > > > > Well this patch doesn't do differently with Eric's original implemention > > in kexec/kdump path. > > By taking out clear_IO_APIC from disable_IO_APIC, the left code of > > disable_IO_APIC will only do the virtual wire setting. So for > > kexec/kdump path, code basically is the same as Eric's method. But for > > poweroff/halt/reboot, it's enough to call clear_IO_APIC to disable > > IO-APIC. > > You're completely ignoring what Jiangang said: > > "it should be fixed in the bootup path of the dump kernel, not the crash > kernel reboot path" > > and that's the right way to do it. End of story. Thanks, tglx. What I did is like reverting commit 522e6646446. But it would be great if we can change to fix it in bootup path. Thanks Baoquan
Re: [PATCH v2] x86/boot: Use EFI setup data if provided
Hi Junichi, On 03/26/19 at 02:57pm, Borislav Petkov wrote: > On Mon, Mar 25, 2019 at 11:10:01PM +, Junichi Nomura wrote: > > efi_get_rsdp_addr() and kexec_get_rsdp_addr() could be implemented > > like this (sorry about the pseudo code): > > This doesn't look like what I suggested: > > > So efi_get_rsdp_addr() needs to be refactored in such a way so that at > > least the loop towards the end gets carved out into a separate function > > - __efi_get_rsdp_addr() or so - which gets config_tables, nr_tables and > > size as arguments and finds the RSDP address in the kexec-ed kernel. > > You need to carve out the loop at the end and make it into a separate > __efi_get_rsdp_addr() function which gets the physical or the virtual > address. I guess Boris is suggesting code like below. Please correct me if I am wrong. static acpi_physical_address _efi_get_rsdp_addr(efi_config_table tbl, ...) { /* Get EFI tables from systab. */ for (i = 0; i < nr_tables; i++) { ... } return rsdp_addr; } static acpi_physical_address efi_get_rsdp_addr(void) { ... /* Get systab from boot params. */ ... /* Handle EFI bitness properly */ ... return _efi_get_rsdp_addr(); } static acpi_physical_address kexec_get_rsdp_addr(void) { if (!is_kexec_booted) return 0; efi_get_setup_data_addr(); ... /* Handle EFI bitness properly */ ... return _efi_get_rsdp_addr(); } acpi_physical_address get_rsdp_addr(void) { acpi_physical_address pa; pa = get_acpi_rsdp(); if (!pa) pa = boot_params->acpi_rsdp_addr; /** /*I think here we should check if it's kexec booted firstly. * Skip it if not kexec. this can avoid the wrong kexec virt * addr parsing./ if (!pa) pa = kexec_get_rdsp_addr(); <--- new function if (!pa) pa = efi_get_rsdp_addr(); if (!pa) pa = bios_get_rsdp_addr(); return pa; }
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 12/11/20 at 04:16pm, Rahul Gopakumar wrote: > Hi Baoquan, > > We re-evaluated your last patch and it seems to be fixing the > initial performance bug reported. During our previous testing, > we did not apply the patch rightly hence it was reporting > some issues. > > Here is the dmesg log confirming no delay in the draft patch. > > Vanilla (5.10 rc3) > -- > > [0.024011] On node 2 totalpages: 89391104 > [0.024012] Normal zone: 1445888 pages used for memmap > [0.024012] Normal zone: 89391104 pages, LIFO batch:63 > [2.054646] ACPI: PM-Timer IO Port: 0x448 --> 2 secs delay > > Patch > -- > > [0.024166] On node 2 totalpages: 89391104 > [0.024167] Normal zone: 1445888 pages used for memmap > [0.024167] Normal zone: 89391104 pages, LIFO batch:63 > [0.026694] ACPI: PM-Timer IO Port: 0x448 --> No delay > > Attached dmesg logs. Let me know if anything is needed from our end. I posted formal patchset to fix this issue. The patch 1 is doing the fix, and almost the same as the draft v2 patch I attached in this thread. Please feel free to help test and add your Tested-by: tag in the patch thread if possible. > > > > From: Rahul Gopakumar > Sent: 24 November 2020 8:33 PM > To: b...@redhat.com > Cc: linux...@kvack.org ; linux-kernel@vger.kernel.org > ; a...@linux-foundation.org > ; natechancel...@gmail.com > ; ndesaulni...@google.com > ; clang-built-li...@googlegroups.com > ; rost...@goodmis.org > ; Rajender M ; Yiu Cho Lau > ; Peter Jonasson ; Venkatesh > Rajaram > Subject: Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel > > Hi Baoquan, > > We applied the new patch to 5.10 rc3 and tested it. We are still > observing the same page corruption issue which we saw with the > old patch. This is causing 3 secs delay in boot time. > > Attached dmesg log from the new patch and also from vanilla > 5.10 rc3 kernel. > > There are multiple lines like below in the dmesg log of the > new patch. > > "BUG: Bad page state in process swapper pfn:ab08001" > > > From: b...@redhat.com > Sent: 22 November 2020 6:38 AM > To: Rahul Gopakumar > Cc: linux...@kvack.org; linux-kernel@vger.kernel.org; > a...@linux-foundation.org; natechancel...@gmail.com; ndesaulni...@google.com; > clang-built-li...@googlegroups.com; rost...@goodmis.org; Rajender M; Yiu Cho > Lau; Peter Jonasson; Venkatesh Rajaram > Subject: Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel > > On 11/20/20 at 03:11am, Rahul Gopakumar wrote: > > Hi Baoquan, > > > > To which commit should we apply the draft patch. We tried applying > > the patch to the commit 3e4fb4346c781068610d03c12b16c0cfb0fd24a3 > > (the one we used for applying the previous patch) but it fails. > > I tested on 5.10-rc3+. You can append below change to the old patch in > your testing kernel. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index fa6076e1a840..5e5b74e88d69 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -448,6 +448,8 @@ defer_init(int nid, unsigned long pfn, unsigned long > end_pfn) > if (end_pfn < pgdat_end_pfn(NODE_DATA(nid))) > return false; > > + if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX) > + return true; > /* > * We start only with one section of pages, more pages are added as > * needed until the rest of deferred pages are initialized.
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 11/20/20 at 03:11am, Rahul Gopakumar wrote: > Hi Baoquan, > > To which commit should we apply the draft patch. We tried applying > the patch to the commit 3e4fb4346c781068610d03c12b16c0cfb0fd24a3 > (the one we used for applying the previous patch) but it fails. I tested on 5.10-rc3+. You can append below change to the old patch in your testing kernel. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fa6076e1a840..5e5b74e88d69 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -448,6 +448,8 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) if (end_pfn < pgdat_end_pfn(NODE_DATA(nid))) return false; + if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX) + return true; /* * We start only with one section of pages, more pages are added as * needed until the rest of deferred pages are initialized.
Re: v4.4-rc1: /dev/console open fails with -EIO
Hi Junichi, A little earlier Peter Hurley has posted a patch to fix this problem. https://lkml.org/lkml/2015/11/27/546 It may be found firstly on arm by Pratyush Anand . I found it too this week on Fedora 23. Anyway, it's great problem has been fixed very quickly. Just reply to let you know this. Thanks Baoquan On 12/16/15 at 06:32am, Junichi Nomura wrote: > Since kernel v4.4-rc1, kdump capture service with Fedora23 / RHEL7.2 > almost always fails on my test system which uses serial console. It > used to work fine until kernel v4.3. > > Kdump fails with an error like this: > kdump.sh[1040]: /bin/kdump.sh: line 8: /dev/console: Input/output error > > The line 8 of kdump.sh is doing this: > exec &> /dev/console > (http://pkgs.fedoraproject.org/cgit/kexec-tools.git/tree/dracut-kdump.sh) > > and the EIO is returned by this code in tty_reopen(): > if (!tty->count) > return -EIO; > > Bisection tells that commit 79c1faa4511e ("tty: Remove > tty_wait_until_sent_from_close()") is the first bad commit. > Actually, after reverting the commit, kdump capture starts working > again. > > Open of /dev/console used to return -EIO when it races with close. > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554172/comments/245) > But the commit seems widening the race window. > > Before the commit: > tty_release() > tty_lock(tty) > tty->ops->close(tty, filp) > tty_unlock(tty) > tty_wait_until_sent() > // the window starts from here > tty_lock(tty) > decrement tty->count > tty_unlock(tty) > (releasing tty if count became zero) > > After the commit > tty_release() > // the window starts from here > tty_lock(tty) > tty->ops->close(tty, filp) > tty_wait_until_sent() > decrement tty->count > tty_unlock(tty) > (releasing tty if count became zero) > > While it might be possible for user space to cope with the problem > by retrying open(), there is no clue whether and how long it should. > Also current situation makes shell scripting like the above kdump.sh > fragile for this sort of timing change. > > How about retrying tty_open in kernel instead, like the attached patch? > If !tty->count in tty_reopen() means the race has happened, that > seems reasonable. > > --- > Jun'ichi Nomura, NEC Corporation > > diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c > index bcc8e1e..070ea66 100644 > --- a/drivers/tty/tty_io.c > +++ b/drivers/tty/tty_io.c > @@ -1462,8 +1462,9 @@ static int tty_reopen(struct tty_struct *tty) > { > struct tty_driver *driver = tty->driver; > > + /* We cannot re-open tty which is being released. */ > if (!tty->count) > - return -EIO; > + return -ERESTARTSYS; > > if (driver->type == TTY_DRIVER_TYPE_PTY && > driver->subtype == PTY_TYPE_MASTER) > @@ -2087,6 +2088,11 @@ retry_open: > > if (IS_ERR(tty)) { > retval = PTR_ERR(tty); > + if (retval == -ERESTARTSYS && !signal_pending(current)) { > + tty_free_file(filp); > + schedule(); > + goto retry_open; > + } > goto err_file; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] Fix dump-capture kernel hangs with notsc
On 08/02/16 at 07:45am, Wei, Jiangang wrote: > Hi Eric, > > Thanks for your reply firstly. > > On Mon, 2016-08-01 at 12:09 -0500, Eric W. Biederman wrote: > > "Wei, Jiangang" writes: > > > > > Ping ... > > > May I ask for some community attention to this series? > > > I purpose is fixing the dump-capture kernel hangs in > > > calibrate_delay_converge() while specifying notsc. > > > > Did you not see my reply to patch 3/3? > > Yes, I read your email and made a reply > (https://lkml.org/lkml/2016/7/26/112) . I put forward several questions > in that letter, but no feedback... > > > > > The short version of my feedback is that you seem to be fixing a case > > that should not exist. So the good fix is to skip completely past > > virtual wire mode and into full apic mode as soon as possible. > > I am afraid that there are some disagreements between us. > > 1) The case that dump-capture kernel boot up with the disabled APIC is > very real, and the bug can be reproduced 100%. I want to emphasize that > there is no guarantee of the interrupt mode of APIC and status of local > APIC, Especially for the dump-capture kernel that won't through the BIOS > phrase. That's why I do more check in init_bsp_APIC(), not only depends > on the MP tables which be generated before the first kernel boots up. > > Make a point here, The BIOS must disable interrupts to all processors > and set the APICs to the system initial state before giving control to > the operating system. That means APICs won't be reset to initial state > without BIOS phrase. > > 2) Your proposal (switch into full apic mode as soon as possible) seems > to contradict the Intel Spec, "An MP operating system is booted under > either one of the two PC/AT-compatible modes. Later the operating system > switches to Symmetric I/O Mode **as it enters multiprocessor mode**." > And in other words, the BSP should be in PIC mode or Virtual wire mode > in startup stage. Well, Eric has clearly told hardware-reduced ACPI platform doesn't have legacy mode irq. It only has APIC mode. The quotation from MP spec is very old. I check code and think now you should investigate the current implementation, see if APIC mode can be enabled as soon as possible. Though it can't, detailed explanation need be given to convince people. > > 3) The apic initialization codes maybe need a overhaul, but it goes out > the scope of this patch. I focus on fixing kdump failure with notsc. And > the apic initialization codes has no modification for a long time and > can be regard as stable. Overhaul of it increases the chances of > hitting a bug. > If there's anything wrong with my understanding, please point out. > > Thanks, > wei > > > > For a subset of cases the code already supports that. > > > > Eric > > > > > > >
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 10/09/20 at 01:15pm, Rahul Gopakumar wrote: > As part of VMware's performance regression testing for Linux Kernel > upstream releases, we identified boot time increase when comparing > Linux 5.8 kernel against Linux 5.7 kernel. Increase in boot time is > noticeable on VM with a **large amount of memory**. > > In our test cases, it's noticeable with memory 1TB and more, whereas > there was no major difference noticed in testcases with <1TB. > > On bisecting between 5.7 and 5.8, we found the following commit from > “Baoquan He” to be the cause of boot time increase in big VM test cases. > > - > > commit 73a6e474cb376921a311786652782155eac2fdf0 > Author: Baoquan He > Date: Wed Jun 3 15:57:55 2020 -0700 > > mm: memmap_init: iterate over memblock regions rather that check each PFN > > When called during boot the memmap_init_zone() function checks if each PFN > is valid and actually belongs to the node being initialized using > early_pfn_valid() and early_pfn_in_nid(). > > Each such check may cost up to O(log(n)) where n is the number of memory > banks, so for large amount of memory overall time spent in early_pfn*() > becomes substantial. > > - > > For boot time test, we used RHEL 8.1 as the guest OS. > VM config is 84 vcpu and 1TB vRAM. > > Here are the actual performance numbers. > > 5.7 GA - 18.17 secs > Baoquan's commit - 21.6 secs (-16% increase in time) > > From dmesg logs, we can see significant time delay around memmap. > > Refer below logs. > > Good commit > > [0.033176] Normal zone: 1445888 pages used for memmap > [0.033176] Normal zone: 89391104 pages, LIFO batch:63 > [0.035851] ACPI: PM-Timer IO Port: 0x448 > > Problem commit > > [0.026874] Normal zone: 1445888 pages used for memmap > [0.026875] Normal zone: 89391104 pages, LIFO batch:63 > [2.028450] ACPI: PM-Timer IO Port: 0x448 Could you add memblock=debug to kernel cmdline and paste the boot logs of system w and w/o the commit? > > We did some analysis, and it looks like with the problem commit it's > not deferring the memory initialization to a later stage and it's > initializing the huge chunk of memory in serial - during the boot-up > time. Whereas with the good commit, it was able to defer the > initialization of the memory when it could be done in parallel. > > > Rahul Gopakumar > Performance Engineering > VMware, Inc. >
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 10/12/20 at 05:21pm, Rahul Gopakumar wrote: > Hi Baoquan, > > Attached collected dmesg logs for with and without > commit after adding memblock=debug to kernel cmdline. Thanks, I have got the root cause, will make a patch for your testing soon.
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
Hi Rahul, On 10/12/20 at 05:21pm, Rahul Gopakumar wrote: > Hi Baoquan, > > Attached collected dmesg logs for with and without > commit after adding memblock=debug to kernel cmdline. Can you test below draft patch and see if it works for you? >From a2ea6caef3c73ad9efb2dd2b48039065fe430bb2 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 13 Oct 2020 20:05:30 +0800 Subject: [PATCH] mm: make memmap defer init only take effect per zone Deferred struct page init is designed to work per zone. However since commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN"), the handling is mistakenly done in all memory ranges inside one zone. Especially in those unmovable zones of multiple nodes, memblock reservation split them into many memory ranges. This makes initialized struct page more than expected in early stage, then increases much boot time. Let's fix it to make the memmap defer init handled in zone wide, but not in memory range of one zone. Signed-off-by: Baoquan He --- arch/ia64/mm/init.c | 4 ++-- include/linux/mm.h | 5 +++-- mm/memory_hotplug.c | 2 +- mm/page_alloc.c | 6 +++--- 4 files changed, 9 insertions(+), 8 deletions(-) diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index ef12e097f318..27ca549ff47e 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg) if (map_start < map_end) memmap_init_zone((unsigned long)(map_end - map_start), -args->nid, args->zone, page_to_pfn(map_start), +args->nid, args->zone, page_to_pfn(map_start), page_to_pfn(map_end), MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); return 0; } @@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long zone, unsigned long start_pfn) { if (!vmem_map) { - memmap_init_zone(size, nid, zone, start_pfn, + memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); } else { struct page *start; diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe70aaf..5f9fc61d5be2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2439,8 +2439,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn, #endif extern void set_dma_reserve(unsigned long new_dma_reserve); -extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, - enum meminit_context, struct vmem_altmap *, int migratetype); +extern void memmap_init_zone(unsigned long, int, unsigned long, + unsigned long, unsigned long, enum meminit_context, + struct vmem_altmap *, int migratetype); extern void setup_per_zone_wmarks(void); extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b44d4c7ba73b..f9a37e6abc1c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -732,7 +732,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, * expects the zone spans the pfn range. All the pages in the range * are reserved so nobody should be touching them so we should be safe */ - memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, + memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0, MEMINIT_HOTPLUG, altmap, migratetype); set_zone_contiguous(zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2ebf9ddafa3a..e8b19fdd18ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6044,7 +6044,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn) * zone stats (e.g., nr_isolate_pageblock) are touched. */ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, - unsigned long start_pfn, + unsigned long start_pfn, unsigned long zone_end_pfn, enum meminit_context context, struct vmem_altmap *altmap, int migratetype) { @@ -6080,7 +6080,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (context == MEMINIT_EARLY) { if (overlap_memmap_init(zone, &pfn)) continue; - if (defer_init(nid, pfn, end_pfn)) + if (defer_init(nid, pfn, zone_end_pfn)) break; } @@ -6194,7 +6194,7 @@ void __meminit __weak memmap_init(unsigned long size, int nid, if (end_pfn > start_pfn) { size = end_pfn - start_pfn; - memmap_init_zone(size, nid, zone, start_pfn, + memmap_init_zone(size, nid, zone, start_pfn, range_end_pfn,
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
gt; int nid, > > if (end_pfn > start_pfn) { > size = end_pfn - start_pfn; > - memmap_init_zone(size, nid, zone, start_pfn, > + memmap_init_zone(size, nid, zone, start_pfn, > range_end_pfn, > MEMINIT_EARLY, NULL); > } > } > > > > > We have attached default dmesg logs and also dmesg logs collected with > memblock=debug kernel cmdline for both vanilla and patched kernels. Let me > know if you need more info. > > > > From: b...@redhat.com > Sent: 13 October 2020 6:47 PM > To: Rahul Gopakumar > Cc: linux...@kvack.org ; linux-kernel@vger.kernel.org > ; a...@linux-foundation.org > ; natechancel...@gmail.com > ; ndesaulni...@google.com > ; clang-built-li...@googlegroups.com > ; rost...@goodmis.org > ; Rajender M ; Yiu Cho Lau > ; Peter Jonasson ; Venkatesh > Rajaram > Subject: Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel > > Hi Rahul, > > On 10/12/20 at 05:21pm, Rahul Gopakumar wrote: > > Hi Baoquan, > > > > Attached collected dmesg logs for with and without > > commit after adding memblock=debug to kernel cmdline. > > Can you test below draft patch and see if it works for you? > > From a2ea6caef3c73ad9efb2dd2b48039065fe430bb2 Mon Sep 17 00:00:00 2001 > From: Baoquan He > Date: Tue, 13 Oct 2020 20:05:30 +0800 > Subject: [PATCH] mm: make memmap defer init only take effect per zone > > Deferred struct page init is designed to work per zone. However since > commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions > rather that check each PFN"), the handling is mistakenly done in all memory > ranges inside one zone. Especially in those unmovable zones of multiple nodes, > memblock reservation split them into many memory ranges. This makes > initialized struct page more than expected in early stage, then increases > much boot time. > > Let's fix it to make the memmap defer init handled in zone wide, but not in > memory range of one zone. > > Signed-off-by: Baoquan He > --- > arch/ia64/mm/init.c | 4 ++-- > include/linux/mm.h | 5 +++-- > mm/memory_hotplug.c | 2 +- > mm/page_alloc.c | 6 +++--- > 4 files changed, 9 insertions(+), 8 deletions(-) > > diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c > index ef12e097f318..27ca549ff47e 100644 > --- a/arch/ia64/mm/init.c > +++ b/arch/ia64/mm/init.c > @@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg) > > if (map_start < map_end) > memmap_init_zone((unsigned long)(map_end - map_start), > - args->nid, args->zone, > page_to_pfn(map_start), > + args->nid, args->zone, > page_to_pfn(map_start), page_to_pfn(map_end), > MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); > return 0; > } > @@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long > zone, > unsigned long start_pfn) > { > if (!vmem_map) { > - memmap_init_zone(size, nid, zone, start_pfn, > + memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size, > MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); > } else { > struct page *start; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ef360fe70aaf..5f9fc61d5be2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2439,8 +2439,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long > pfn, > #endif > > extern void set_dma_reserve(unsigned long new_dma_reserve); > -extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned > long, > - enum meminit_context, struct vmem_altmap *, int migratetype); > +extern void memmap_init_zone(unsigned long, int, unsigned long, > + unsigned long, unsigned long, enum meminit_context, > + struct vmem_altmap *, int migratetype); > extern void setup_per_zone_wmarks(void); > extern int __meminit init_per_zone_wmark_min(void); > extern void mem_init(void); > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index b44d4c7ba73b..f9a37e6abc1c 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -732,7 +732,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, > unsigned long start_pfn, > * expects the zone spans the pfn range. All the pages in the range > * are reserved so nobody should be touching them so we should be
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
Hi Rahul, On 10/20/20 at 03:26pm, Rahul Gopakumar wrote: > >> Here, do you mean it even cost more time with the patch applied? > > Yes, we ran it multiple times and it looks like there is a > very minor increase with the patch. > .. > On 10/20/20 at 01:45pm, Rahul Gopakumar wrote: > > Hi Baoquan, > > > > We had some trouble applying the patch to problem commit and the latest > > upstream commit. Steven (CC'ed) helped us by providing the updated draft > > patch. We applied it on the latest commit > > (3e4fb4346c781068610d03c12b16c0cfb0fd24a3), and it doesn't look like > > improving the performance numbers. > > Thanks for your feedback. From the code, I am sure what the problem is, > but I didn't test it on system with huge memory. Forget mentioning my > draft patch is based on akpm/master branch since it's a mm issue, it > might be a little different with linus's mainline kernel, sorry for the > inconvenience. > > I will test and debug this on a server with 4T memory in our lab, and > update if any progress. > > > > > Patch on latest commit - 20.161 secs > > Vanilla latest commit - 19.50 secs > Can you tell how you measure the boot time? I checked the boot logs you attached, E.g in below two logs, I saw patch_dmesg.log even has less time during memmap init. Now I have got a machine with 1T memory for testing, but didn't see obvious time cost increase. At above, you said "Patch on latest commit - 20.161 secs", could you tell where this 20.161 secs comes from, so that I can investigate and reproduce on my system? patch_dmesg.log: [0.023126] Initmem setup node 1 [mem 0x0056-0x00aa] [0.023128] On node 1 totalpages: 89128960 [0.023129] Normal zone: 1392640 pages used for memmap [0.023130] Normal zone: 89128960 pages, LIFO batch:63 [0.023893] Initmem setup node 2 [mem 0x00ab-0x01033fff] [0.023895] On node 2 totalpages: 89391104 [0.023896] Normal zone: 1445888 pages used for memmap [0.023897] Normal zone: 89391104 pages, LIFO batch:63 [0.026744] ACPI: PM-Timer IO Port: 0x448 [0.026747] ACPI: Local APIC address 0xfee0 vanilla_dmesg.log: [0.024295] Initmem setup node 1 [mem 0x0056-0x00aa] [0.024298] On node 1 totalpages: 89128960 [0.024299] Normal zone: 1392640 pages used for memmap [0.024299] Normal zone: 89128960 pages, LIFO batch:63 [0.025289] Initmem setup node 2 [mem 0x00ab-0x01033fff] [0.025291] On node 2 totalpages: 89391104 [0.025292] Normal zone: 1445888 pages used for memmap [0.025293] Normal zone: 89391104 pages, LIFO batch:63 [2.096982] ACPI: PM-Timer IO Port: 0x448 [2.096987] ACPI: Local APIC address 0xfee0
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 11/03/20 at 12:34pm, Rahul Gopakumar wrote: > >> So, you mean with the draft patch applied, the initial performance > regression goes away, just many page corruption errors with call trace > are seen, right? > > Yes, that's right. > > >> And the performance regression is about 2sec delay in > your system? > > The delay due to this new page corruption issue is about > 3 secs. > > Here is the summary > > * Initial problem - 2 secs > * Draft patch - Fixes initial problem (recovers 2 secs) but > brings in new page corruption issue (3 secs) > > >> Could you tell how you setup vmware VM so that I can ask our QA for > help to create a vmware VM for me to test? > > * Use vSphere ESXi 6.7 or 7.0 GA. > * Create VM using vSphere Web Client and specify 1TB VM Memory. > * Install RHEL 8.1, that's the guest used in this test. Can you try the attached draft patch? > > With draft patch, you should be able to reproduce the issue. > Let me know if you need more details. >From 24d9b1fe55d79892cac3478711af216d898c7159 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 13 Oct 2020 20:05:30 +0800 Subject: [PATCH v2] mm: make memmap defer init only take effect per zone Deferred struct page init is designed to work in zone wide. However since commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN"), the handling is mistakenly done in all memory ranges inside one zone. Especially in those unmovable zones of multiple nodes, memblock allocation split them into many memory ranges. This makes initialized struct page more than expected in early stage, then increases much boot time. Let's fix it to make the memmap defer init handled in zone wide, but not in sub memor range of one zone. Signed-off-by: Baoquan He --- arch/ia64/mm/init.c | 4 ++-- include/linux/mm.h | 5 +++-- mm/memory_hotplug.c | 2 +- mm/page_alloc.c | 8 +--- 4 files changed, 11 insertions(+), 8 deletions(-) diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index ef12e097f318..27ca549ff47e 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -536,7 +536,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg) if (map_start < map_end) memmap_init_zone((unsigned long)(map_end - map_start), -args->nid, args->zone, page_to_pfn(map_start), +args->nid, args->zone, page_to_pfn(map_start), page_to_pfn(map_end), MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); return 0; } @@ -546,7 +546,7 @@ memmap_init (unsigned long size, int nid, unsigned long zone, unsigned long start_pfn) { if (!vmem_map) { - memmap_init_zone(size, nid, zone, start_pfn, + memmap_init_zone(size, nid, zone, start_pfn, start_pfn + size, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); } else { struct page *start; diff --git a/include/linux/mm.h b/include/linux/mm.h index dae8e599f6c1..f82e73fd5d61 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2439,8 +2439,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn, #endif extern void set_dma_reserve(unsigned long new_dma_reserve); -extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, - enum meminit_context, struct vmem_altmap *, int migratetype); +extern void memmap_init_zone(unsigned long, int, unsigned long, + unsigned long, unsigned long, enum meminit_context, + struct vmem_altmap *, int migratetype); extern void setup_per_zone_wmarks(void); extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b44d4c7ba73b..f9a37e6abc1c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -732,7 +732,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, * expects the zone spans the pfn range. All the pages in the range * are reserved so nobody should be touching them so we should be safe */ - memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, + memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, 0, MEMINIT_HOTPLUG, altmap, migratetype); set_zone_contiguous(zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fa6076e1a840..5e5b74e88d69 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -448,6 +448,8 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn) if (end_pfn < pgdat_end_pfn(NODE_DATA(nid))) return false; + if (NODE_DATA(nid)->first_deferred_pfn != ULONG_MAX) + return true; /* * We start only with one section of pages, more pages are added as * needed until the rest of deferred pages are initialized. @@ -6044,7 +6046,7 @@ overlap_memmap_init(unsigned long zone, unsigned l
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 11/02/20 at 02:15pm, Rahul Gopakumar wrote: > Hi Baoquan, > > There could still be some memory initialization problem with > the draft patch. I see a lot of page corruption errors. > > BUG: Bad page state in process swapper pfn:ab0803c > > Here is the call trace > > [0.262826] dump_stack+0x57/0x6a > [0.262827] bad_page.cold.119+0x63/0x93 > [0.262828] __free_pages_ok+0x31f/0x330 > [0.262829] memblock_free_all+0x153/0x1bf > [0.262830] mem_init+0x23/0x1f2 > [0.262831] start_kernel+0x299/0x57a > [0.262832] secondary_startup_64_no_verify+0xb8/0xbb > > I don't see this in dmesg log with vanilla kernel. > > It looks like the overhead due to this initialization problem > is around 3 secs. > > [0.262831] start_kernel+0x299/0x57a > [0.262832] secondary_startup_64_no_verify+0xb8/0xbb > [3.758185] Memory: 3374072K/1073740756K available (12297K kernel code, > 5778Krwdata, 4376K rodata, 2352K init, 6480K bss, 16999716K reserved, 0K > cma-reserved) > > But the draft patch is fixing the initial problem > reported around 2 secs (log snippet below) hence the total > delay of 1 sec. > > [0.024752] Normal zone: 1445888 pages used for memmap > [0.024753] Normal zone: 89391104 pages, LIFO batch:63 > [0.027379] ACPI: PM-Timer IO Port: 0x448 So, you mean with the draft patch applied, the initial performance regression goes away, just many page corruption errors with call trace are seen, right? And the performance regression is about 2sec delay in your system? Could you tell how you setup vmware VM so that I can ask our QA for help to create a vmware VM for me to test? I tested the draft patch on bare metal system with more than 1T memory, didn't see the page corruption call trace, need reproduce and have a look. > > > > From: Rahul Gopakumar > Sent: 22 October 2020 10:51 PM > To: b...@redhat.com > Cc: linux...@kvack.org; linux-kernel@vger.kernel.org; > a...@linux-foundation.org; natechancel...@gmail.com; ndesaulni...@google.com; > clang-built-li...@googlegroups.com; rost...@goodmis.org; Rajender M; Yiu Cho > Lau; Peter Jonasson; Venkatesh Rajaram > Subject: Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel > > Hi Baoquan, > > >> Can you tell how you measure the boot time? > > Our test is actually boothalt, time reported by this test > includes both boot-up and shutdown time. > > >> At above, you said "Patch on latest commit - 20.161 secs", > >> could you tell where this 20.161 secs comes from, > > So this time is boot-up time + shutdown time. > > From the dmesg.log it looks like during the memmap_init > it's taking less time in the patch. Let me take a closer look to > confirm this and also to find where the 1-sec delay in the patch > run is coming from. > > > From: b...@redhat.com > Sent: 22 October 2020 9:34 AM > To: Rahul Gopakumar > Cc: linux...@kvack.org ; linux-kernel@vger.kernel.org > ; a...@linux-foundation.org > ; natechancel...@gmail.com > ; ndesaulni...@google.com > ; clang-built-li...@googlegroups.com > ; rost...@goodmis.org > ; Rajender M ; Yiu Cho Lau > ; Peter Jonasson ; Venkatesh > Rajaram > Subject: Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel > > Hi Rahul, > > On 10/20/20 at 03:26pm, Rahul Gopakumar wrote: > > >> Here, do you mean it even cost more time with the patch applied? > > > > Yes, we ran it multiple times and it looks like there is a > > very minor increase with the patch. > > > .. > > On 10/20/20 at 01:45pm, Rahul Gopakumar wrote: > > > Hi Baoquan, > > > > > > We had some trouble applying the patch to problem commit and the latest > > > upstream commit. Steven (CC'ed) helped us by providing the updated draft > > > patch. We applied it on the latest commit > > > (3e4fb4346c781068610d03c12b16c0cfb0fd24a3), and it doesn't look like > > > improving the performance numbers. > > > > Thanks for your feedback. From the code, I am sure what the problem is, > > but I didn't test it on system with huge memory. Forget mentioning my > > draft patch is based on akpm/master branch since it's a mm issue, it > > might be a little different with linus's mainline kernel, sorry for the > > inconvenience. > > > > I will test and debug this on a server with 4T memory in our lab, and > > update if any progress. > > > > > > > > Patch on latest commit - 20.161 secs > > > Vanilla latest commit - 19.50 secs &
Re: Performance regressions in "boot_time" tests in Linux 5.8 Kernel
On 11/03/20 at 12:34pm, Rahul Gopakumar wrote: > >> So, you mean with the draft patch applied, the initial performance > regression goes away, just many page corruption errors with call trace > are seen, right? > > Yes, that's right. > > >> And the performance regression is about 2sec delay in > your system? > > The delay due to this new page corruption issue is about > 3 secs. > > Here is the summary > > * Initial problem - 2 secs > * Draft patch - Fixes initial problem (recovers 2 secs) but > brings in new page corruption issue (3 secs) > > >> Could you tell how you setup vmware VM so that I can ask our QA for > help to create a vmware VM for me to test? > > * Use vSphere ESXi 6.7 or 7.0 GA. > * Create VM using vSphere Web Client and specify 1TB VM Memory. > * Install RHEL 8.1, that's the guest used in this test. OK, I see. The draft patch fix the original issue, seems some boundary of memory region is not handled correctly. Thanks for confirmation. The memory layout is important in this case. Not sure if making a VM gesut as you suggested can also create a system with below memory layout. [0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] [0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0xbfff] [0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x55] [0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x56-0xaa] [0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab-0xfc] [0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x100-0x1033fff] > > With draft patch, you should be able to reproduce the issue. > Let me know if you need more details.
Re: [PATCH v7 12/13] ACPI / init: Invoke early ACPI initialization earlier
On 07/18/17 at 02:08pm, Dou Liyang wrote: > Hi, Zheng > > At 07/18/2017 01:18 PM, Zheng, Lv wrote: > > Hi, > > > > Can the problem be fixed by invoking acpi_put_table() for mapped DMAR table? > > Invoking acpi_put_table() is my first choice. But it made the kernel > *panic* when we try to get the table again in intel_iommu_init() in > late stage. > > I am also confused that: > > There are two places where we used DMAR table in Linux: > > 1) In detect_intel_iommu() in ACPI early stage: > > ... > status = acpi_get_table(ACPI_SIG_DMAR, 0, &dmar_tbl); > > if (dmar_tbl) { > acpi_put_table(dmar_tbl); > dmar_tbl = NULL; > } > > 2) In dmar_table_init() in ACPI late stage: > > ... > status = acpi_get_table(ACPI_SIG_DMAR, 0, &dmar_tbl); > ... > > As we know, dmar_table_init() is called by intel_iommu_init() and > intel_prepare_irq_remapping(). > > When I invoked acpi_put_table() in the intel_prepare_irq_remapping() in > early stage like 1) shows, kernel will panic. That's because acpi_put_table() will make the table pointer be NULL, while dmar_table_init() will skip parse_dmar_table() calling if dmar_table_initialized is set to 1 in intel_prepare_irq_remapping(). Dmar hardware support interrupt remapping and io remapping separately. But intel_iommu_init() is called later than intel_prepare_irq_remapping(). So what if make dmar_table_init() a reentrant function? You can just have a try, but maybe not a good idea, the dmar table will be parsed twice. > > > Thanks, > > dou. > > > > Thanks > > Lv > > > > > From: Dou Liyang [mailto:douly.f...@cn.fujitsu.com] > > > Sent: Friday, July 14, 2017 1:53 PM > > > To: x...@kernel.org; linux-kernel@vger.kernel.org > > > Cc: t...@linutronix.de; mi...@kernel.org; h...@zytor.com; > > > ebied...@xmission.com; b...@redhat.com; > > > pet...@infradead.org; izumi.t...@jp.fujitsu.com; > > > tokunaga.kei...@jp.fujitsu.com; Dou Liyang > > > ; linux-a...@vger.kernel.org; Rafael J. > > > Wysocki ; Zheng, > > > Lv ; Julian Wollrath > > > Subject: [PATCH v7 12/13] ACPI / init: Invoke early ACPI initialization > > > earlier > > > > > > Linux uses acpi_early_init() to put the ACPI table management into > > > the late stage from the early stage where the mapped ACPI tables is > > > temporary and should be unmapped. > > > > > > But, now initializing interrupt delivery mode should map and parse the > > > DMAR table earlier in the early stage. This causes an ACPI error when > > > Linux reallocates the ACPI root tables. Because Linux doesn't unmapped > > > the DMAR table after using in the early stage. > > > > > > Invoke acpi_early_init() earlier before late_time_init(), Keep the DMAR > > > be mapped and parsed in late stage like before. > > > > > > Reported-by: Xiaolong Ye > > > Signed-off-by: Dou Liyang > > > Cc: linux-a...@vger.kernel.org > > > Cc: Rafael J. Wysocki > > > Cc: Zheng, Lv > > > Cc: Julian Wollrath > > > --- > > > Test in my own PC(Lenovo M4340). > > > Ask help for doing regression testing for the bug said in commit > > > c4e1acbb35e4 > > > ("ACPI / init: Invoke early ACPI initialization later"). > > > > > > init/main.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/init/main.c b/init/main.c > > > index df58a41..7a09467 100644 > > > --- a/init/main.c > > > +++ b/init/main.c > > > @@ -654,12 +654,12 @@ asmlinkage __visible void __init start_kernel(void) > > > kmemleak_init(); > > > setup_per_cpu_pageset(); > > > numa_policy_init(); > > > + acpi_early_init(); > > > if (late_time_init) > > > late_time_init(); > > > calibrate_delay(); > > > pidmap_init(); > > > anon_vma_init(); > > > - acpi_early_init(); > > > #ifdef CONFIG_X86 > > > if (efi_enabled(EFI_RUNTIME_SERVICES)) > > > efi_enter_virtual_mode(); > > > -- > > > 2.5.5 > > > > > > > > > > > > > > > >
Re: [PATCH v7 12/13] ACPI / init: Invoke early ACPI initialization earlier
On 07/31/17 at 07:20pm, Dou Liyang wrote: > Hi Baoquan, > > At 07/27/2017 02:08 PM, b...@redhat.com wrote: > > On 07/26/17 at 08:19pm, Dou Liyang wrote: > > > Hi Baoquan, > [...] > > > > > > I am looking for Thinkpad x121e (AMD E-450 APU) to test. I have tested > > > it in Thinkpad s430, It's OK. > > > > > > BTY, I am confused how does the ACPI subsystem affect PIT which > > > will be used to fast calibrate CPU frequency[2]. > > In combination with what Joey said[1] and the code, I guess, > > The acpi_enable() called in acpi_enable_system() transfers the system > into ACPI mode and write data to an I/O port. PIT works using the I/O > port. In Thinkpad x121e with ACPI mode, the counter in PIT couldn't > count right and might got an error value. > > [1] https://lkml.org/lkml/2014/3/12/3 Awesome! I guess you are possibly correct. A MUST job for entering into ACPI mode is to enable SCI and hook a handler. SCI will take over all acpi interrupt events. I googled SCI and vaguely remember someone said the I/O port r/w event also will be handled by SCI, but didn't find a clear description in ACPI spec. If that cause failure, it might happen inside pit_expect_msb() called by quick_pit_calibrate(), maybe you can try to add debug message in this place, not very sure if it be debugged. Maybe tglx or Rafael can help have a look and confirm if it's correct or not. Thanks Baoquan > > > > I checked code, and have no any idea. Add Joey Lee to list, see if he > > can help answer this. > > > > > > > > Do you have any idea? > > > > > > [1] https://lkml.org/lkml/2014/3/10/123 > > > [2] https://lkml.org/lkml/2014/3/12/3 > > > > > > > > > drivers/iommu/dmar.c| 27 +++ > > > drivers/iommu/intel-iommu.c | 2 ++ > > > drivers/iommu/intel_irq_remapping.c | 17 - > > > include/linux/dmar.h| 2 ++ > > > init/main.c | 2 +- > > > 5 files changed, 32 insertions(+), 18 deletions(-) > > > > > > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c > > > index c8b0329..e6261b7 100644 > > > --- a/drivers/iommu/dmar.c > > > +++ b/drivers/iommu/dmar.c > > > @@ -68,6 +68,8 @@ DECLARE_RWSEM(dmar_global_lock); > > > LIST_HEAD(dmar_drhd_units); > > > > > > struct acpi_table_header * __initdata dmar_tbl; > > > +struct acpi_table_header * __initdata dmar_tbl_original; > > > + > > > static int dmar_dev_scope_status = 1; > > > static unsigned long dmar_seq_ids[BITS_TO_LONGS(DMAR_UNITS_SUPPORTED)]; > > > > > > @@ -627,6 +629,7 @@ parse_dmar_table(void) > > >* fixed map. > > >*/ > > > dmar_table_detect(); > > > + dmar_tbl_original = dmar_tbl; > > > > > > /* > > >* ACPI tables may not be DMA protected by tboot, so use DMAR copy > > > @@ -811,26 +814,18 @@ int __init dmar_dev_scope_init(void) > > > > > > int __init dmar_table_init(void) > > > { > > > - static int dmar_table_initialized; > > > int ret; > > > > > > - if (dmar_table_initialized == 0) { > > > - ret = parse_dmar_table(); > > > - if (ret < 0) { > > > - if (ret != -ENODEV) > > > - pr_info("Parse DMAR table failure.\n"); > > > - } else if (list_empty(&dmar_drhd_units)) { > > > - pr_info("No DMAR devices found\n"); > > > - ret = -ENODEV; > > > - } > > > - > > > - if (ret < 0) > > > - dmar_table_initialized = ret; > > > - else > > > - dmar_table_initialized = 1; > > > + ret = parse_dmar_table(); > > > + if (ret < 0) { > > > + if (ret != -ENODEV) > > > + pr_info("Parse DMAR table failure.\n"); > > > + } else if (list_empty(&dmar_drhd_units)) { > > > + pr_info("No DMAR devices found\n"); > > > + ret = -ENODEV; > > > } > > > > > > - return dmar_table_initialized < 0 ? dmar_table_initialized : 0; > > > + return ret; > > > } > > > > > > static void warn_invalid_dmar(u64 addr, const char *message) > > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c > > > index 687f1
Re: [PATCH v7 12/13] ACPI / init: Invoke early ACPI initialization earlier
(ret != -ENODEV) > + pr_info("Parse DMAR table failure.\n"); > + } else if (list_empty(&dmar_drhd_units)) { > + pr_info("No DMAR devices found\n"); > + ret = -ENODEV; > } > > - return dmar_table_initialized < 0 ? dmar_table_initialized : 0; > + return ret; > } > > static void warn_invalid_dmar(u64 addr, const char *message) > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c > index 687f18f..90f74f4 100644 > --- a/drivers/iommu/intel-iommu.c > +++ b/drivers/iommu/intel-iommu.c > @@ -4832,6 +4832,8 @@ int __init intel_iommu_init(void) > } > > down_write(&dmar_global_lock); > + > + intel_iommu_free_dmars(); > if (dmar_table_init()) { > if (force_on) > panic("tboot: Failed to initialize DMAR table\n"); > diff --git a/drivers/iommu/intel_irq_remapping.c > b/drivers/iommu/intel_irq_remapping.c > index a5b89f6..ccaacda 100644 > --- a/drivers/iommu/intel_irq_remapping.c > +++ b/drivers/iommu/intel_irq_remapping.c > @@ -675,7 +675,7 @@ static void __init intel_cleanup_irq_remapping(void) > pr_warn("Failed to enable irq remapping. You are vulnerable to > irq-injection attacks.\n"); > } > > -static int __init intel_prepare_irq_remapping(void) > +static int __init __intel_prepare_irq_remapping(void) > { > struct dmar_drhd_unit *drhd; > struct intel_iommu *iommu; > @@ -743,6 +743,21 @@ static int __init intel_prepare_irq_remapping(void) > return -ENODEV; > } > > +static int __init intel_prepare_irq_remapping(void) > +{ > + int ret; > + > + ret = __intel_prepare_irq_remapping(); > + > + if (dmar_tbl_original) { > + acpi_put_table(dmar_tbl_original); > + dmar_tbl_original = NULL; > + dmar_tbl = NULL; > + } > + > + return ret; > +} > + > /* > * Set Posted-Interrupts capability. > */ > diff --git a/include/linux/dmar.h b/include/linux/dmar.h > index e8ffba1..987b076 100644 > --- a/include/linux/dmar.h > +++ b/include/linux/dmar.h > @@ -50,6 +50,8 @@ struct dmar_dev_scope { > > #ifdef CONFIG_DMAR_TABLE > extern struct acpi_table_header *dmar_tbl; > +extern struct acpi_table_header *dmar_tbl_original; > + > struct dmar_drhd_unit { > struct list_head list; /* list of drhd units */ > struct acpi_dmar_header *hdr; /* ACPI header */ > diff --git a/init/main.c b/init/main.c > index 52dee20..052481f 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -655,12 +655,12 @@ asmlinkage __visible void __init start_kernel(void) > kmemleak_init(); > setup_per_cpu_pageset(); > numa_policy_init(); > - acpi_early_init(); > if (late_time_init) > late_time_init(); > calibrate_delay(); > pidmap_init(); > anon_vma_init(); > + acpi_early_init(); > #ifdef CONFIG_X86 > if (efi_enabled(EFI_RUNTIME_SERVICES)) > efi_enter_virtual_mode(); > > Thanks, > dou. > > > > > > > > > > > Thanks, > > > > > > dou. > > > > > > > > Thanks > > > > Lv > > > > > > > > > From: Dou Liyang [mailto:douly.f...@cn.fujitsu.com] > > > > > Sent: Friday, July 14, 2017 1:53 PM > > > > > To: x...@kernel.org; linux-kernel@vger.kernel.org > > > > > Cc: t...@linutronix.de; mi...@kernel.org; h...@zytor.com; > > > > > ebied...@xmission.com; b...@redhat.com; > > > > > pet...@infradead.org; izumi.t...@jp.fujitsu.com; > > > > > tokunaga.kei...@jp.fujitsu.com; Dou Liyang > > > > > ; linux-a...@vger.kernel.org; Rafael J. > > > > > Wysocki ; Zheng, > > > > > Lv ; Julian Wollrath > > > > > Subject: [PATCH v7 12/13] ACPI / init: Invoke early ACPI > > > > > initialization earlier > > > > > > > > > > Linux uses acpi_early_init() to put the ACPI table management into > > > > > the late stage from the early stage where the mapped ACPI tables is > > > > > temporary and should be unmapped. > > > > > > > > > > But, now initializing interrupt delivery mode should map and parse the > > > > > DMAR table earlier in the early stage. This causes an ACPI error when > > > > > Linux reallocates the ACPI root tables. Because Linux doesn't unmapped > > > > > the DMAR table after us